
Feb 2024
www.biorxiv.org

Vulture: Cloud-enabled scalable mining of microbial reads in public scRNA-seq data

2
1. GigaScience 11 Feb 2024
  
  in GigaScience
  
  The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
  
  ** Reviewer 2 Jingzhe Jiang ** Original submission
  
  In this study, Chen et al. introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. They further applied Vulture to COVID-19, HCC, and gastric cancer human patient cohorts with public sequencing reads data and discovered cell-type specific enrichment of SARS-CoV-2, hepatitis B virus (HBV), and H. pylori positive cells. Generally speaking, this study is innovative, has good application potential, and can better assist single-cell research from the point of view of infection. I have only a few minor questions for the authors to address: 1. Background: The first appearance of H. pylori should be replaced with its full name. 2. Methods-Downstream analysis of scRNA-seq samples: Why were different tools (SCANPY/Seurat, BBKNN/Harmony) used to analyze different datasets, rather than the same tool for all datasets? 3. Cell-type enrichment of microbial UMI: the formula has a formatting error. 4. Analyses-Page 11: "The statistical test identified that SARS-CoV-2 is enriched (p-value < 0.05) in epithelial cells, neutrophils, and plasma B cells (Fig. 3d and Table. 2)". It would be better to highlight the p < 0.05 data points in other colors rather than with red squares. Why are there no p < 0.05 squares in Fig. 3e? 5. Fig. 2a and 2b: There are 8 colors in Figure 2a, but only 4 legend entries are shown. What do the four light-colored bars mean? The same applies to Fig. 2b.
2. GigaScience 11 Feb 2024
  
  in GigaScience
  
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad117), and has published the reviews under the same license. These are as follows.
  
  ** Reviewer 1 Yongxin Liu** Original submission
  
  The manuscript presented by the authors provides a useful tool for the virome, named "Vulture: Cloud-enabled scalable mining of viral reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of the "virome in public data". However, there are some issues for improvement in the manuscript. Here are the requirements for new software that is good enough to be published: Major comments: 1. The software, test data and results are required to be uploaded to GitHub for peers to use, and conda and/or docker installation modes are recommended for software with complex dependencies. We will take the Stars, Forks, and downloads on GitHub as one of the audience indicators. I found the GitHub link: https://github.com/holab-hku/Vulture. However, the readme.md shows the pipeline on the AWS cloud. If I do not have AWS, how can I run it on my own server? At present this project has only 2 stars. You need more people to take part in and take an interest in this project. 2. Software installation and a user tutorial are required in the Readme.md or Wiki on GitHub. Please provide a step-by-step protocol to deploy it on a laptop or server. 3. A video of software download, installation, operation, and result display is required, recorded on a computer or server without any related software installed, to make sure that any new user can perform the whole process according to the tutorial. 4. The software is required to be posted on Twitter and other social media; you can contact @iMetaScience, @microbe_article etc. to get help with tweets or retweets. The number of Retweets, Likes and Views is one of the audience indicators. 5. Chinese speakers form the largest single-language scientific community. Provide a Chinese tutorial and video presentation of the software, and contact the meta-genome Official account for help with promotion. The number of readers, shares and favorites is also one of the audience indicators. 6. According to feedback from users all over the world, the authors should continuously maintain and optimize the method to ensure its availability, ease of use and advancement. 7. The software name should be unique, which makes it convenient to count the real users through all available resources (such as QIIME, ImageGP, and EasyAmplicon). However, the name Vulture is unacceptable, due to the millions of hits in Google Scholar. 8. The figures in your paper are diverse. However, I cannot find enough visualization functions in your pipeline. Building a pipeline that integrates software is easy; a specific and diverse visualization plan is difficult. All authors want their analysis results to be ready to publish. 9. Why focus only on viruses? Can this pipeline be extended to the whole microbiome, which would be of more interest and give an overview of the microbes?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.13.528411v1
Jan 2024
www.biorxiv.org

A high-quality pseudo-phased genome for Melaleuca quinquenervia shows allelic diversity of NLR-type resistance genes

2
1. GigaScience 22 Jan 2024
  
  in GigaScience
  
  Competing Interest Statement: The authors have declared no competing interest.
  
  Reviewer 2--Julia Voelker
  
  The manuscript about NLR-type resistance genes in two haplotypes of Melaleuca quinquenervia is a relevant contribution to the research of Myrtaceae genomes and other long-lived trees. The methods are well described and should be reproducible with the available information and raw data, provided the authors mentioned all non-default settings in the method section. The FindPlantNLRs pipeline seems to be well documented on github.
  
  I believe that this manuscript is ready for publication after some small changes. Page and line numbers in the comments below refer to the PDF document: 1. The quality of some figures is not good (even upon download and zoom into the plot) and should be improved to higher resolution for publication. Especially in figure 3, all labels are too pixelated and hard to read. I would also recommend an increase in text size for this figure. In Figure 6 D & E, the authors should consider using consistent text sizes on the axes, and even though the quality is acceptable, a higher resolution of the labels would still be better.
  
  p. 10, Table 2: Although it is a standard statistic for genome assemblies, it would be helpful for some readers to specify what N50 and L50 are.
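  As an illustration of the two statistics raised for Table 2, the following is a minimal sketch, not taken from the manuscript or from the FindPlantNLRs pipeline. It assumes the common definitions: with contigs sorted from longest to shortest, N50 is the length of the contig at which the cumulative length first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The function name `n50_l50` and the example lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig or scaffold lengths.

    Assumed definitions: sort lengths from longest to shortest and
    accumulate them; N50 is the length of the contig at which the running
    sum first reaches half of the total, L50 is how many contigs that took.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    n50, l50 = n50_l50([80, 70, 50, 40, 30, 20])
    print(f"N50 = {n50}, L50 = {l50}")
```

  A larger N50 together with a smaller L50 indicates a more contiguous assembly, which is why the two values are usually reported side by side.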
  
  p. 19, line 436: I believe the authors are referring to the wrong figure number.
  
  Below are some additional comments regarding typos or other language issues. While the text is generally well written, I would appreciate commas in certain sentences to improve readability, and I think that some nouns are missing articles. I hope the authors will read through their text again and add articles where required; I won't point them out individually.
  
  p. 4, line 33: wide range of
  p. 7, line 130: 'a' instead of 8?
  p. 8, line 177: genome
  p. 12, line 250: chromosome 2, add comma before 'while' in next line
  p. 12, line 253: on all other chromosomes?
  p. 13, line 271: to occur?
  p. 16, line 347: remove 'and'
  p. 17, line 382, 384: orthologs?
  p. 20, line 469: 'lead to the triggering of defence response': rephrase to make sense with the previous half of the sentence; also, defence response should have an article
  p. 20, line 489/490: missing word?
2. GigaScience 22 Jan 2024
  
  in GigaScience
  
  Background: The coastal wetland tree species Melaleuca quinquenervia (Cav.) S.T.Blake (Myrtaceae), commonly named the broad-leaved paperbark, is a foundation species in eastern Australia, Indonesia, Papua New Guinea, and New Caledonia. The species has been widely grown as an ornamental, becoming invasive in areas such as Florida in the United States. Long-lived trees must respond to a wide range pests and pathogens throughout their lifespan, and immune receptors encoded by the nucleotide-binding domain and leucine-rich repeat containing (NLR) gene family play a key role in plant stress responses. Expansion of this gene family is driven largely by tandem duplication, resulting in a clustering arrangement on chromosomes. Due to this clustering and their highly repetitive domain structure, comprehensive annotation of NLR encoding genes within genomes has been difficult. Additionally, as many genomes are still presented in their haploid, collapsed state, the full allelic diversity of the NLR gene family has not been widely published for outcrossing tree species. Results: We assembled a chromosome-level pseudo-phased genome for M. quinquenervia and describe the full allelic diversity of plant NLRs using the novel FindPlantNLRs pipeline. Analysis reveals variation in the number of NLR genes on each haplotype, differences in clusters and in the types and numbers of novel integrated domains. Conclusions: We anticipate that the high quality of the genome for M. quinquenervia will provide a new framework for functional and evolutionary studies into this important tree species. Our results indicate a likely role for maintenance of NLR allelic diversity to enable response to environmental stress, and we suggest that this allelic diversity may be even more important for long-lived plants.
  
  Reviewer 1– Andrew Read – University of Minnesota
  
  In the manuscript, A high-quality pseudo-phased genome for Melaleuca quinquenervia shows allelic diversity of NLR-type resistance genes, the authors assemble and analyze a phased genome of a long-lived tree species. In addition to providing a phased genomic resource for an important species, the authors analyze and compare the NLR gene complement in each of the two diploid genomes. I was surprised by the level of diversity of NLR genes in the two copies of the genome (this may be due to my biases based on working in highly homozygous species). This level of within-individual diversity has been largely overlooked by researchers owing to the difficulties of sequencing, assembly, and NLR identification. To address NLR identification, the authors publish a very nice pipeline that combines available tools into a framework that makes a lot of sense to me and will be valuable to anyone doing NLR gene work on new or existing genome assemblies. My main concern comes from not knowing how sequencing gaps and NLRs correlate across the two diploid genomes. Other than this, I think it’s a very nice paper that adds to the growing catalog of NLR gene diversity by tackling the challenge of NLRs in a heterozygous genome.
  
  Many of the authors’ interesting observations are based on comparisons of NLRs on the two haploid genomes; however, some things are not clear to me: 1. Do any predicted NLR-genes overlap gaps in the alternative haploid genome? 2. If there is a predicted NLR-gene in one haploid genome and not the alternative genome, what is at the locus? Is it a structural variant indicating insertion/deletion of the NLR, or is there ‘NLR-like’ sequence there that just didn’t pass the pipeline filters, indicating an NLR fossil (or similar)? To me this is an important distinction. 3. How many of the NLR-genes on the two haploid genomes cluster 1:1 with their homolog on the alternative haploid genome? I’m particularly interested in the 15 ‘mismatched’ N-term-NBARC examples. It would be nice to know if these have partners in the alternative haploid genome, and if the partner has the same mismatch (if not, it would support the proposed domain swapping story). I believe each of these concerns will require whole genome alignment of the two haploid genomes.
  
  Additional comments (by line where indicated): The authors introduce the idea that M. quinquenervia is invasive in Florida, but this thread is never followed up in the discussion, which makes it feel a bit awkward. It would help if the authors clarified how the genome could help with management in native and invasive ranges.
  
  Could the authors add some context for why ONT data was included and how it was used?
  
  It would be helpful if the authors provided a weblink to the iTOL tree
  
  164-166 – The observation of inversions potentially caused by assembly errors is nice!
  
  206 – add reference: Bayer PE, Edwards D, Batley J (2018) Bias in resistance gene prediction due to repeat masking. Nat Plants 4: 762–765. pmid:30287950
  
  240-246 – I’m not sure about excluding these incomplete NLRs – it would be interesting and potentially informative to see where they cluster (do they cluster with an NLR from the alternative haplotype? If so, it may indicate truncation of one copy, etc.) – however, if the authors wish to remove these at this step, I think they can add a statement like “we were interested in full-length NLRs, the filtered incomplete NLRs may represent….”
  
  429-430 – The criteria used to define clusters is described in the methods, can you confirm (and mention) that this is the same as used in the analyses you’re comparing to for E. grandis, rice, and Arabidopsis.
  
  435-437 – I’m interested to know if the four heterogenous clusters contain any of the N-term domain-swapped NLRs
  
  479-480 – The zf-BED domain is also present in rice NLRs – include citation for Xa1/Xo1
  
  523-524 – can you specify which base-call model was used on the ONT data?
  
  I’m curious about the presence/absence of IDs in the analyzed NLRs and would be very interested to know if the authors observe syntenic homologs across the two haploid genomes with ID presence/absence polymorphisms or with different IDs.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.04.27.538497v1
gigabytejournal.com

Genome assembly and annotation of the king ratsnake, Elaphe carinata

1
1. GigaScience 11 Jan 2024
  
  in Public
  
  Raw sequencing data is also in the SRA under bioproject PRJNA955401,
  
  Nanopublication: RAOk_Yih3v "Organism of Elaphe carinata (species) - observed nucleotide sequence - SRX20564100" https://w3id.org/np/RAOk_Yih3v2q9s4LMZsy1v-qEhZ5ZGceChnl5h-godB2M
  
  nanopublication

Tags

nanopublication

Annotators

GigaScience

URL

gigabytejournal.com/articles/101
www.biorxiv.org

Finding the LMA needle in the wheat proteome haystack

2
1. GigaScience 02 Jan 2024
  
  in GigaScience
  
  Late maturity alpha-amylase (LMA) is a wheat genetic defect causing the synthesis of high isoelectric point (pI) alpha-amylase in the aleurone as a result of a temperature shock during mid-grain development or prolonged cold throughout grain development leading to an unacceptable low falling numbers (FN) at harvest or during storage. High pI alpha-amylase is normally not synthesized until after maturity in seeds when they may sprout in response to rain or germinate following sowing the next season’s crop. Whilst the physiology is well understood, the biochemical mechanisms involved in grain LMA response remain unclear. We have employed high-throughput proteomics to analyse thousands of wheat flours displaying a range of LMA values. We have applied an array of statistical analyses to select LMA-responsive biomarkers and we have mined them using a suite of tools applicable to wheat proteins. To our knowledge, this is not only the first proteomics study tackling the wheat LMA issue, but also the largest plant-based proteomics study published to date. Logistics, technicalities, requirements, and bottlenecks of such an ambitious large-scale high-throughput proteomics experiment along with the challenges associated with big data analyses are discussed. We observed that stored LMA-affected grains activated their primary metabolisms such as glycolysis and gluconeogenesis, TCA cycle, along with DNA- and RNA binding mechanisms, as well as protein translation. This logically transitioned to protein folding activities driven by chaperones and protein disulfide isomerase, as well as protein assembly via dimerisation and complexing. The secondary metabolism was also mobilised with the up-regulation of phytohormones, chemical and defense responses. LMA further invoked cellular structures among which ribosomes, microtubules, and chromatin. Finally, and unsurprisingly, LMA expression greatly impacted grain starch and other carbohydrates with the up-regulation of alpha-gliadins and starch metabolism, whereas LMW glutenin, stachyose, sucrose, UDP-galactose and UDP-glucose were down-regulated. This work demonstrates that proteomics deserves to be part of the wheat LMA molecular toolkit and should be adopted by LMA scientists and breeders in the future. Competing Interest Statement: The authors have declared no competing interest.
  
  Reviewer 2. Luca Ermini
  
  This manuscript, which I had the pleasure of reading, is, simply put, a benchmark of five long read de novo assembly tools. Using 13 real and 72 simulated datasets, the manuscript evaluated the performance of five widely used long-read de novo assemblers: Canu, Flye, Miniasm, Raven, and Redbean.
  
  Although I find the methodological approach of the manuscript to be solid and trustworthy, I do not think the research is particularly innovative. Long-read assemblers have already been benchmarked in the scientific literature, and similar findings have been made. The authors are aware of this limitation of the study and have added a novel feature: the impact of read length on assembly quality, which in my opinion is still lacking sufficient innovation. However, the manuscript as a whole is valid and worthy of consideration. In light of this, I would like to share some suggestions I made in an effort to make the manuscript unique and more novel.
  
  Please see my comment below.
  
  1) Evaluation of the assemblies The metrics used to evaluate an assembly are frequently a murky subject as we are still lacking a standard language. The authors assessed the assemblies using three types of metrics: compass analysis, assembly statistics, and the Busco assessment, in addition to computational metrics like runtime and RAM usage. This is not incorrect, but I would suggest making a clear distinction between the metrics using (in addition to the computational metrics) three widely recognised metrics, or in short, the 3C criterion. The assembly metrics can be broken down into three dimensions: correctness (your compass analysis), contiguity (NG50) and completeness (the BUSCO assessment). The authors should reconsider the text using the 3C criterion; this will provide a clear, understandable, and structured way of categorising metrics. The paragraph beginning at line 197, for example, causes some confusion for the reader. The NG50 metrics evaluate assembly contiguity, whereas the number of misassemblies (considered by the authors in terms of relocation, inversion, and translocation) evaluate assembly correctness. I must admit that the misassemblies and contiguity can overlap, but I would still recommend keeping the NG50 (within contiguity) and misassemblies (within correctness) metrics separate.
  
  2) Novelty of the comparison The authors of the study had two main goals: to conduct a systematic comparison of five long-read assembly tools (Raven, Flye, Wtdbg2 or Redbean, Canu, and Miniasm) and to determine whether increased read length has a positive effect on overall assembly quality. The authors acknowledge the study's limitations and include an evaluation of the effect of read length on assembly quality as a novel feature of the manuscript (see line 70).
  
  The manuscript that described the Raven assembler (Vaser, R., Sikic, M. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332-336 (2021)) compared the same assemblers (Raven, Flye, Wtdbg2 or Redbean, Canu and Miniasm) evaluated in this manuscript plus two more (Ra and Shasta), used similar eukaryotes (A. thaliana, D. melanogaster, and Human), and reached a similar conclusion on Flye in terms of contiguity (NG50) and completeness (genome fraction), but overall there is no best assembler across all of the evaluated categories. In this manuscript, the authors increased the number of eukaryotic genomes (including S. cerevisiae, C. elegans, T. rubripes, and P. falciparum) and reached similar conclusions: there is no assembler that performs the best in all the evaluation categories, but overall Flye is the best-performing assembler. This strengthens the manuscript, but the research is not entirely novel.
  
  Given that the field of third-generation technologies is rapidly progressing toward the generation of high-quality reads (PacBio HiFi technology and ONT Q20+ chemistry are achieving accuracy of 99% and higher), the manuscript should also include a HiFi assembler benchmark. This would add novelty to the manuscript and pique the scientific community's interest. The authors have already simulated HiFi reads from S. cerevisiae, P. falciparum, C. elegans, A. thaliana, D. melanogaster, and T. rubripes, in addition to reference reads (or real reads) from S. cerevisiae (SRR18210286), P. falciparum (SRR13050273) and A. thaliana (SRR14728885).
  
  Furthermore, I am not sure what the benefit is of evaluating Canu on HiFi data instead of HiCanu, which was designed to deal with HiFi data. The authors already included some HiFi-enabled assemblers like Flye and Wtdbg2, but HiFiasm should also be considered. I would strongly advise benchmarking the HiFi assemblers to complete the study and add a level of novelty. I would like to emphasise that the manuscript is solid and that I appreciate it; however, I believe that some novelty should be added.
  
  3) C. elegans genomics The now-discontinued RSII, which had a higher error rate and a shorter average read than Sequel I or Sequel II, was used to generate the genomic data from C. elegans. I understand the authors' motivation for including it in the analysis, but the use of RSII may skew the comparisons, and I would suggest adding a few sentences to the discussion about it.
  
  4) CPU time (h) and memory usage The authors claim the benchmark evaluation included runtime and RAM usage. However, I could not find the information about runtime and RAM usage. Please provide CPU time (h) and memory usage (GB).
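  One way the CPU time (h) and peak memory (GB) requested in point 4 could be collected is sketched below. This is an assumption-laden illustration, not the procedure used in the manuscript: it relies on Python's standard `resource` module (Unix only), the helper name `measure` is hypothetical, and the command in the usage example is a placeholder rather than a real assembler invocation.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (wall_seconds, cpu_hours, peak_rss_gb).

    CPU time is user + system time of the child processes that were
    waited for during this call. ru_maxrss for RUSAGE_CHILDREN is the
    largest peak among ALL children waited for so far in this Python
    process, so measure one tool per Python process for a clean number.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    cpu_seconds = (after.ru_utime - before.ru_utime) + (
        after.ru_stime - before.ru_stime
    )
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    peak_gb = after.ru_maxrss / divisor
    return wall, cpu_seconds / 3600, peak_gb


if __name__ == "__main__":
    # Placeholder command; substitute the actual assembler call.
    wall, cpu_h, peak_gb = measure([sys.executable, "-c", "print('done')"])
    print(f"wall: {wall:.2f} s, CPU: {cpu_h:.6f} h, peak RSS: {peak_gb:.3f} GB")
```

  Reporting CPU time alongside wall-clock time matters for multithreaded assemblers, where the two can differ by roughly the number of threads in use.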
  
  Minor comments:
  
  1) Lines 64-65 "Here, we provide a comprehensive comparison on de novo assembly tools on all TGS technologies and 7 different eukaryotic genomes, to complement the study of Wick and Holt" I would modify "on all TGS technologies" as "at the present the two main TGS technologies"
  
  2) Line 163 Real reads. The term "real reads" may cause confusion for readers, leading them to believe that the authors produced the sequencing reads for the manuscript. I would use the term "ref-reads" indicating "reads from the reference genomes"
  
  3) Lines 218-219 Please provide full names (genus + species): S. cerevisiae, P. falciparum, A. thaliana, D. melanogaster, C. elegans, and T. rubripes
  
  4) Supplementary Table S4: Accession number SRR15720446 seems to belong to a sample sequenced with 1 PACBIO_SMRT (Sequel II) rather than ONT.
  
  5) Figures 2 and 3. Figures 2 and 3 give visual results of the performance of the five assemblers. I want to make a few points here: According to what I understand, the top-performing assembler is marked with a star and is plotted with a brighter colour than the others. However, this is not immediately apparent, and some readers might have trouble identifying the colour that has been highlighted. I would suggest either lessening the intensity of the other, lower-performance assemblers or giving the best assembler a graphically distinct outline. I also wonder if it would be useful to give the exact numbers as supplemental tables.
  
  Re-Review:
  
  Dear Cosma and colleagues, Thank you so much for addressing my comments in a satisfactory manner. The manuscript, in my opinion, has dramatically improved.
2. GigaScience 02 Jan 2024
  
  in GigaScience
  
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad100), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Brandon Pickett **
  
  Overall, this manuscript is well-written and understandable. There's a lot of good work here and I think the authors were thoughtful about how to compare the resulting assemblies. Scripts and models used have been made available for free via GitHub and could be mirrored on or moved to GigaDB if required. I'll include several minor comments, including some line-item edits, but the bulk of my comments will focus on a few major items.
  
  Major Comments: My primary concern here is that the comparison is outdated and doesn't address some of the most helpful questions. CLR-only assemblies are no longer state-of-the-art. There are still applications and situations where ONT (simplex, older-pore)-only assemblies are reasonable, but most projects that are serious about generating excellent assemblies as references are unlikely to take that approach.
  
  Generating assemblies for non-reference situations, especially when the sequencing is done "in the field" (e.g., using a MinION with a laptop) or by a group with insufficient funding or other access to PromethIONs and Sequel/Revios, is an exception to this for ONT-only assemblies. Further, this work assumes a person wants to generate "squashed" assemblies instead of haplotype-resolved or pseudohaplotype assemblies. To be fair, sequencing technology in the TGS space has been advancing so rapidly that it is extremely difficult to keep up, and a sequencing run is often outdated by the time analyses are finished, not to mention by the time a manuscript is written, reviewed, and published.
  
  Accordingly, in raising my concerns, I am not objecting to the analysis being published or suggesting that the work performed was poor, but I do believe clarifications and discussion are necessary to contextualize the comparison and specify what is missing.
  
  This comparison seeks to address Third-generation sequencing technologies: namely PacBio vs. ONT. However, each company offers multiple kinds of long-read sequencing, and they are not all comparable in the same way. Just as long noisy reads (PacBio CLR & ONT simplex) are a whole new generation from "NGS" short reads like from Illumina, long-accurate reads are arguably a new generation beyond noisy long reads. If this paper wants to include PacBio HiFi reads in the comparison, significant changes are necessary to make the comparison meaningful. I think it's reasonable to drop HiFi reads from this paper altogether and focus on noisy long reads since the existing comparison isn't currently set up to tell us enough about HiFi reads and including them would be an ordeal. If including HiFi, consider the following:
  
  1a. Use assemblers designed for long-accurate reads. HiCanu (i.e., Canu with --pacbio-hifi option) is already used, as is a similar approach for Flye and wtdbg2. However, Raven is not meant for HiFi data, and neither is miniasm (it could be done with the correct minimap2 settings, but Hifiasm would be better). Assemblies of HiFi data with Raven and miniasm should be removed. Sidenote: Raven can be run with --weaken (or similar) for HiFi data, but it is only experimental and the parameter has since been removed. Including Hifiasm would be necessary, and it should have been included since Hifiasm was out when this analysis was done. Similarly, including MBG (released before your analysis was done) would be appropriate. Since you'd be redoing the analyses, it would be appropriate to include other assemblers that have since been released: namely LJA. One could argue that Verkko should be included, but that opens another can of worms as a hybrid assembler (more on that later).
  
  1b. Use a read simulator that is built for HiFi reads. Badreads is not built for HiFi data (though using custom parameters to make it work for HiFi reads wasn't a bad idea at the time), and new simulators (e.g., PBSIM3, DOI: 10.1093/nargab/lqac092) have since been released that consider the multi-pass process used to generate HiFi data.
  
  1c. ONT Duplex data is likely not available for the species you've chosen as it is a very new technology. However, you should at least discuss its existence as something for readers to "keep an eye on" as something that is conceptually comparable to HiFi.
  
  1d. Use the latest & greatest HiFi data if possible and at least discuss the evolution of HiFi data. Even better would be to compare HiFi data over time, but this data may not really be available and most people won't be using older HiFi data. Simulation of older data would conceivably be possible, though. While doing so would make this paper more complete, I would argue that it isn't worth the effort at this juncture. For reference, in my observation, older data has a median read length around 10-15 kb instead of 18-22 kb.
  
  1e. Include real HiFi data for the species you are assembling. If none is available and you aren't in a position to generate it, then keep the HiFi assembler comparison on real data separate from that of the CLR/ONT assembler comparisons on real data by using real HiFi data for other species.
  
  2. Discuss in the intro and/or discussion that you are focusing on "squashed" assemblies. Without clever sample separation and/or trio-based approaches (e.g., DOI: 10.1038/nbt.4277), a single squashed haplotype is the only possible outcome for PacBio CLR and ONT-only approaches. For non-haploid genomes, other approaches (HiFi-only or hybrid approaches (e.g., HiFi + ONT or HiFi + Hi-C)) can generate pseudohaplotypes at worst and fully-resolved haplotypes at best. The latter is an objectively better option when possible, and it's important to note that this comparison wouldn't apply when planning a project with such goals. Similarly, it would probably be helpful to point out to the novice reader that this comparison doesn't apply to metagenome assembly either.
  
  3. The title suggests to the reader that we'll be shown how long reads make a difference in assembly compared to non-long read approaches. However, this is not the case, despite some mention of it near line 318. Short read assemblies are not compared here and no discussion is provided to suggest how long read-based assemblies would improve outcomes in various situations relative to short reads. Unless such a comparison and/or discussion is added, I think the title should be changed. I've included this point in the "Major Comments" section because including such a comparison would be a big overhaul, but I don't expect this to be done. The core concern is that the analysis is portrayed correctly.
  
  4. Sequencing technologies are often portrayed as static through time, but this is not accurate. This is a failing of the field generally. Part of the problem is the length of the publishing cycle (often >1yr from when a paper is written to when it's published, not to mention how long it takes to do the analysis before a paper is even written). Part of the problem is that current statistics are often cited in influential papers and then recited in more recent papers based on the influential paper despite changes having been made since that influential paper was released. Accordingly, the error rate in ONT reads has been misreported as being ~15% for many years even though their chemistry has improved over time and the machine learning models (especially for human samples) have also improved, dropping the error rate substantially. ONT has made improvements to their chemistry and changed nanopores over time and PacBio has tinkered with their polymerase and chemistry too. Accordingly, a better question for a person planning to perform an assembly would be "which assembler is best for my datatype (pacbio clr vs ont) and chemistry/etc.?" instead of just differentiating by company. Any comparison of those datatypes should at least address this as a factor in their discussion, if not directly in their analysis. I feel that this is missing from this comparison.
  
  In an ideal world, we'd have various CLR chemistries and ONT pores/etc. for each species in this analysis. That data likely doesn't exist for each of the chosen species though, and generating it would be non-trivial, especially retroactively. Using the most recent versions is a good option, but may also not exist for every species chosen. Since this analysis was started (circa Nov/Dec 2021 by my estimate based on the chosen assembler versions), ONT has released pore 10; in combination with the most recent release of Guppy, error rates <=3% are expected for a huge portion of the data. That type of data is likely to assemble very differently from R9.4, and starker differences would be expected for data older than R9.4. Even if all the data were the most recent (or from the same generation (e.g., R9.4)), library preps vary greatly, especially between UL (ultralong) libraries and non-UL libraries. Having reads >100kb, especially a large number of them, makes a big difference in assembly outcome in my observation.
  
  How does choice of assembler (and possibly different parameters) affect the assembly when UL data is included? How is that different from non-UL data? What about UL data at different percentages of the reads being considered UL? A paper focusing on long noisy reads would be much more impactful if it addresses these questions. Again, this may not be possible for this particular paper considering what's already been done and the available funding, and I think that's okay. However, these issues need to be addressed in the discussion as open questions and suggested future work.
  
  The type of CLR and ONT data also needs to be specified in this work, e.g., in a supplemental table, and if the various datasets are not from the same types, these differences need to be acknowledged. At a minimum, I think the following data points should be included: chemistry/pore information (e.g., R9.4 for ONT or P2/C5 for PacBio), basecaller (e.g., guppy vX.Y.Z), and read length distribution info (e.g., mean, st. dev., median, %>100kb), ideally a plot of the distribution in addition to summary values. I also understand that these data were generated previously by others, and this information should theoretically be available from their original publications, which are hopefully accessible via the INSDC records associated with the provided accessions. The objective here is making the information easily accessible to the readers of this paper because those could be confounding variables in the analysis.
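To make the requested read-length summary concrete, the values listed above (mean, standard deviation, median, and the percentage of reads over 100 kb) could be computed along the following lines. This is only a minimal sketch, not the authors' pipeline: the function name `summarize_read_lengths`, the `ul_threshold` parameter, and the example lengths are hypothetical, and the read lengths are assumed to have already been extracted from the sequencing files.

```python
import statistics


def summarize_read_lengths(lengths, ul_threshold=100_000):
    """Summarize a list of read lengths given in bases.

    ul_threshold is the (assumed) cutoff above which a read is
    counted as ultralong; 100 kb follows the reviewer's example.
    """
    if not lengths:
        raise ValueError("no read lengths supplied")
    n_over = sum(1 for length in lengths if length > ul_threshold)
    return {
        "n_reads": len(lengths),
        "mean": statistics.mean(lengths),
        # The sample standard deviation needs at least two reads.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "median": statistics.median(lengths),
        "pct_over_threshold": 100.0 * n_over / len(lengths),
    }


# Hypothetical lengths, for illustration only.
example_lengths = [5_000, 12_000, 18_000, 25_000, 150_000]
for key, value in summarize_read_lengths(example_lengths).items():
    print(f"{key}: {value}")
```

A table with one such row per dataset, alongside the chemistry/pore and basecaller columns, would cover the minimum information requested.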
  
  This comparison considered only a single coverage level (30x). That's not an unreasonable shortcut, but it certainly leaves a lot of room for differences between assemblers. If the objective of the paper is to help future project planners decide what assembler to use, it would be most helpful if they had an idea of what coverage they can use and still succeed. That's especially true for projects that don't have a lot of funding or aren't planning to make a near-perfect reference genome (which would likely spend the money on high coverage of multiple datatypes). It would be helpful to include some discussion about how these results may be different at much lower (e.g., 2x or 10x coverage) or at higher coverage (e.g., 50x, 70x, etc.) and/or provide some justification from another study for why including that kind of comparison would be unlikely to be worthwhile for this study, even if project planners should consider those factors when developing their budget and objectives.
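For readers unfamiliar with how coverage levels are derived, depth of coverage is the total number of sequenced bases divided by the genome size, and a lower-coverage dataset can be obtained by randomly subsampling reads. The sketch below illustrates this under simplifying assumptions: it works on read lengths only, the function names `depth_of_coverage` and `subsample_to_coverage` are hypothetical, and it is not the procedure used in the manuscript.

```python
import random


def depth_of_coverage(read_lengths, genome_size):
    """Depth of coverage = total sequenced bases / genome size."""
    return sum(read_lengths) / genome_size


def subsample_to_coverage(read_lengths, genome_size, target_coverage, seed=0):
    """Randomly keep reads until the target coverage is reached.

    Returns the lengths of the kept reads. If the input holds fewer
    bases than the target requires, all reads are returned.
    """
    rng = random.Random(seed)
    shuffled = list(read_lengths)
    rng.shuffle(shuffled)
    target_bases = target_coverage * genome_size
    kept, total = [], 0
    for length in shuffled:
        if total >= target_bases:
            break
        kept.append(length)
        total += length
    return kept


# Hypothetical dataset: 100 reads of 10 kb over a 20 kb genome (50x).
reads = [10_000] * 100
genome = 20_000
for target in (2, 10, 30):
    subset = subsample_to_coverage(reads, genome, target)
    print(target, len(subset), depth_of_coverage(subset, genome))
```

Running each assembler on subsets like these would show where assembly quality starts to degrade as coverage drops.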
  
  Figures 2 and 3 include a lot of information, and I generally like how they look and that they provide a quick overview. I believe two things are missing that will improve either the assessment or the presentation of the information, and I think one change will also improve things.
  
  6a. I think metrics from Merqury (DOI: 10.1186/s13059-020-02134-9) should be included where possible. Specifically, the k-mer completeness (recovery rate) and reference-free QV estimate (#s 1 and 3 from https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation). Generally these are meant to be done from data of the same individual. However, most of the species selected for this comparison are highly homozygous strains that should have Illumina data available, and thus having the data come from not the exact same individual will likely be okay. This can serve as another source of validation. If such a dataset is not available for 1 or more of these species, then specify in the text that it wasn't available, and thus such an evaluation wasn't possible. If it's not possible to add one or both of these metrics to the figures (2 & 3), that's fine, but having it as a separate figure would still be helpful. I find these values to be some of the most informative for the quality of an assembly.
  
  6b. It's not strictly necessary, so this might be more of a minor comment, but I found that I wanted to view individual plots for each metric. Perhaps including such plots in the supplement would help (e.g., 6 sets of plots similar to figure 4 with color based on assembler, grouping based on species, and opacity based on datatype). The specifics aren't critical, I just found it hard to get more than a very general idea from the main figures and wanted something easy to digest for each metric.
  
  6c. Using N50/NG50 for a measure of contiguity is an outdated and often misleading approach. Unfortunately, it's become such common practice that many people feel obligated to include it or use it. Instead, the auN (auNG) would be a better choice for contiguity: https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity.
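The difference between N50 and auN can be illustrated with a short calculation. As described in the linked post, auN is the area under the Nx curve, which equals the sum of squared contig lengths divided by the total assembly length. The sketch below is an illustration only: the function names and toy contig lengths are hypothetical, and dividing by a supplied genome size to obtain an auNG-style value is an assumption made by analogy with NG50.

```python
def n50(lengths):
    """Length of the contig at which half of the total assembly
    length is reached, walking from the longest contig down."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("no contig lengths supplied")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def aun(lengths, genome_size=None):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i).

    If genome_size is given, it replaces the assembly length in the
    denominator (an auNG-style value; assumed by analogy with NG50).
    """
    denominator = genome_size if genome_size is not None else sum(lengths)
    if denominator == 0:
        raise ValueError("denominator is zero")
    return sum(length * length for length in lengths) / denominator


# Toy assemblies: joining the two 50-unit contigs is a real
# contiguity gain that N50 does not register but auN does.
fragmented = [100, 50, 50]
joined = [100, 100]
print(n50(fragmented), aun(fragmented))  # 100 75.0
print(n50(joined), aun(joined))          # 100 100.0
```

Because every contig contributes to auN in proportion to its length, the metric changes smoothly with any join or break instead of jumping only when the contig at the 50% mark changes.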
  
  This paper focuses on assembly and intentionally does not consider polishing (line 176), which I think is a reasonable choice. It also does not consider scaffolding or hybrid assembly approaches (again, reasonable choices). In the case of hybrid assembly options, most weren't available when this analysis was done (short read + long read assemblers were available, but I think it's perfectly reasonable to not have included those). Given the frequency of scaffolding (especially with Hi-C data [DOIs:10.1371/journal.pcbi.1007273 & 10.1093/bioinformatics/btac808]) and the recent shift to hybrid assemblers (e.g., phasing HiFi-based string graphs using Hi-C data to get haplotype resolved diploid assemblies (albeit with some switch errors) [DOI: 10.1038/s41587-022-01261-x] or resolving HiFi-based minimizer de bruijn graphs using ONT data and parental Illumina data to get complete, T2T diploid assemblies [DOI: 10.1038/s41587-023-01662-6]), I think it would be appropriate to briefly mention these methods so the novice reader will know that this benchmark does not apply to hybrid approaches or post-assembly genome finishing. This is a minor change, but I included it in this section because it matches the general theme of ensuring the scope of this benchmark is clear.
  
  Minor Comments: 1. line 25 in the abstract. Change Redbean to wtdbg2 for consistency with the rest of the manuscript.
  
  "de novo" should be italicized. It is done correctly in some places but not in others.
  
  line 64. "all TGS technologies": I would argue that this isn't quite true. ONT Duplex isn't included here even though Duplex likely didn't exist when you did this work. Also, see the major comments concerning whether TGS should include HiFi and Duplex.
  
  Table 1. Read length distributions vary dramatically by technology and library prep. E.g., HiFi is often a very tight distribution about the mean because of size selection. Including the median in the table would be helpful, but more importantly, I would like to see read-length distribution plots in the supplement for (a) the real data used to generate the initial iteration models and (b) the real data from each species.
  
  line 166 "fair comparison". I'm not sure that a fair comparison should be the goal, but having them at the same coverage level makes them more comparable which is helpful. Maybe rephrase to indicate that keeping them at the same coverage level reduces potentially confounding variables when comparing between the real and simulated datasets.
  
  line 169. Citation 18 is used for Canu, which is appropriate but incomplete. The citation for HiCanu should also be included here: DOI: 10.1101/gr.263566.120.
  
  line 169. State that these were the most current releases of the various assemblers at the time that this analysis was started. Presumably, that was Nov/Dec 2021. Since then, Raven has gone from v1.7.0->1.8.1 and Flye has gone from v2.9->2.9.1.
  
  line 175. Table S6 is mentioned here, but S5 has not yet been mentioned. S5 is mentioned for the first time on line 196. These two supp tables' numbers should be swapped.
  
  There is inconsistent use of the Oxford comma. I noticed it is missing multiple times, e.g., lines 191, 208, 259, & 342.
  
  line 193. The comma at the end of the line (after "tools") should be removed. Alternatively, keep the comma but add a subject to the next clause to make it an independent clause (e.g., "...assembly tools, and they were computed...").
  
  line 237. The N50 of the reference is being used here. You provide accessions for the references used, but most people will not go look those up (which is reasonable). The sequences in a reference can vary greatly in their lengths, even within the same species, because which sequences are included in the reference is not standardized. Even the size difference between a homogametic and heterogametic reference can be non-trivial. Which sequences are included in the reference, and more importantly in your N50 value, can significantly change the outcome and may bias results if this is not done consistently between the included species. It would be helpful if here or somewhere (e.g., in some supplemental text or a table) the contents of these references were summarized. In addition to 1 copy of each of the expected autosomes, were any of the following included: (a) one or two sex chromosomes if applicable, (b) mitochondrial, chloroplast, or other organelle sequences, (c) alternate sequences (i.e., another copy of an allele of some sequence included elsewhere), (d) unplaced sequence from the 1st copy, (e) unplaced sequence from subsequent copies, and (f) vectors (e.g., EBV used when transforming a cell line)?
  
  Supplemental tables. Some cells are uncolored, and other cells are colored red or blue with varying shading. I didn't notice a legend or description of what the coloring and shading was supposed to mean. Please include this either with each table or at the beginning of the supplemental section that includes these tables and state that it applies to all tables #-#.
  
  Supplemental table S3. It was not clear to me that you created your own model for the hifi data (pacbio_hifi_human2022). I was really confused when I couldn't find that model in the GitHub repo for Badreads. In the caption for this table or in the text somewhere, please make it more explicit that you created this yourself instead of using an existing model.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.22.525108v1
www.biorxiv.org

Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning

2
1. GigaScience 02 Jan 2024
  
  in GigaScience
  
  Abstract. Background: Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. Results: We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Conclusions: Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it's important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model. Competing Interest Statement: The authors have declared no competing interest.
  
  **Reviewer 2. Luke Carroll **
  
  The paper applies machine learning to publicly available proteomics data sets and assesses the ability to transfer learning algorithms between projects. The primary aim of these algorithms appears to be to increase the consistency of retention time prediction for data-dependent acquisition data sets; however, this is not explicitly stated within the text. Applying machine learning to derive insight from previously performed proteomics experiments is a worthwhile exercise.
  
  The authors report ΔRT to determine fitting for their models. It would be interesting to see whether other metrics were used to assess fitting, or whether the models could be used to increase the number of identifications within sample sets, and whether this was successful. Alternatively, were there any conclusions that could be drawn about peptide structure and RT determination from these models?
  
  Project specific libraries are well known to improve results compared with publicly available databases, and the discussion on this point should be developed further through comparison of this work with other papers - particularly with advances in machine learning and neural networks in the data independent analysis field.
  
  Comparison of Q Exactive models vs Orbitraps appears to be somewhat redundant, and possibly a result of poor metadata, as Q Exactive instruments are Orbitrap mass spectrometers. A more interesting comparison to make here would be between Orbitrap and TOF instruments, though as the datasets have all been processed through MaxQuant, it is likely the vast majority were acquired on Orbitrap instruments.
  
  The paper uses ΔRT as the readout for all models tested; however, the only chromatography variable considered in testing the models is gradient length. Chromatography is also dependent on column chemistry, column dimensions, composition of buffer, use of traps, temperature etc. These are also likely to contribute to the variance observed between the PT datasets, where these variables will be consistent, and publicly available datasets. These factors are also likely to play a role in the higher uncertainty for early and late eluting peptides, where they are likely to vary most between sample sets. The metadata may not be available to compare within the data sets selected, so the authors should at minimum discuss these points.
  
  Sample preparation is likely to have similar effects. Whereas the PT datasets are generated synthetically using ideal peptides, publicly available datasets are generated from complex sample mixtures and will have increased variance due to inefficiencies of digestion, sample clean-up and matrix effects. Previous studies on variance have also described sample preparation as the largest source of variance. This needs further discussion.
  
  While the isolation windows of the m/z will lead to unobserved space, search engine settings will also apply here. From the text, it appears that the only spectra that were considered were those already identified in a search program (due to having Andromeda cut-off scores always apply). Typical settings for a database search will have a cut-off of peptide sequences of at least 7 residues, making peptide masses appearing lower than 350 m/z unlikely. There is also a significant amount of noise below 350 m/z, and this also likely contributes to poorer fitting.
  
  The authors identify differences in MSMS spectral features, however, most of these points are well known in the field. The authors should develop the discussion on the causes of the differences in fragmentation, as CID low mass drop off is expected, and the change in profile is expected with increasing activation energies. A more developed analysis could exclude precursor masses from these plots and focus solely on fragment ions generated.
  
  The authors highlight that internal fragmentation of peptides could be used as a valuable resource to implement in machine learning. There has already been some success using these fragmentation patterns for sequence identification within both top-down and bottom up proteomic searches that the authors should consider discussing. However, these data do not appear to be incorporated into the machine learning models in this paper - or at least seem not to play a significant role in prediction, and this section appears to be a bit out of place.
  
  Re-Review: The changes and additions to the discussion for the paper address the key points, and have addressed some of the limitations imposed by the availability and ability to extract certain data elements, particularly around sample preparation and LC settings. I think this strengthens their manuscript, and provides a more holistic discussion of factors in the experimental setup.
2. GigaScience 02 Jan 2024
  
  in GigaScience
  
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad096), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1: Juntao Li **
  
  This paper aimed to facilitate machine learning efforts in mass spectrometry data by conducting a systematic analysis of the potential sources of variance in public mass spectrometry repositories. This paper examined how these factors affect machine learning performance and performed a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Although the experimental content is extensive and provides promising results, some major points need to be addressed as follows:
  
  1. Please explain the rationale for using ΔRT to evaluate model performance. In addition, other evaluation metrics should be added to provide a more powerful comparison of model performance.
  
  2. The curves in Figures 6 and 8 need more explanation to help readers understand them. In addition, all figures are somewhat blurry and clearer figures should be provided.
  
  3. This paper does not provide specific implementation steps for the variance analysis. Please describe the variance analysis process in mathematical language and provide the corresponding mathematical formula.
  
  4. There are some formatting issues: Keywords and the title 'Data Description' should only have the first letter capitalized. On pages 6, 17, and 18, the font size of the article is inconsistent.
  
  5. There are some grammar issues: On pages 6 and 16, 'dataset' should be plural. On page 7, lines 9-10, the tense is not consistent.
  
  6. There are significant issues with the format of the references: capitalization of initial letters in literature titles is inconsistent, such as in [1] and [5], and some literature lacks page numbers, such as [6] and [18]. Please reorganize the references according to the format required by the journal.
  
  Re-Review:
  
  I am glad to see that the authors have revised the manuscript based on the reviewer's comments and improved its quality. However, the responses to some comments did not fully convince me. I suggest the authors further revise or explain the following issues.
  
  I agree that ΔRT is a reasonable performance measure, but I do not agree with the authors' view that 'However, as the model performance indicates metric variance, and there are no changes to the conclusions drawn from the model performance'. I suggest the authors report other classic machine learning performance metrics on the test dataset and analyze the differences.
  
  To avoid randomness caused by a single partition of the data into training and testing sets, multiple random partitions (50 or 100) are usually used, and the learner's performance is reported as the mean and variance of the performance measure across partitions. It is recommended that the authors consider this issue.
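The repeated random partitioning suggested here can be sketched as follows. This is a generic illustration and not the authors' evaluation code: the helper names `repeated_holdout` and `fit_line`, the simple least-squares line standing in for a retention time model, the mean absolute error as the performance measure, and the synthetic data are all assumptions made for the example.

```python
import random
import statistics


def fit_line(train_x, train_y):
    """Least-squares fit of y = slope * x + intercept.

    A stand-in for a real model; returns a prediction function.
    """
    mean_x = statistics.mean(train_x)
    mean_y = statistics.mean(train_y)
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    if sxx == 0:
        raise ValueError("training x values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def repeated_holdout(xs, ys, fit, n_repeats=50, test_fraction=0.2, seed=0):
    """Evaluate `fit` on many random train/test partitions.

    Returns the mean and standard deviation of the mean absolute
    error (MAE) measured on the held-out part of each partition.
    """
    if n_repeats < 2:
        raise ValueError("need at least two repeats to estimate spread")
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(1, round(test_fraction * len(indices)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(predict(xs[i]) - ys[i]) for i in test_idx]
        scores.append(statistics.mean(errors))
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic data: a linear trend plus Gaussian noise.
noise = random.Random(1)
xs = list(range(40))
ys = [2 * x + 1 + noise.gauss(0, 1) for x in xs]
mean_mae, sd_mae = repeated_holdout(xs, ys, fit_line, n_repeats=50)
print(f"MAE over 50 partitions: mean={mean_mae:.3f}, sd={sd_mae:.3f}")
```

Reporting the spread alongside the mean shows whether a difference between two models exceeds the variation introduced by the choice of partition.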
  
  The structure and references of the papers that I have seen that have been officially published in GigaScience are very different from the manuscript (the author has claimed to have organized and written according to the requirements). I am not sure if it was my mistake or the authors' mistake. I suggest the authors confirm the issue again and improve the writing.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.05.01.538996v1
www.biorxiv.org

When do longer reads matter? A benchmark of long read de novo assembly tools for eukaryotic genomes

2
1. GigaScience 02 Jan 2024
  
  in GigaScience
  
  Background: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects. Results: We benchmarked state-of-the-art long-read de novo assemblers, to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing to evaluate the assemblers. We include five commonly used long read assemblers in our benchmark: Canu, Flye, Miniasm, Raven and Redbean. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality. Conclusions: Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results show that overall Flye is the best-performing assembler, both on real and simulated data. Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome. Competing Interest Statement: The authors have declared no competing interest.
  
  **Reviewer 2: Katharina Scherf ** General comments This paper is a very thorough report on large-scale proteomics mapping of ca. 4000 wheat samples and several challenges related to sample preparation, measurement and data analysis. It is the first paper reporting such an extensive dataset and tools for analysis. Overall, I think that the authors have done in-depth work and it is also described in a way that can be understood well. The descriptions of how the authors arrived at the final workflow will also be useful to other groups attempting to do proteomics of wheat or other grains. I have only few comments for improvement. Note: line numbers would have been helpful
  
  Specific comments Abstract - Results: "LMA expression greatly impacted grain starch and other carbohydrates …" and then alpha-gliadins and LMW glutenin is mentioned. However, these are proteins and their relation to starch/carbohydrates is not clear.
  
  Introduction overall: Please harmonize the use of alpha-amylase and a-amylase; alpha-amylase is recommended, or else the Greek letter.
  
  p3, L1: "great source of protein": In terms of quantity, this is true. However, you should also include a brief statement about protein quality, which is not ideal, especially when considering gluten proteins
  
  section 2.1: Please include if all samples were grown together at the same place in one year (or not); i.e. include the information from section 3.1.1 already here.
2. GigaScience 02 Jan 2024
  
  in GigaScience
  
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad084), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1: Nobuaki Takemori **
  
  The large proteome dataset for wheat, a representative grain, presented in this manuscript is valuable not only for agriculture science but also for basic plant science, but unfortunately, the manuscript is too wordy in its description and informative. Of course, a detailed description of the experimental methods and data generation process is an important component in obtaining reproducibility, but excessive information in the main text may have the unintended effect of hindering the reader's understanding of the manuscript. The volume of the main text in this manuscript should be reduced to 1/2 or even 1/3 of the original by referring to the following suggested revisions.
  
  Title: It looks rather like the title of a review article and is not appropriate for the title of an original research paper. An abbreviation is also used, making it difficult to understand. It should be changed to a title that more specifically and pragmatically reflects the content of the paper.
  
  Materials and Methods 2.3: The sample pretreatment used in this experiment has already been described in Ref. 41, so detailed description in this text is unnecessary. Also, Figure 1, which visualizes the experimental process, is too packed with information and is difficult to read in its small font. In addition, many extraneous photographs of LC-MS instruments and other common equipment are included. Sample pretreatment should be described very briefly in the text, and only those areas where there are differences from previous reports should be mentioned. If the author wishes to describe the details of the experiment to assure reproducibility, it is recommended to describe it in the form of an experimental protocol and include it in the Supplementary Information.
  
  Materials and Methods 2.5: The 11 different paths the authors have set up for LC-MS/MS analysis are difficult to understand in text. Maybe they could be summarized in a table or visualized using a flowchart.
  
  Materials and Methods 2.6 to 2.9: It is recommended that only the essentials be described in the text and the minute details be moved to the Supplementary Information.
  
  Results 3.2.(p 26, line 11-20): The description should be moved to the introduction.
  
  Results 3.1.3-3.1.4 Too detailed and too long. Only the main points should be mentioned. It would be effective to use concise Figures where possible.
  
  Figure 6: Too much information; A, B, F, and G should be supplemental information.
  
  Figure 8: Wheat cartoon is unnecessary. The font is too small. This information should be in a Table.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.30.526229v1
Dec 2023
www.biorxiv.org

Nanopore Adaptive Sampling Enriches for Antimicrobial Resistance Genes in Microbial Communities

2
1. GigaScience 25 Dec 2023
  
  in GigaByte
  
  Editors Assessment: Antimicrobial resistance (AMR) is a global public health threat, and environmental microbial communities can act as reservoirs for resistance genes. There is a need for genomic surveillance that could provide insights into how these reservoirs change and impact public health. With that goal in mind, this study tested the ability of nanopore sequencing and adaptive sampling to enrich for AMR genes in a mock community of environmental origin. On average, adaptive sampling resulted in a target composition 4x higher than without adaptive sampling, and increased target yield in most replicates. The methods and scripts for this approach were reviewed and curated together, although the scope of this study was limited in terms of communities tested and AMR genes targeted. The authors improved their analysis by conducting an additional analysis of a diverse microbial community, demonstrating that the method is reusable and that its results are promising for developing a flexible, portable, and cost-effective AMR surveillance tool.
  
  *This evaluation refers to version 1 of the preprint *
  
  Summary
2. GigaScience 25 Dec 2023
  
  in GigaByte
  
  Abstract: Antimicrobial resistance (AMR) is a global public health threat. Environmental microbial communities act as reservoirs for AMR, containing genes associated with resistance, their precursors, and the selective pressures to encourage their persistence. Genomic surveillance could provide insight into how these reservoirs are changing and their impact on public health. The ability to enrich for AMR genomic signatures in complex microbial communities would strengthen surveillance efforts and reduce time-to-answer. Here, we test the ability of nanopore sequencing and adaptive sampling to enrich for AMR genes in a mock community of environmental origin. Our setup implemented the MinION mk1B, an NVIDIA Jetson Xavier GPU, and flongle flow cells. We observed consistent enrichment by composition when using adaptive sampling. On average, adaptive sampling resulted in a target composition that was 4x higher than a treatment without adaptive sampling. Despite a decrease in total sequencing output, the use of adaptive sampling increased target yield in most replicates.
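  A small worked example may help separate the two measures used above, enrichment by composition and target yield. This is only a sketch: the base counts are hypothetical and chosen so that the composition ratio comes out at 4x; they are not taken from the study, and the function and variable names are illustrative.

```python
def target_composition(target_bases, total_bases):
    """Fraction of all sequenced bases that map to the target genes."""
    return target_bases / total_bases


# Hypothetical runs (not data from the study): adaptive sampling yields
# fewer bases overall, but a larger share of them hit the targets.
control = {"total_bases": 1_000_000_000, "target_bases": 1_000_000}
adaptive = {"total_bases": 400_000_000, "target_bases": 1_600_000}

comp_control = target_composition(control["target_bases"], control["total_bases"])
comp_adaptive = target_composition(adaptive["target_bases"], adaptive["total_bases"])

# Enrichment by composition: ratio of the two target fractions.
enrichment_by_composition = comp_adaptive / comp_control

# Enrichment by yield: ratio of the absolute number of target bases.
enrichment_by_yield = adaptive["target_bases"] / control["target_bases"]

print(f"composition: {enrichment_by_composition:.1f}x, yield: {enrichment_by_yield:.1f}x")
```

  With these assumed numbers, total output drops by 60% while target composition rises fourfold, so the absolute target yield still increases, but by a smaller factor than the composition.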
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.103), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Ned Peel. **
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?
  
  Yes. I do not think the authors have included a specific license and assume the code will be released under a Creative Commons CC0 waiver.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No. No guidelines on how to contribute, report issues or seek support on the code.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  Yes. A list of software used, along with version numbers, can be found in "dart_methods_notebook.md"
  
  Additional Comments:
  
  The authors describe each step of the analysis well and have provided code to reproduce the analysis and figures from the manuscript.
  
  **Reviewer 2. Julian Sommer **
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  No. Not applicable to this study, since no novel software is described.
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?
  
  Not applicable to this study, since no novel software is described.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No. Not applicable to this study, since no novel software is described.
  
  Is the code executable?
  
  Unable to test. The code and software used for analysis of the data is reported in the supplement data. However, the data used in this study in the SRA biobank is not available to download at the time of this review.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Unable to test. See above.
  
  Is the documentation provided clear and user friendly?
  
  Yes. The analysis steps are clearly commented.
  
  Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?
  
  No. The code provided for the data analysis is not usable without the raw sequencing data.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  Not applicable.
  
  Additional Comments.
  
  The aim of this study was to test the ability of adaptive sampling sequencing on the Oxford Nanopore sequencer to enrich for antibiotic resistance genes in a synthetic mixture of bacterial DNA. DNA from six environmental bacterial isolates with known antibiotic resistance genes was mixed at equal mass and used for metagenomic sequencing on an Oxford Nanopore MinION MK1B, comparing adaptive sampling with standard sequencing. By analysing 10 sequencing runs using low-throughput, low-cost flongle flow cells, the authors obtained sequencing data to compare adaptive sampling and standard sequencing approaches. Using a defined composition of sequenced sample and technical and biological replicates, the method is generally suitable. From their data, the authors conclude that adaptive sequencing significantly reduces throughput and increases gene target enrichment by analysing different parameters.
  
  This result is important for the use of adaptive sampling in general, but has already been reported in numerous publications that the authors cite in their study. According to the authors, the novel aspect of this work is the environmental origin of the bacteria used to generate the synthetic mock community. However, since the approach of adaptive sampling does not change regardless of the origin of the sequenced DNA, there are no significant new insights generated in this study. Also, the synthetic mock community of six members does not resemble an environmental metagenomic sample, which has incomparably more complex species diversity with different abundances. From the data presented in this study, no conclusions can be drawn regarding the performance of adaptive sampling sequencing of environmental metagenomic samples.
  
  To improve the study, I suggest the following: Sequencing of DNA from environmental samples using nanopore sequencing without adaptive sampling and identification of antibiotic resistance genes. Subsequently, resequencing the sample using adaptive sampling based on the identified antibiotic resistance genes and comparing the results in terms of gene target enrichment as analysed in the study. This was partly suggested by the authors and should be carried out to gain new insights into the very interesting application of metagenomic sequencing for the One Health approach.
  
  Additionally, there are some inconsistencies in the manuscript. For example, lines 128–132 describe the sequencing process using different flowcells and technical replicates. However, it remains unclear how half of the channels of each flowcell were reserved for adaptive sampling sequencing, since adaptive sampling sequencing is always performed on the whole flowcell. Additionally, it is stated that each flowcell was used twice for sequencing; however, no method for reusing the flongle flowcells is described, and no protocol for this is available from Oxford Nanopore.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.06.27.546783v1
gigabytejournal.com

The genome assembly and annotation of the Chinese cobra, Naja atra

2
1. GigaScience 05 Dec 2023
  
  in Public
  
  The genome assembly and annotation of the Chinese cobra, Naja atra
  
  Nanopublication: RAyW5v4w76 "Article: The genome assembly and annotation of the Chinese cobra, Naja atra" https://w3id.org/np/RAyW5v4w76mcFJYDreFTuhc4Yu0sKwZQBccYfoB_Q-7_o
  
  nanopublicatiom
2. GigaScience 05 Dec 2023
  
  in Public
  
  Raw reads are available in the SRA via bioproject PRJNA955401. Additional data is in the GigaDB repository [25]. Reference 25: Wang J, Wu Y, Wang S. Supporting data for “The genome assembly and annotation of the Chinese cobra, Naja atra”. GigaScience Database, 2023; http://dx.doi.org/10.5524/102476.
  
  Nanopublication: RAt6pmOk9T "Organism of ?term=txid8656 - sequenced nucleotide sequence - PRJNA955401" https://w3id.org/np/RAt6pmOk9T4pCGTI5HTJ3hntFoIWRNv5zpGSNxX0JTYVk
  
  nanopublication

Tags

nanopublicatiom

nanopublication

Annotators

GigaScience

URL

gigabytejournal.com/articles/99
Nov 2023
www.biorxiv.org

A reference assembly for the legume cover crop, hairy vetch (Vicia villosa)

2
1. GigaScience 17 Nov 2023
  
  in GigaByte
  
  Editors Assessment:
  
  The hairy vetch Vicia villosa is an annual legume widely used as a cover crop due to its ability to withstand harsh winters. Here a new 2.03 Gbp reference-quality genome is presented, assembled from PacBio HiFi long-sequence reads and Hi-C scaffolding. After adding some more methodological details and a long-terminal repeat (LTR) assembly index (LAI) analysis, the assembly quality and metrics look quite convincing as a chromosome-scale assembly. This resource hopefully provides the foundation for a genetic improvement program for this important cover crop and forage species.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 17 Nov 2023
  
  in GigaByte
  
  ABSTRACT: Vicia villosa is an incompletely domesticated annual legume of the Fabaceae family native to Europe and Western Asia. V. villosa is widely used as a cover crop and as a forage due to its ability to withstand harsh winters. A reference-quality genome assembly (Vvill1.0) was prepared from low error rate long sequence reads to improve genetic-based trait selection of this species. The Vvill1.0 assembly includes seven scaffolds corresponding to the seven estimated linkage groups and comprising approximately 68% of the total genome size of 2.03 gigabase pairs (Gbp). This assembly is expected to be a useful resource for genetic improvement of this emerging cover crop species as well as to provide useful insights into plant genome evolution.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.98), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Rong Liu
  
  See reviewer comments document: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT0zODcmZmlsZT0xNTAmdHlwZT1nZW5lcmljJnZpZXc9ZmFsc2U~
  
  Reviewer 2. Haifei Hu
  
  Fuller et al. conducted an interesting work on the Vicia villosa genome study, which could be beneficial for the science community. However, there are some concerns about this work before it can be published.
  
  Introduction: The MS seems to indicate that the V. villosa genome is important for breeding, and that it is an ideal legume that can grow in winter. But the subsequent analysis and results do not address this. The authors should include additional analysis, at least in the gene annotation section, to indicate which genes are potentially associated with the improvement of genetic-based selection and the ability to grow in winter conditions. After reading the MS, it looks like it mainly focuses on the comparison of the V. villosa genome and the V. sativa genome. Please indicate why it is important to do so and provide more background on V. sativa in the introduction. Line 59: It is too sudden to start describing high heterozygosity as a remaining challenge without directly linking it to V. villosa. The authors need to include the background that V. villosa is heterozygous first, then talk about how challenging it is to generate an assembly.
  
  Methods: Line 112: Why is the estimation based on k-mer size quite different from the generated assembly size? The authors’ explanation is weak; a more in-depth explanation of these unexpected results is needed. Did you see any similar observations in other studies? Please give examples (citations). Line 121: Any reason not to use the commonly used HiFi assembler hifiasm? Lines 142-143: Did you have a file to record which genome regions you introduced the breaks in, and how this step was performed? Line 158: the unit bp should be changed to Mb for better comparison. Line 160: Here, you should use contig N50 rather than scaffold N50 to indicate the quality of the genome. And you need to compare the contig N50 with that of V. sativa.
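  To illustrate why the reviewer distinguishes contig N50 from scaffold N50, here is a minimal sketch using the standard N50 definition (the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total). The assembly layout and all lengths are hypothetical, not values from the manuscript, and gap lengths between contigs are ignored for simplicity.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical assembly: each scaffold is a list of the contigs joined
# into it (gap sizes are ignored in this sketch).
scaffolds = {
    "scaffold_1": [400, 300, 300],
    "scaffold_2": [500, 200],
    "scaffold_3": [300],
}

scaffold_lengths = [sum(contigs) for contigs in scaffolds.values()]
contig_lengths = [c for contigs in scaffolds.values() for c in contigs]

scaffold_n50 = n50(scaffold_lengths)
contig_n50 = n50(contig_lengths)

print(f"scaffold N50 = {scaffold_n50}, contig N50 = {contig_n50}")
```

  In this toy case the scaffold N50 is more than three times the contig N50, because scaffolding joins contigs without improving the underlying contiguity of the sequence itself; this is the reason contig N50 is often preferred as a quality indicator.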
  
  DATA VALIDATION AND QUALITY CONTROL Should perform BUSCO and LAI to assess the quality of the genome in the main text.
  
  4 Phylogenetic tree construction Soybean is an important legume species, and it will make this result more useful and interesting for readers. You should include the Wm82 V4 genome for this analysis. And the version of other legume species’ genomes needs to be indicated.
  
  5 Figures: Figure 3: The Hi-C alignment map shows that nearly 600 Mb of the genome cannot be scaffolded. Any reason? What is the green dot in the figure? Figure 4b: the BUSCO score of Vvill1.0 is much higher than that of V. sativa. Any reason? There is also no description of how the BUSCO analysis was performed in the main text. Figure 6, circle plot: would it be possible to rename the scaffolds as chromosomes based on the alignment between V. sativa and V. villosa?

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.28.534423v1
www.biorxiv.org

Data from Entomological Collections of Aedes (Diptera: Culicidae) in a post-epidemic area of Chikungunya, City of Kinshasa, Democratic Republic of Congo

2
1. GigaScience 17 Nov 2023
  
  in GigaByte
  
  Editors Assessment: Aedes mosquito-spread arbovirus epidemics (e.g. Chikungunya, dengue, West Nile, Yellow Fever, and Zika) are a growing threat in Africa, but a lack of vector data limits our ability to understand their propagation dynamics. This work describes the geographical distribution of Ae. aegypti and Ae. albopictus in Kinshasa, Democratic Republic of Congo between 2020 and 2022, sharing 6,943 observations under a CC0 waiver as a Darwin Core archive in the University of Kinshasa GBIF database. Review improved the metadata by adding more accurate date information, and this data can provide important information for further basic and advanced studies on the ecology and phenology of these vectors in West Africa.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 17 Nov 2023
  
  in GigaByte
  
  Abstract: Arbovirus epidemics (e.g. Chikungunya, dengue, West Nile, Yellow Fever, and Zika) are a growing threat in Africa in areas where Aedes (Ae.) aegypti and Ae. albopictus are present. The lack of complete sampling of these two vectors limits our ability to understand their propagation dynamics in areas at risk from arboviruses. Here, we describe for the first time the geographical distribution of two arbovirus vectors (Ae. aegypti and Ae. albopictus) in a chikungunya post-epidemic zone in the provincial city of Kinshasa, Democratic Republic of Congo between 2020 and 2022. In total 6,943 observations were reported using larval capture and human capture on landing methods. These data are published in the public domain as a Darwin Core archive in the Global Biodiversity Information Facility. The results of this study potentially provide important information for further basic and advanced studies on the ecology and phenology of these vectors, as well as on vector dynamics after an epidemic period. Subject Areas: Ecology, Biodiversity, Taxonomy
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.98), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Luis Acuña-Cantillo **
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  They must review the standard Darwin Core format for sampling events: https://www.gbif.org/darwin-core
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. They don't describe how the map of the study area was created, whether they used a GIS or not. Sampling points must be included on the map.
  
  They don't mention how the identification of the larval stages was carried out, or how they were differentiated from other genera of the Culicinae subfamily, such as Culex, Haemagogus, Mansonia, or Sabethes, or from other species of the genus Aedes, given that the two main species of this genus were the study's objective.
  
  Reference 5, which they mention, is only for adult identification. They should include or cite the collection protocols and describe them as much as possible so that the study can be replicated in other African countries.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  Not my area of expertise. The data could be validated with biological collection of specimens
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  The scientific names must follow a consistent nomenclature: the first time a species is mentioned, the full name Aedes aegypti is given, and thereafter Ae. aegypti. If there are two species within the same genus, the genus is written in full only the first time, and afterwards both are abbreviated as Ae. aegypti and Ae. albopictus.
  
  Bibliographic references should be cited accordingly, for example: (1-4).
  
  The names of the diseases must follow the same writing with a capital letter at the beginning or all in lower case Chikungunya or chikungunya.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  From the description of the study and the collection times, I believe that it fits better with Sampling Events. The data is well organized; however, it is suggested to review the Darwin Core template for this type of data and adjust to the corresponding model (event core). Review: https://www.gbif.org/darwin-core
  
  Additional Comments: The data paper can be published with suggestions for improvement. Congratulations, very good job!
  
  **Reviewer 2. Mary Ann Tuli **
  
  See the data audit file for more:
  
  https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00NjQmZmlsZT0xNzYmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~
  
  **Reviewer 3. Paul Taconet **
  
  Is the language of sufficient quality?
  
  Yes. Some minor changes that I recommend : "And the relative annual average humidity is 79%." may be changed to "The relative annual average humidity is 79%.". "Aedes albopictus is the most abundant species in the studied region" may be changed to "Aedes albopictus was the most abundant species in the studied region"
  
  Are all data available and do they match the descriptions in the paper?
  
  No.
  
  1/The data available are of type 'occurrence' (only in 1 file - the "occurrence" file). For a better presentation of the data, I would suggest to transform them into "sampling event" data, which is more suited to this kind of data acquired from sampling events (see https://ipt.gbif.org/manual/en/ipt/latest/sampling-event-data), while keeping the occurrence dataset. This would enable the user to quickly understand the dates and locations of the sampling events.
  
  2/ In the data, the only available date (column eventDate) is the first of January (e.g. 2021-01-01T00:00:00). This does not make it possible to separate the data into seasons (rainy and dry) as presented in table 1 of the manuscript. I strongly suggest that the authors provide the specific date for each collected mosquito in the data.
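  The date problem described in point 2/ can be checked with a short script. The following is a sketch only: the column name eventDate comes from the review, while the sample rows and the helper names are hypothetical stand-ins for the real occurrence file.

```python
import csv
import io
from collections import Counter


def event_date_summary(rows, column="eventDate"):
    """Count how often each distinct eventDate value occurs."""
    return Counter(row[column] for row in rows)


def looks_like_placeholder(counts):
    """True if every record is dated 1 January (ISO 8601 'YYYY-01-01...')."""
    return bool(counts) and all(value[5:10] == "01-01" for value in counts)


# Hypothetical excerpt standing in for the real occurrence file.
sample = io.StringIO(
    "occurrenceID,scientificName,eventDate\n"
    "1,Aedes aegypti,2021-01-01T00:00:00\n"
    "2,Aedes albopictus,2021-01-01T00:00:00\n"
    "3,Aedes albopictus,2022-01-01T00:00:00\n"
)

rows = list(csv.DictReader(sample))
counts = event_date_summary(rows)
placeholder = looks_like_placeholder(counts)

print(counts.most_common())
print("Only 1 January dates:", placeholder)
```

  When the second print reports that only 1 January dates occur, the records cannot be assigned to a rainy or dry season, which is the limitation the reviewer points out.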
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. 1/Larval collections : sampling strategy used ? 2/How many collection rounds in total ? please provide the dates of collection.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. 1/Human landing catch : was any quality control done during the collection of data (i.e. check that the collectors were at their place, etc.) ?
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  Yes. 1/ Comments for figure 1 (map):
  - "legend" should be written in English (and not in French)
  - "harvesting sites" -> entomological collection points
  - the background layer is not very appropriate. Maybe better to use an OpenStreetMap background layer
  
  2/What about ethical approval for the Human Landing Catches ? please provide the name of the institution who has approved the HLC and the approval number, if relevant
  
  3/ in the dataset, for the species scientific name, I suggest to use the names as presented in : Harbach, R.E. 2013. Mosquito Taxonomic Inventory, https://mosquito-taxonomic-inventory.myspecies.info/ . Or at least, to provide the "nameAccordingTo" column.
  
  4/ In the dataset, many columns seem totally empty. Please remove them if so.
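  For point 4/, one possible way to list the columns that carry no values is sketched below. The column names in the sample are real Darwin Core terms, but the rows are hypothetical and the helper name is illustrative; the actual dataset would be read from the published occurrence file instead.

```python
import csv
import io


def empty_columns(rows, fieldnames):
    """Return the columns whose value is blank in every row."""
    return [
        name
        for name in fieldnames
        if all(not (row.get(name) or "").strip() for row in rows)
    ]


# Hypothetical excerpt: recordedBy and habitat are never filled in.
sample = io.StringIO(
    "occurrenceID,scientificName,recordedBy,habitat\n"
    "1,Aedes aegypti,,\n"
    "2,Aedes albopictus,,\n"
)

reader = csv.DictReader(sample)
rows = list(reader)
empty = empty_columns(rows, reader.fieldnames)

print("Columns that could be dropped:", empty)
```

  Note that with zero rows every column would be reported as empty, so the check is only meaningful on a non-empty file.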
  
  Additional Comments: Thanks for this nice work and the effort put to publish your entomological data. I strongly suggest you to add the real dates of collection of the data in the GBIF dataset (see comments above).
  
  **Reviewer 4. Angeliki Martinou **
  
  Are all data available and do they match the descriptions in the paper?
  
  Yes. It will be good for the authors the first time that they cite the two species to use the full names Aedes (Stegomyia) albopictus (Skuse) Aedes (Stegomyia) aegypti (Linnaeus, 1762)
  
  In the methods section the title should be Human Landing Catches and not Human capture on landing

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.09.20.558445v1
www.biorxiv.org

Developing best practices for genotyping-by-sequencing analysis in the construction of linkage maps

3
1. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Background: Genotyping-by-Sequencing (GBS) provides affordable methods for genotyping hundreds of individuals using millions of markers. However, this challenges bioinformatic procedures that must overcome possible artifacts such as the bias generated by PCR duplicates and sequencing errors. Genotyping errors lead to data that deviate from what is expected from regular meiosis. This, in turn, leads to difficulties in grouping and ordering markers resulting in inflated and incorrect linkage maps. Therefore, genotyping errors can be easily detected by linkage map quality evaluations. Results: We developed and used the Reads2Map workflow to build linkage maps with simulated and empirical GBS data of diploid outcrossing populations. The workflows run GATK, Stacks, TASSEL, and Freebayes for SNP calling and updog, polyRAD, and SuperMASSA for genotype calling, and OneMap and GUSMap to build linkage maps. Using simulated data, we observed which genotype call software fails in identifying common errors in GBS sequencing data and proposed specific filters to better handle them. We tested whether it is possible to overcome errors in a linkage map using genotype probabilities from each software or global error rates to estimate genetic distances with an updated version of OneMap. We also evaluated the impact of segregation distortion, contaminant samples, and haplotype-based multiallelic markers in the final linkage maps. Through our evaluations, we observed that some of the approaches produce different results depending on the dataset (dataset-dependent) and others produce consistent advantageous results among them (dataset-independent). Conclusions: We set as default in the Reads2Map workflows the approaches that proved to be dataset-independent for GBS datasets according to our results. This reduces the required number of tests to identify optimal pipelines and parameters for other empirical datasets. Using Reads2Map, users can select the pipeline and parameters that best fit their data context. The Reads2MapApp shiny app provides a graphical representation of the results to facilitate their interpretation.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  Reviewer Name: Ramil Mauleon
  
  The paper titled "Developing best practices for genotyping-by-sequencing analysis using linkage maps as benchmarks" aims to present an end-to-end workflow that uses GBS genotyping datasets to generate genetic linkage maps. This is a valuable tool for geneticists intending to generate a high-confidence linkage map from a mapping population with GBS data as input. I got confused on reading the MS though: is this a workflow paper, or is it a review of the component software for each step of genetic mapping and of how parameter/use differences affect the output? If it's a review, then the choice of software reviewed is not comprehensive enough, especially for SNP calling and linkage mapping. There is no clear justification for why each component software was used, for example the use of GATK and freebayes for SNP calling. I am familiar with using TASSEL GBS and STACKS for SNP calling using GBS data; why weren't they included in the SNP calling software? The MS would benefit greatly from including these SNP calling software in their benchmarking. OneMap and GUSMap also seem pre-selected for linkage mapping, without reason for use, or maybe the reason(s) were not highlighted in the text. I've had experience with the venerable MAPMAKER and MSTMap, and would like to see more comparisons of the chosen genetic linkage mapping software with others, if this is the intent of the MS. The MS also clearly focuses on genetic linkage mapping using GBS, which should be more explicitly stated in the title. GBS is also extensively used in diversity collections, and there is scant mention of this in the MS, or of whether the workflow could be adapted to such populations. Versions of software used in the workflow are also not explicitly stated within the MS. The shiny app is also not demonstrated well in the MS; it could be presented better with screenshots of the interface, with one or two sample use cases.
2. GigaScience 13 Nov 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  Reviewer name: Peter M. Bourke
  
  I read with interest the manuscript on Reads2Map. A really impressive amount of work went into this and I congratulate the authors on it. However, it is precisely this almost excessive amount of results that for me was the major drawback of this paper. I got lost in all the detail, and have therefore suggested a Major Revision to reflect that I think the paper could be made more streamlined, with a clearer central message and fewer figures in the text. Line numbers would have been helpful; I have tried to give the best indication of page number and position, but in future @GigaScience please stick to line numbers for reviewers, it's a pain in the neck without them.
  
  Overall I think this is an excellent manuscript of general interest to anyone working in genomics, and definitely worthy of publication. Here are my more detailed comments:
  
  General comment: if a user would like to use GBS data for other population types than those amenable for linkage mapping (e.g. GWAS or genomic prediction, so a diversity panel or a breeding panel), how could your tool be useful for them?
  
  Other general comment: the manuscript is long with an exhaustive amount of figures and supplementary materials. Does it really need to be this detailed? It appears like the authors lost the run of themselves a little bit and tried to cram everything in, and in doing so risk losing the point of the endeavour. What is the central message of this manuscript? Regarding the figures, the reader cannot refer to the figures easily as they are now mainly contained on another page. Do you really need Figures 16-18 for example?
  
  Figures 13 and 14 could be combined perhaps? I am sure that at most 10 figures and maybe even less are needed in the main text, otherwise figures will always be on different pages and hence lose their impact in the text call-out.
  
  Abstract and page 4: "global error rate of 0.05" - How do you motivate the use of a global error rate of 5%? Surely this is dataset-dependent?
  
  Page 4 - how can a user estimate an error per marker per individual? The description of the create_probs function suggests there is an automatic methodology to do this, but I don't see it described. You could perhaps refer to Zheng et al's software polyOrigin, which actually locally optimises the error prior per datapoint. Maybe something for the discussion.
  
  Page 6 "recombination fraction giving the genomic order" - do you mean "given"?
  
  Page 10 section Effects of contaminant samples - if you look at Figure 9 you can see that the presence of contaminant samples seems to have an impact on the genotypes of other, non-contaminant samples, especially using GATK and 5% global error. With the contaminants present, the number of XO points decreases in many other samples. This is very odd behaviour I would have thought. Is it known whether this apparent suppression of recombination breakpoints in non-contaminant individuals is likely to be "correct"? Perhaps the SNP caller was running under the assumption that all individuals were part of the same F1? If the SNP caller was run without this assumption (e.g. specifying only HW equilibrium, or model-free) would we still see the same effect? This is for me a quite worrying result but something that you make no reference to as far as I can tell.
  
  Page 12 "Effects of segregation distortion" In your study you only considered a single linkage group. One of the primary issues with segregation distortion in mapping is that it can lead to linkage disequilibrium between chromosomes, if selection has occurred on multiple loci. This can then lead to false linkages across linkage groups. Perhaps good to mention this.
  
  Page 12 "have difficulty missing linkage information" - missing word "with"
  
  Page 17 I see no mention of the impact of errors in the multi-allelic markers on the efficiency, particularly of order_seq which seems to be very poorly-performing with only bi-allelics (Fig 20). If bi-allelic SNPs have errors then it is not obvious why multi-SNP haplotypes should not also have errors.
  
  Page 3 Figure 1 - here the workflow shows multiple options for a number of the steps, which can lead to the creation of many map variants (e.g. 816 maps as mentioned on Page 4). Should all users produce 816 variants of their maps? With potentially millions of markers, this is going to take a huge amount of time (most users will want 100% of all chromosomes, not 37% of a single chromosome). Or should this be done for only a subset of markers? What if there is no reference sequence available to select a subset? As there are no clear recommendations, I suspect that the specific combination of pipeline choices will usually be dataset-dependent. You actually mention this in the discussion (page 17). And with only 2 real datasets from 2 different species, there is also no way to tell if e.g. GATK works best in rose, or updog should be used for monocots but not dicots etc. It would be helpful if the authors were more explicit about how their tool informs "best practices for GBS analysis" for ordinary users. Perhaps it is there, but for me this message gets lost.
  
  Page 17 "updates in this version 3.0 to resolve issues with inflated genetic maps" - if I look at Figure 20, it seems that issues with inflated map length have not yet been fully resolved!
  
  Page 17 "we provide users tools to select the best approaches" - similar comment as before - does this mean users should build > 800 maps with a subset of their dataset first, and then use this single approach for the whole dataset? It is not explicitly stated whether this is the guidance given. What is the eventual aim - to produce a good linkage map, or to use the linkage map to critically compare genotyping tools?
3. GigaScience 13 Nov 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  **Reviewer name: Zhenbin Hu**
  
  In this MS, the authors tried to develop a framework for using GBS data for downstream analysis and for reducing the impact of sequencing errors in GBS data. However, sequencing error is not an issue specific to GBS; it also affects whole-genome sequencing. Actually, I think the major issue for GBS is missing data. However, in this MS, the authors did not test the impact of missing data on downstream analysis.
  
  The authors also mentioned that sequencing errors may cause segregation distortion in linkage map construction; however, segregation distortion can also occur with correct genotype data. Segregation distortion can be caused by selection of individuals during the construction of the population. So I don't think it is correct to use segregation distortion to correct sequencing errors.
  
  The authors need to clarify the major question of this MS: in the abstract, the authors highlight the sequencing errors, while in the introduction, they highlight the package for linkage map construction (the last paragraph). Actually, from the MS, the authors were assembling a framework for genotyping-by-sequencing data.
  
  The two major reduced-representation sequencing approaches, GBS and RADseq, have specific tools for genotype calling, such as TASSEL and Stacks. However, the authors used the GATK and Freebayes pipelines for variant calling; they need to present the reason they did not use TASSEL and Stacks.
  
  In genotyping-by-sequencing data, individuals are barcoded and pooled during sequencing. What package or code was used to split the individuals (demultiplex) from the fastq files for the GATK and Freebayes pipelines?
  
  The maximum missing data allowed for markers was 25%; what about the missing rate per individual?
  
  On page 6, the authors mentioned a 'sequence size of 350'; what does that mean?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.11.24.517847v4
www.biorxiv.org www.biorxiv.org

cellsnake: a user-friendly tool for single cell RNA sequencing analysis

2
1. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Abstract Background Single-cell RNA sequencing (scRNA-seq) provides high-resolution transcriptome data to understand the heterogeneity of cell populations at the single-cell level. The analysis of scRNA-seq data requires the utilization of numerous computational tools. However, non-expert users usually experience installation issues, a lack of critical functionality or batch analysis modes, and the steep learning curves of existing pipelines. Results We have developed cellsnake, a comprehensive, reproducible, and accessible single-cell data analysis workflow, to overcome these problems. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. Conclusion As an open-source tool, cellsnake is accessible through Bioconda, PyPi, Docker, and GitHub, making it a cost-effective and user-friendly option for researchers. By using cellsnake, researchers can streamline the analysis of scRNA-seq data and gain insights into the complex biology of single cells.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad091 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  **Reviewer name: Qianqian Song**
  
  This paper offers an open-source tool, cellsnake, to perform single-cell data analysis. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. I like the incorporation of metagenome analysis in this tool, which sets it apart from other available single-cell analysis tools.
  
  1) I looked through the tutorial and have a specific question regarding the resolution parameter. Does the resolution argument need to be pre-selected, or can cellsnake select a resolution automatically?
  
  2) Is it possible to add color legends to the UMAP, rather than labelling all cell types on the plot? It can be very hard to distinguish the cell types, especially when there are many of them.
  
  3) If the single-cell data is profiled from human tissue, is it also possible to use cellsnake to perform microbiome analysis?
  
  4) I recommend that the authors compare cellsnake with other existing tools. Pros and cons need to be highlighted.
2. GigaScience 13 Nov 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad091 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  Reviewer name: Tazro Ohta
  
  The manuscript describes Cellsnake, a user-friendly tool for single-cell RNA sequencing analysis that targets non-expert users in the field of bioinformatics. Cellsnake operates as a command-line application, providing offline analysis capabilities for sensitive data. The integration of popular single-cell RNA-seq analysis software within Cellsnake, as described in Table 1, enhanced its utility as a comprehensive workflow. Cellsnake has different execution options (minimal, standard, and advanced) with varying outputs and execution times. The authors have provided well-structured online documentation, including helpful quick-start examples that facilitated easy understanding and usage of Cellsnake.
  
  The tool was tested using the Docker appliance and the provided fetal brain dataset and performed as expected. The manuscript explains the functions well, with the results reproduced from existing research using publicly available datasets. The following issues need to be addressed by the authors.
  
  The authors should include the citation for the Snakemake paper to acknowledge its contribution. https://doi.org/10.1093/bioinformatics/bts480
  
  To support the claim of unique features in Cellsnake, a comparison with other similar methods, such as that on Galaxy (https://doi.org/10.1093/gigascience/giaa102), should be included.
  
  It is recommended to host the Docker container image on both the GitHub Container Registry and the Docker Hub for better availability and redundancy. The authors should publish the Dockerfile to enable users to build a container image, if needed.
  
  Online documentation is missing a link to the fetal-liver example dataset (https://cellsnake.readthedocs.io/en/latest/fetalliver.html), which needs to be addressed. The fetalbrain dataset shared via Dropbox should also be deposited in the Zenodo repository to improve accessibility and long-term preservation.
  
  To assist users who want to use Cellsnake as a Snakemake workflow, the tool documentation should provide clear instructions on how to run Cellsnake as a single snakemake pipeline. This would be useful for users who utilize existing workflow platforms to accept snakemake requests.
  
  The benchmarking of Cellsnake must provide more precise specifications than simply referring to "a standard laptop" for computing requirements. My trial of "cellsnake integrated standard" with the fetalbrain dataset took more than 17 h via Docker execution on my M1 Max MacBook Pro. This may be because the provided Docker image is AMD-based, which made my MacBook run the container on a VM, but stating the recommended computational specifications would help users. A GitHub issue in the Cellsnake repository also mentions that the software is not tested on Windows Conda, which should be mentioned at least in the online documentation.
  
  In the Data Availability section, please ensure that the correct formatting and consistent identifiers are used for public data, such as replacing SRP129388 with PRJNA429950 and E-MTAB-7407 with PRJEB34784, specifying that these IDs are from the Bioproject database. It is important to mention that EGA files are under controlled access, requiring user permission for retrieval.
  
  The references in the manuscript need to be properly formatted to ensure the inclusion of publication years and DOIs where available.
  
  The help message from the Cellsnake command indicates that its default values are set for human samples. The authors should mention in the manuscript that the pipeline is configured for human samples and requires further configuration for use with samples from other organisms. A step-by-step guide to configuring the settings for other species, including the reference data download, would help the tool reach a wider audience.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.05.03.539204v3
www.biorxiv.org www.biorxiv.org

SpheroScan: A User-Friendly Deep Learning Tool for Spheroid Image Analysis

2
1. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Background In recent years, three-dimensional (3D) spheroid models have become increasingly popular in scientific research as they provide a more physiologically relevant microenvironment that mimics in vivo conditions. The use of 3D spheroid assays has proven to be advantageous as it offers a better understanding of the cellular behavior, drug efficacy, and toxicity as compared to traditional two-dimensional cell culture methods. However, the use of 3D spheroid assays is impeded by the absence of automated and user-friendly tools for spheroid image analysis, which adversely affects the reproducibility and throughput of these assays. Results To address these issues, we have developed a fully automated, web-based tool called SpheroScan, which uses the deep learning framework called Mask Regions with Convolutional Neural Networks (R-CNN) for image detection and segmentation. To develop a deep learning model that could be applied to spheroid images from a range of experimental conditions, we trained the model using spheroid images captured using IncuCyte Live-Cell Analysis System and a conventional microscope. Performance evaluation of the trained model using validation and test datasets shows promising results. Conclusion SpheroScan allows for easy analysis of large numbers of images and provides interactive visualization features for a more in-depth understanding of the data. Our tool represents a significant advancement in the analysis of spheroid images and will facilitate the widespread adoption of 3D spheroid models in scientific research. The source code and a detailed tutorial for SpheroScan are available at https://github.com/FunctionalUrology/SpheroScan.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad082 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  **Reviewer name: Francesco Pampaloni**
  
  This study represents a significant contribution to the field of screening and analysis of three-dimensional cell cultures. The demand for reliable and user-friendly image processing tools to extract quantitative data from a large number of spheroids or other types of three-dimensional tissue models is substantial. The authors of this manuscript have developed a tool that aims to address this need by providing a straightforward method to extract the projected area and intensity of individual cellular spheroids imaged with bright-field microscopy. The tool is compatible with "Incucyte" microscopes or any other automated microscope capable of imaging multiple specimens, typically found in high-density multiwell plates.
  
  An admirable aspect of this work is the authors' decision to make all the code and pipeline openly available on GitHub. This openness allows other scientists to test and validate the code, promoting transparency and collaboration in the scientific community. However, several improvements should be made to the manuscript prior to publication.
  
  One important aspect that the authors should address in the manuscript is the suitability, rationale, and extent of using a neural network-based segmentation approach for the specific analysis described in the manuscript: segmentation of single bright-field images of spheroids.
  
  While neural networks are anticipated to play an increasingly important role in microscopy data segmentation in the coming years, they are not a universal solution. Although there may be segmentation tasks that are challenging to accomplish with traditional approaches, where neural networks can be highly effective, other segmentation tasks can be successfully performed using conventional strategies. For example, in our research group, we were able to reliably segment densely populated bright-field images containing numerous organoids in a single field of view using a pipeline based on the ImageJ plugin MorphoLibJ (see references: https://doi.org/10.1093/bioinformatics/btw413 and https://doi.org/10.1186/s12915-021-00958-w). Therefore, it would be informative and valuable for readers if the authors compared the results obtained from the neural network with those achieved by employing simple classical techniques (such as Otsu thresholding or watershed segmentation) on the same dataset, as demonstrated in a similar study (reference: https://doi.org/10.1038/s41598-021-94217-1, Figure 5).
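
  As an illustration of the kind of simple baseline suggested above, the following is a minimal sketch of Otsu thresholding in plain Python. It is not taken from SpheroScan or from the cited studies; the function name `otsu_threshold`, the 8-bit intensity range, and the toy pixel values are assumptions made for this example, as is the premise that the spheroid appears darker than the background in bright-field images.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity that maximises between-class variance (Otsu's method).

    `pixels` is a flat list of integer intensities in [0, levels - 1].
    Pixels with intensity <= threshold form the "low" (darker) class.
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_low = 0      # number of pixels in the low class so far
    sum_low = 0         # sum of intensities in the low class so far
    best_threshold = 0
    best_variance = -1.0

    for level in range(levels):
        weight_low += hist[level]
        sum_low += level * hist[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue    # no pixels in the low class yet
        if weight_high == 0:
            break       # all pixels already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


# Toy image (hypothetical values): 30 dark "spheroid" pixels, 70 bright background pixels.
pixels = [40] * 30 + [200] * 70
threshold = otsu_threshold(pixels)
spheroid_pixels = [v for v in pixels if v <= threshold]
```

  On real images such a baseline would normally be followed by morphological clean-up, and its masks could then be compared with the neural network output on the same dataset.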
  
  Furthermore, to address the limitations of the model, the authors should provide specific examples (preferably in the supplementary material due to space constraints) of incorrect segmentations or artifacts that arise from applying the neural network to the data. For instance, it would be beneficial to explore scenarios where spheroids are surrounded by cellular debris or when multiple spheroids are present in the field of view. These real-life situations are common and it is important to provide insights into potential challenges that may arise when the images of the spheroids are not pristine.
2. GigaScience 13 Nov 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad082 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:
  
  **Reviewer name: Kevin Tröndle**
  
  The authors present a "Technical Note" about an open-source web tool called SpheroScan. As input, users can upload (large batches of) spheroid images (brightfield, 2D). The tool delivers two outputs: (1) Prediction Module: creates a file with the area and intensity of detected spheroids (CSV); (2) Visualization Module: plots of the corresponding parameters (PNG). Performance was tested on 480 Incucyte images and 423 microscope images with 336 (70 %) and 265 for training, 144 (30 %) and 117 for validation, and 50 images for testing, respectively. The framework is based on Mask R-CNN and the Detectron2 library. The performance was tested in the range of 0.5 to 0.95 against manual annotation (VGG Annotator). As the evaluation measure they used Intersection over Union (IoU), which determines the overlap between the predicted and ground-truth regions, and calculated values of Average Precision (AP) for masking: 0.937 and 0.972 (test), 0.927 and 0.97 (validation), as well as AP for bounding boxes: 0.899 and 0.977 (test), 0.89 and 0.944 (validation). They show a linear runtime, demonstrated with differently sized datasets (1 s / image) for masking on a 16-core CPU, 64 GB RAM machine. The tool is available on GitHub and claimed to be available as a web tool on spheroscan.onrender.com.
  
  General evaluation: The concept of the tool serves some important needs of 3D cell culture-based assays: automated, standardized, high-throughput image analysis. As such, it represents value added for the research field.
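
  For readers unfamiliar with the IoU measure mentioned in the summary above, here is a minimal sketch in plain Python. It assumes that the reported range of 0.5 to 0.95 refers to IoU thresholds in steps of 0.05, as is common in COCO-style evaluation; the function name `mask_iou` and the toy masks are hypothetical and are not part of SpheroScan or Detectron2.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two equally sized binary masks.

    Masks are nested lists of 0/1 values (rows of pixels).
    """
    intersection = 0
    union = 0
    for row_pred, row_truth in zip(pred, truth):
        for p, t in zip(row_pred, row_truth):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy 3x3 masks (hypothetical): predicted vs. manually annotated spheroid.
pred = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
truth = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]

iou = mask_iou(pred, truth)  # 3 shared pixels out of 5 in the union

# Assumed threshold sweep: 0.50, 0.55, ..., 0.95.
thresholds = [t / 100 for t in range(50, 100, 5)]
passed = [t for t in thresholds if iou >= t]
```

  A detection counts as a true positive at a given threshold only if its IoU with the annotation reaches that threshold; Average Precision then summarises precision over recall, averaged across the thresholds.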
  
  However, it remains open how high the impact, the reproducibility, and the chances of potential application by other researchers will be. This is due to some significant limitations in accessibility (i.e. non-permanent or non-functional web tool), as well as the (potential) restriction of input data (i.e. brightfield only, not validated with external data) and the limited options for analysis of the metadata (i.e. area and intensity only). The greatest value stems from the possibility to access a web interface, which is easy to use and will ideally be equipped with additional functionalities in the future.
  
  Comment 1 (minor): The presented tool uses the Mask R-CNN deep-learning model in its image processing pipeline. Several tools that perform image segmentation based on this or other models are well established and already implemented in several commercial imaging devices. They allow for segmentation of cell-containing image areas, e.g. to determine confluency or cell migration in "wound healing assays", and are mainly optimized for 2D cultures but also applicable to 2D images of 3D spheroids. The concept of automated image segmentation is thus not novel and only meets the journal's input criterion as an "update or adaptation of existing" tools. The state of the art and preliminary work are not sufficiently referenced. Several similar and alternative (open-source) tools exist and should be mentioned in the manuscript, e.g. (Lacalle et al., 2021; Piccinini et al., 2023; Trossbach et al., 2023), to give only a few examples.
  
  Comment 2 (major): The authors claim to present a user-friendly open-source web tool. The Python project is available on GitHub, and on a demo server (https://spheroscan.onrender.com/) where the web interface can be accessed. Unfortunately, the mentioned web tool is not functional, i.e. it is stated on the website: "This is a demonstration server and the prediction module is not available for use. To utilize the prediction functionality, please run SpheroScan on your local machine." This significantly limits the applicability of the presented tool to users who are able to execute Python code on their local hardware. Therefore, the demo server should either present a functional user interface (recommended), or the statement should be removed from the manuscript, which would limit the impact of the submission significantly.
  
  Comment 3 (major): The presented algorithm was trained exclusively on internal data of brightfield images from "Incucyte and microscope platforms". Furthermore, two distinct models were generated, working with either Incucyte or microscope images. It remains unclear how the algorithm will perform on external data of prospective users. The fact that two distinct models had to be trained for different image sources (i.e. from two different platforms) indicates limited robustness of the models in this regard. This is clearly a general problem of image processing algorithms, but one that will stand in the way of applicability by external users, who will certainly use other imaging techniques. Since the web tool interface is not functional at this point, the authors will also not be able to evaluate or improve on this after publication. At least one performance test with external data, ideally obtained from a blinded user, should be performed to further elaborate on this.
  
  Comment 4 (major): Many assays nowadays use fluorescent labels, for example to calculate cell ratios within 3D arrangements, e.g. for cell viability or the expression of certain proteins. The authors do not state whether the algorithm (or future iterations thereof) is or will be able to process multi-channel microscope images of spheroids. This is a significant limitation of the presented work and should at least be mentioned in the corresponding section. Furthermore, a proof-of-concept test run with fluorescent images could be performed to test the algorithm's performance and derive potentially necessary adaptations for future versions.
  
  Comment 5 (minor): The output of the tool is a list of detected spheroids with the corresponding area (2D) and average bright-field intensity within that area. The usability of these two parameters is limited to specific assays, such as the mentioned use case of investigating collagen gel contraction assays. Several other parameters of interest could easily be derived from the metadata, such as roundness, volume estimation (assuming a spherical shape), or even cell count estimation. This should again be mentioned in the "limitations and considerations" section.
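
  To make the suggestion in Comment 5 concrete, the sketch below derives an equivalent radius, a volume estimate, and a circularity value from a projected area. It is an illustration under stated assumptions, not part of SpheroScan: the function names are hypothetical, the volume assumes a perfectly spherical spheroid, circularity (4πA/P²) is only one common definition of roundness, and the perimeter is assumed to be obtainable from the segmentation mask, since the tool as described reports only area and intensity.

```python
import math


def equivalent_radius(area):
    """Radius of the circle with the same area as the projected spheroid."""
    return math.sqrt(area / math.pi)


def sphere_volume_from_area(area):
    """Volume estimate assuming the spheroid is a perfect sphere."""
    radius = equivalent_radius(area)
    return 4.0 / 3.0 * math.pi * radius ** 3


def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1 for a perfect circle and is lower for irregular outlines."""
    return 4.0 * math.pi * area / perimeter ** 2


# Hypothetical spheroid with a circular projection of radius 10 (arbitrary units).
area = math.pi * 10 ** 2
perimeter = 2 * math.pi * 10

radius = equivalent_radius(area)
volume = sphere_volume_from_area(area)
roundness = circularity(area, perimeter)
```

  Areas reported in pixels would first need to be converted with the image's pixel size before the volume has physical units.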

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.06.28.533479v1
www.biorxiv.org www.biorxiv.org

Computational prediction of human deep intronic variation

2
1. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Abstract: The adoption of whole genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to differentiate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that affect splicing regulatory elements or the branchpoint region. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad085 ), which carries out open, named peer-review. The review is published under a CC-BY 4.0 license:
  
  Reviewer name: Raphael Leman
  
  Summary: In this work, Barbosa et al. present a benchmark of several splicing predictors for human intronic variants. Overall, the results of this study show that deep learning-based tools such as SpliceAI outperform the other splicing predictors in detecting splice-disrupting variants and thus pathogenic variants.
  
  The authors also detail the performance of these tools on several subsets of the data according to the collection origin of the variants and according to their genomic location. This work is one of the first large, independent studies of splicing prediction performance for intronic variants, and in particular deep intronic variants, in the context of molecular diagnosis. This work also highlights the need for reliable prediction tools for these variants and shows that the splicing impact of these variants is often underestimated. However, I consider that major points should be resolved before the article is considered for publication.
  
  **Major points**
  
  1 The most important point is that the authors show results in the main text but, in the following paragraphs, state that these results are biased. In addition, the results that take these biases into account are shown only in the supplementary data, and readers must make the correction themselves to get the "true" results. Indeed, the interpretation changes drastically between the biased results and the "true" results. The two main biases were: i) the use of ClinVar data already used for the training of CAPICE (see my comment n°2), ii) the intronic tags of the variants and the relative distances to the nearest splice site were wrong (see my comment n°5). Consequently, the authors should remove these biased results and show only the results after bias correction.
  
  2 Importantly, several tools used ClinVar variants or published data to train and/or validate their models. Therefore, to perform a benchmark on a truly independent collection of variants, the authors should ensure that there is no overlap between the variants used for tool development and those used in the present study.
  
  3 As the authors show by comparing the ClinVar classification (N = 54,117 variants) with the impact on RNA from in vitro studies (N = 162 variants), there were discrepancies between these two sources of information (N = 13/74 common variants, 18%). Consequently, using the ClinVar classification to assess the performance of splicing prediction tools is not optimal. To partially address this point, I think it could be interesting to study further (e.g. minor allele frequency, availability of in vitro RNA studies, …) the intronic variants with positive splicing predictions from two or more tools and a benign or likely benign ClinVar classification and, conversely, the intronic variants with negative splicing predictions from two or more tools and a pathogenic or likely pathogenic ClinVar classification.
  
  4 The authors used pre-computed databases for 19 tools, but most of these databases do not include small indels and so artificially add missing data to the disadvantage of the tool, although the same tool could score these indel variants de novo.
  
  5 The authors said that "We hypothesized that variability in transcript structures could be the reason [increase in performance in the deepest intronic bins]: despite these variants being assigned as occurring very deep within introns (> 500bp from the splice site of the canonical isoform) in the reference isoform, they may be exonic or near-splice site variants of other isoforms of the associated gene". To address this transcript structure variability, the authors could first use a weighted relative distance, as follows: $\left|\,\left|\mathrm{Pos}_{\text{nearest splice site}} - \mathrm{Pos}_{\text{variant}}\right| - \mathrm{Intron\_Size}\,\right| / \mathrm{Intron\_Size}$. Second, the ClinVar data contain the RefSeq transcript ID on which the variant was annotated (except for large duplications/deletions), so the authors should map these RefSeq transcript IDs to the transcripts used to perform the splicing predictions.
  
  6 With respect to the six categories of splice-altering variants, it is unclear how the authors considered cases in which variants alter physiological splice motifs (e.g., natural consensus sequences 3'SS/5'SS, branch point, or ESR) but, instead of exon skipping, the spliceosome recruits another distant splice site that is partially or not affected by the variant.
  
  7 In Table 1, which lists the tools considered for this study, please specify for each tool on which collections of data (ClinVar or splicing-altering variants) and for which genomic regions the benchmark was done. This information will make the article easier to read.
  
  8 In line with my comment n°3, not all spliceogenic variants are necessarily pathogenic. The mutant allele could produce aberrant transcripts without a frameshift and without impact on the functional domains of the protein. In addition, transcription could also lead to a mix of aberrant and full-length transcripts. As a result, the main goal of splicing prediction tools is to detect splice-altering variants. Considering variants with a positive splicing prediction as pathogenic is a dangerous shortcut, and only an in vitro RNA study can confirm the pathogenicity of a variant. The discussion section should be updated accordingly.
  
  9 The authors claimed that: "The models [SQUIRLS and SPiP] were frequently able to correctly identify the type of splicing alteration, yet they still fail to propose higher-order mechanistic hypotheses for such predictions.". I think that the authors over-interpreted the results (see my comment n°21).
  
  10 The authors recommended prioritizing intronic variants using CAPICE. Is this still true once the bias is corrected (see my comment n°1)?
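  
  The weighted relative distance proposed in major point 5 can be written out as a short sketch. It transcribes the reviewer's expression literally and assumes that the garbled operator in the source is a division by the intron size; the function name, argument names, and coordinates are hypothetical and only serve as illustration.

```python
def weighted_relative_distance(
    splice_site_pos: int, variant_pos: int, intron_size: int
) -> float:
    """Literal transcription of | |Pos_ss - Pos_var| - Intron_Size | / Intron_Size.

    Positions are genomic coordinates on the same chromosome; `intron_size`
    is the length of the intron that contains the variant.
    """
    if intron_size <= 0:
        raise ValueError("intron_size must be positive")
    # Absolute distance between the variant and its nearest splice site.
    distance = abs(splice_site_pos - variant_pos)
    # Normalise by intron length so that introns of different sizes are comparable.
    return abs(distance - intron_size) / intron_size


# Hypothetical example: a variant 500 bp from the splice site in a 2,000 bp intron.
score = weighted_relative_distance(1000, 1500, 2000)
```

  Under this reading, a variant located exactly at the splice site scores 1.0 and the score decreases as the variant moves deeper into the intron, so two variants at the same absolute distance receive different scores in short and long introns.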
  
  **Minor points**
  
  11 In the introduction, the authors could clearly define the canonical splice site regions (AG/GT dinucleotides in 3'SS: -1/-2 and 5'SS: +1/+2) to distinguish them from the consensus splice sites, commonly defined as 3'SS: -12 (or -18)/+2 and 5'SS: -3/+6.
  
  12 In the introduction, please also add that splice site activation could also be due to disruption of a silencer motif.
  
  13 In ref [17], the authors did not say that the enrichment of splicing-related variants within splice site regions was linked to exon and splice site sequencing. They proved that whole genome sequencing increased the diagnostic rate of rare genetic diseases; they did not actually focus on splicing variants. This enrichment was more probably induced by the fact that geneticists mainly studied variants with positive splicing predictions.
  
  14 In the paragraph 'The prediction tools studied are diverse in methodology and objectives', please add that most prediction tools target consensus splice sites (e.g. MES, SSF, SPiCE, HSF, Adaboost, …).
  
  15 In the paragraph 'The prediction tools studied are diverse in methodology and objectives', the authors claimed that 'sequence-based deep learning models such as SpliceAI, which do not accept genetic variants as input.' but this is incorrect, as SpliceAI can accept a VCF file as input.
  
  16 In the paragraph 'Pathogenic splicing-affecting variants are captured well by deep learning based methods', this is further explained in the method section, but I think a sentence explaining that the 243 variants comprise 81 variants described in ref [19] and 162 variants from a new collection would make the article easier to read.
  
  17 In the paragraph 'Pathogenic splicing-affecting variants are captured well by deep learning based methods', among the 13 variants incorrectly classified, please detail how many variants were classified as benign and how many as VUS.
  
  18 Due to the blue gradient, Fig 1C is hard to analyze.
  
  19 In the paragraph 'Branchpoint-associated variants', the variants reported in ref [79] were studied in a tumoral context, so the observed impact may not be the same in healthy tissue.
  
  20 In the paragraph 'Exonic-like variants', the authors changed the parameters of the SpliceAI predictions, from the original parameters used for the precomputed scores, to take into account variants located deep inside the pseudoexon. Please check whether other prediction tools also have user-defined optimizable parameters to take these variants into account.
  
  21 In the paragraph 'Assessing interpretability', the authors observed that non-informative SPiP annotations presented a high score level. This could be explained by the tool reporting a positive prediction without annotation simply because the model score was high, without relation to a particular splicing mechanism.
  
  22 In the paragraph 'Assessing interpretability', the authors could compare the SpliceAI annotations regarding the abolition/creation of splice sites, and their positions relative to the variants, with the observed effect on RNA.
  
  23 In the paragraph 'Predicting splicing changes across tissues', by my count the analysis of AbSpliceDNA predictions was done on 89 variants (154 - 65 = 89); if true, please indicate this clearly in the text.
  
  24 In the method section, paragraph "ClinVar", for the 13 variants with discordance between the classification and the observed splicing impact, how many confidence stars did they have?
  
  25 In the method section, paragraph "Disease-causing intronic variants affecting RNA splicing", the authors filtered out variants within 10 bp of the nearest splice site; please explain why.
  
  26 In the method section, paragraph "Disease-causing intronic variants affecting RNA splicing", the authors used gnomAD variants as a control set; however, their threshold of variant frequency is too low (1%). Indeed, some pathogenic variants involved in recessive genetic disorders have a high frequency in the population. A threshold of 5% is more appropriate.
  
  27 In the method section, paragraph "Variants that affect RNA splicing", the authors should describe how they considered variants leading to multiple aberrant transcripts and variants with a partial effect (i.e., the mutant allele still producing full-length transcript).
  
  28 In the method section, paragraph "Variants that affect RNA splicing", regarding the six categories defined by the authors: how were indel variants annotated if they overlapped several categories? The new splice donor/acceptor categories included only variants creating new AG/GT or variants occurring within the consensus sequences of cryptic splice sites. Within the Donor-downstream category, please make the distinction between variants located at [+3; +6] bp (i.e. the consensus sequence) and variants beyond +6 bp. The exonic-like variants could be variants that did not impact ESR motifs (see my comment n°6).
  
  29 In the method section, paragraph "Variants that affect RNA splicing", the authors select, for the control datasets, variants generating the CAGGT and GGTAAG motifs. However, this approach leads to an over-enrichment of false positives. Moreover, among the variants creating new splice sites or pseudoexons, it could also be interesting to identify the presence of the GC donor motif or the U12 minor spliceosome motif (AT/AC) and how well the different splicing tools detect them.
  
  30 In Fig S3C, scale the gnomAD population frequency as -log₁₀(P) to make the figure more readable.
  
  31 I saw double spaces several times in the text; please correct them. English is not my native language, so I am not the best judge, but some sentences seem syntactically incorrect (e.g. "The splicing tools with the smallest and largest performance drop between the splice site bin ("1-2") and the "11-40" bin were Pangolin and TraP, with weighted F1 scores decreasing by 0.334 and 0.793, respectively"). Please have the article proofread by someone who is fluent in English.
2. GigaScience 13 Nov 2023
  
  in GigaScience
  
  The adoption of whole genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to differentiate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that affect splicing regulatory elements or the branchpoint region. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad085 ), which carries out open, named peer-review. The review is published under a CC-BY 4.0 license:
  
  **Reviewer name: Jean-Madeleine de Sainte Agathe**
  
  This manuscript presents an important and very exhaustive benchmark concerning intronic variant splicing predictors. The focus on deep-intronic variants is highly appreciated as it addresses a very crucial challenge of today's genetics. The authors present the different tools in a very clear and pedagogical way. I should add that this manuscript is pleasant to read. The authors use the average precision score, allowing a refined comparison between tools.
  
  They give practical recommendations. They emphasize the use of SpliceAI and Pangolin for intronic variants. For branchpoint regions, they recommend Pangolin and LaBranchoR. It should be noted that this study is, to my knowledge, the first independent benchmark of Pangolin, CI-SpliceAI, ConSpliceML, AbSplice-DNA, SQUIRLS, BPHunter, LaBranchoR and SPiP together. Overall, this study is important as it will be very helpful for the interpretation of intronic variants. I hence fully and strongly support its publication. I have several comments that (I think) should be addressed before publication, especially the first point:
  
  1) I admit that the curation of such large datasets is challenging; however, I failed to find some of the Table S6 variants in the referenced work. Could you kindly point me to the referenced variation for the following variants?
  - The variant "1 hg38_156872925 C T NTRK1 ENST00000524377.1:c.851-708C>T pseudoexon_inclusion keegan_2022" is classified as 'affects_splicing'. However, I did not find it in Keegan 2022 (reference 20). In Keegan, Table S1 mentions NTRK1 variants but not c.851-708C>T. For these NTRK1 variants, Keegan et al. refer to another publication, Geng et al. 2018 (PMC6009080), where I cannot find the ENST00000524377.1:c.851-708C>T variant either.
  - Same for "COL4A3 ENST00000396578.3:c.4462+443A>G 2:g.228173078A>G"
  - Same for "ABCA4 ENST00000370225.3:c.1937+435C>G 1:g.94527698G>C"
  - Same for "FECH ENST00000382873.3:c.332+668A>C 18:g.55239810T>G"
  - Concerning "MYBPC3 ENST00000545968.1:c.1224-52G>A 11:g.47364865C>T", I did not find it in pbarbosa as stated, but in another reference which, I think, should be mentioned in this manuscript: https://pubmed.ncbi.nlm.nih.gov/33657327/
  - "BRCA2 ENST00000544455.1:c.8332-13T>G 13:g.32944526T>G" is classified as splicing neutral based on moles-fernández_2021, but it has previously been shown to alter splicing (https://pubmed.ncbi.nlm.nih.gov/31343793/); please clarify.
  
  If these variants were somehow erroneously included, the authors should reprocess their results with the corrected datasets.
  
  2) Although it has been done before, the use of gnomAD variants as a set of splicing-neutral variants is questionable. Indeed, it is theoretically possible that such variants truly alter splicing. For example, genuine splicing alterations can result in mild in-frame consequences for the gene products, or splicing alterations can damage non-essential genes. I suggest that the authors either select another list of gnomAD variants located in disease-associated genes, where benign splicing alterations seem less plausible, or discuss this putative limitation in their results.
  
  3) Table S8: "Variants above 0.05, the optimized SpliceAI threshold for non-canonical intronic splicing variation" Is that a recommendation of this work? Or was it found elsewhere? Please clarify. More generally, this manuscript uses Average Precision scores, but the authors should explain to their non-statistician readers how it relates to the delta scores of each tool (Fig 3C). Indeed, any indication (or even recommendation, but not necessarily) concerning the use of cut-off values would be very appreciated by the geneticist community.
  
  4) p.3 "If the model is run twice, once with the reference and once with the mutated sequence, it is possible to measure splice site alterations caused by genetic variants." This study makes only use of the delta scores, which have previously been shown to be misleading in some rare cases (PMID 36765386). The authors would be wise to mention this. For example, in Table S3, "ENST00000267622.4:c.5457+81T>A 14(hg19):g.92441435A>T" is predicted by SpliceAI DG=0.16, but as the reference prediction is already at 0.84, this 0.16 is the maximal delta score possible, yielding donor score = 1.
  
  5) p.12 "Among the tools that predict across whole introns, SQUIRLS and SPiP are the only ones designed to provide some interpretation of the outcome." Concerning the nature of the mis-splicing event, I think the authors should mention SpliceVault, which has been specifically built for this task (pmid 36747048).
  
  6) p.14: "SpliceAI and Pangolin […]. If usability is a concern and users do not have a large number of predictions to make, SpliceAI is preferred since the Broad Institute has made available a web app for the task" The Broad Institute web app now includes Pangolin (at least for hg38 variants). Please rephrase or delete this sentence.
  
  7) Concerning complex delins, which are not annotated by the current version of SpliceAI, the authors should give recommendations. For example, the complex delins from Table S9 "hg19_chr7 5354081 GC AT" is correctly predicted by CI-SpliceAI and SpliceAI-visual, both of which allow the annotation of complex delins with the SpliceAI model.
  
  8) p.8 "Unfortunately, BPHunter only reported the variants predicted to disrupt the BP, rendering the Precision-Recall Curves (PR Curves) analysis impossible." I agree with the authors. However, I think it is sometimes assumed (wrongly?) that all variants unannotated by BPHunter have BPH_score=0. Maybe the authors could make this explicit, for example by saying that the lack of a prediction cannot be safely equated with a negative prediction.
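  
  The ceiling effect described in point 4 can be made concrete with a small worked sketch. It uses only the numbers quoted there (a reference donor score of 0.84 and a delta score of 0.16, hence a variant score of 1.0). The helper below is hypothetical and is not part of SpliceAI; it simply relates a delta score to the largest gain that the reference score still allows.

```python
def delta_headroom(ref_score: float, alt_score: float) -> dict:
    """Relate a splice-site delta score to the maximum gain still possible.

    Scores are probabilities in [0, 1]; a gain cannot exceed 1 - ref_score.
    """
    if not (0.0 <= ref_score <= 1.0 and 0.0 <= alt_score <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    delta = alt_score - ref_score
    max_gain = 1.0 - ref_score
    # Fraction of the available headroom used by the variant (None if no headroom).
    fraction = delta / max_gain if max_gain > 0 else None
    return {"delta": delta, "max_gain": max_gain, "fraction_of_max": fraction}


# Values quoted in point 4: reference donor score 0.84, variant donor score 1.0.
example = delta_headroom(0.84, 1.0)
```

  In this example the delta of 0.16 uses all of the available headroom, which is why a fixed delta cut-off can understate such a variant. The same delta on a reference score of 0.10 would use less than a fifth of the headroom.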

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.17.528928v1
www.biorxiv.org

Single-cell transcriptome analysis illuminating the characteristics of species-specific innate immune responses against viral infections

2
1. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Bats harbor various viruses without severe symptoms and act as their natural reservoirs. The tolerance of bats against viral infections is assumed to originate from the uniqueness of their immune system. However, how immune responses vary between primates and bats remains unclear. Here, we characterized differences in the immune responses by peripheral blood mononuclear cells to various pathogenic stimuli between primates (humans, chimpanzees, and macaques) and bats (Egyptian fruit bats) using single-cell RNA sequencing. We show that the induction patterns of key cytosolic DNA/RNA sensors and antiviral genes differed between primates and bats. A novel subset of monocytes induced by pathogenic stimuli specifically in bats was identified. Furthermore, bats robustly respond to DNA virus infection even though major DNA sensors are dampened in bats. Overall, our data suggest that immune responses are substantially different between primates and bats, presumably underlying the difference in viral pathogenicity among the mammalian species tested.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad086 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license.
  
  **Reviewer name: Doreen Ikhuva Lugano**
  
  This paper gives a good introduction to bats as reservoirs of several viral infections, which studies have shown is due to the uniqueness of their immune system. The authors and others suggest that the bat immune system is dampened, exhibiting tolerance to various viruses. This gives the study a good rationale for studying the bat immune system in comparison with other mammals. They also give a good rationale for using single-cell sequencing: it allows the identification of various cell types and of the differences between these cell types. From their findings, the main conclusion is that differences between host species are more impactful than those among the different stimuli. They also suggest that bats initiate an innate immune response after infection with DNA viruses through an alternative pathway. For example, the induction dynamics of PRRs seem to be different in their dataset. They also suggest this could be due to the presence of species-specific cellular subsets.
  
  1. Interesting model system and a good comparison of bats with other mammals.
  2. Good technique in using single-cell sequencing, with a clear rationale as to why it was chosen. This advances knowledge of what was already known about the bat immune system, but the species-specific cellular subsets are new.
  3. Interesting technique to go through the bulk transcriptomic data in four species and four conditions. This allowed the most important genes/pathways to be found.
  4. Good rationale and flow of experiments from one to another.
  5. I liked that they investigated stimuli from different pathogens, including DNA virus, RNA virus and bacteria, and still showed that bats had a different immune response across the different stimuli.
  
  Minor comments
  
  1. Do the authors speculate whether this occurs only in Egyptian fruit bats or in all species of bats?
  2. The introduction mentions why they used Egyptian fruit bats, which are a model organism, but expanding on this could help people outside the field understand exactly why these bats were used. Advantages? Location? Proximity to the various viruses, given that they are mostly found in endemic regions such as Africa?
  3. Can they include the viral load in each species?
  4. It is not clear which scRNA-seq tools were used for data analysis in identifying the cell types. Or did they use an already established marker-based database?
2. GigaScience 13 Nov 2023
  
  in GigaScience
  
  Bats harbor various viruses without severe symptoms and act as their natural reservoirs. The tolerance of bats against viral infections is assumed to originate from the uniqueness of their immune system. However, how immune responses vary between primates and bats remains unclear. Here, we characterized differences in the immune responses by peripheral blood mononuclear cells to various pathogenic stimuli between primates (humans, chimpanzees, and macaques) and bats (Egyptian fruit bats) using single-cell RNA sequencing. We show that the induction patterns of key cytosolic DNA/RNA sensors and antiviral genes differed between primates and bats. A novel subset of monocytes induced by pathogenic stimuli specifically in bats was identified. Furthermore, bats robustly respond to DNA virus infection even though major DNA sensors are dampened in bats. Overall, our data suggest that immune responses are substantially different between primates and bats, presumably underlying the difference in viral pathogenicity among the mammalian species tested.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad086 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license.
  
  **Reviewer name: Urs Greber**
  
  Hirofumi Aso and colleagues provide a manuscript entitled 'Single-cell transcriptome analysis illuminating the characteristics of species specific innate immune responses against viral infections'. The aim was to describe differences in innate immune responses of peripheral blood mononuclear cells (PBMCs) from different primates and bats against various pathogenic stimuli (different viruses and LPS). A major conclusion from the study is that differences in the immune response between primate and bat PBMCs are more pronounced than those between DNA, RNA viruses or LPS, or between the cell types. The topic is of interest as the immunological basis for how bats appear to be largely disease resistant to some viruses that cause severe infections in humans is not well understood. One notion by others has been that bats have a larger spectrum of interferon (IFN) type I related genes, some of which are expressed constitutively even in unstimulated tissue, and there, trigger the expression of IFN stimulated genes (ISGs). Alongside, enhanced ISG levels may need to be compensated for in bats. Accordingly, bats may exhibit reduced diversity of DNA sensing pathways, as well as absence of a range of proinflammatory cytokines triggered in humans upon encountering acute disease causing viruses. The study here uses single-cell RNA sequencing (scRNA-seq) analysis, and transcript clustering algorithms to explore the profile of different innate immune responses upon viral infections of PBMCs from H. sapiens, Chimpanzee, Rhesus macaque, and Egyptian fruit bat. Most commonly referred to cell types were detected in all four species, although naïve CD8+ T cells were not detected in bat PBMCs, which led the authors to focus on B cells, naïve T cells, killer T/NK cells, monocytes, cDCs, and pDCs. The study used three pathogenic stimuli, Herpes simplex virus 1 (HSV1), Sendai virus (SeV), and lipopolysaccharide (LPS).
  
  Specific comments
  
  The text is well written, concise, and per se interesting, but I have a few questions for clarification.
  
  1) Can the authors provide quality and purity control data for the virus inocula to document virus homogeneity? E.g., neither the methods, nor the indicated ref 26 specify if or how HSV1 was purified. Same is true for SeV where the provided ref 34 does not indicate if virus was purified or not. If virus inocula were not purified then it remains unclear to what extent the effects on the PBMCs described in the study here were due to virus or some other component in the inoculum. Conditions using inactivated inoculum might help to clarify this issue.
  
  2) What was the infection period? Was it the same for all viruses?
  
  3) Upon stimuli application, there was a notable expansion of B cells and a compression of killer T / NK cells in the bat but not the human samples, as well as compression of monocytes, the latter observed in all four species. Can the authors comment on this observation?
  
  4) Lines 78-79: I do not think that TLR9 ought to be classified as a cytosolic DNA sensor. Please clarify.
  
  5) Line 117: please clarify that the upregulation of proinflammatory cytokines, ISGs and IFNB1 was measured at the level of transcripts not protein.
  
  6) Line 244: DNA sensors. The authors report that bats responded well to DNA viruses, although some of their DNA sensing pathways (e.g., STING downstream of cGAS, AIM2 or IFI16) were attenuated compared to primates (H. sapiens, Chimpanzee, Macaque). They also allude to the dsRNA PRR TLR3, but I am not sure whether TLR3 is the only PRR that compensates for attenuated DNA sensing pathways. The authors might want to explicitly discuss whether other RNA sensors, such as RIG-I-like receptors (RIG-I, LGP2, MDA5), were upregulated similarly in bats as in primate cells upon inoculation with HSV1.
  
  7) Is it known how much TLR3 protein is expressed in bat PBMCs under resting and stimulated conditions? Same question for the DNA and RNA sensor proteins, e.g., cGAS, AIM2 or IFI16, RIG-I, LGP2, MDA5, or effector proteins, such as STING.
  
  8) Can authors clarify if cGAS is part of the attenuated DNA sensors in the bat samples under study here? And it would be nice to see the attenuated response of DNA sensing pathways in the bat samples, as suspected from the literature, including STING downstream of cGAS, or AIM2 and IFI16.
  
  9) What are the expression levels of IFN-I and related genes in the bat cells among the different stimuli?
  
  10) Technical point: where can the raw scRNA-seq data be found?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.12.06.519403v1
Oct 2023
www.biorxiv.org

Integrating deep mutational scanning and low-throughput mutagenesis data to predict the impact of amino acid variants

4
1. GigaScience 17 Oct 2023
  
  in GigaScience
  
  Abstract: Evaluating the impact of amino acid variants has been a critical challenge for studying protein function and interpreting genomic data. High-throughput experimental methods like deep mutational scanning (DMS) can measure the effect of large numbers of variants in a target protein, but because DMS studies have not been performed on all proteins, researchers also model DMS data computationally to estimate variant impacts by predictors. In this study, we extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results. To evaluate our model, we collected 146 AS datasets, mapping to 54 DMS datasets across 22 distinct proteins. We show that improved model performance depends on the compatibility of the DMS and AS assays, and the scale of improvement is closely related to the correlation between DMS and AS results.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad073 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Leopold Parts
  
  Summary Fu et al. explore utilising low-throughput mutational fitness measurements to predict the results of high-throughput deep mutational scanning experiments. They demonstrate that adding alanine scanning results to predictive models improves performance, as long as the alanine scan used a sufficiently similar evaluation approach to a deeper experiment. The findings make intuitive sense, and will be useful for the community to internalize.
  
  While we have several comments about the methods used, and requests to fortify the claims with more characterization, we do not expect addressing any of them will change the core findings. One can argue that direct application of AS boosted predictions is likely to be limited due to the number of scans available and the speed at which DMS experiments are now being performed, so it would also be useful to discuss the context of these results in the evolution of the field, and we make specific suggestions for this. Regardless, the presented results are a useful demonstration of a more general use case of low-throughput or partial mutagenesis data for improving fitness prediction and imputation.
  
  Major Comments
  
  There are many other computational variant effect predictors beyond Envision and DeMaSk. It would be very useful to see how their prediction results compare to some others, particularly the best performing and common models that are also straightforward to download and run (e.g. EVE, ESM1v, SIFT, PolyPhen2). This would be important context to see how impactful the addition of AS data is to DeMaSk/Envision. Please run additional prediction tools for reference of absolute performance; there is no need to incorporate AS data into them.
  
  Several proteins have a very small number of AS residues (Figure 2), and from our reading of the methods, other residue scores are imputed with the mean AS value for that protein. (As an aside, it would be good to clarify if this average is across studies or within study.) If this reading is correct, the majority of residues for each protein will have imputed AS results (e.g. in the case of PTEN, over 90%), which can be problematic for training and prediction. Please clarify if our interpretation of the imputation approach is correct, and if so, please also provide results for a model trained without imputation, on many fewer residues. If the boosting model has already implemented this, please integrate the Supplementary methods into the main methods, and reference these and the results when describing the imputation approach to avoid such concerns.
  
  It is not clear how significant/impactful the increases in performance are in figures 4, 5, S4, S5 & S6. Please use a reasonable analytical test, or training data randomization, to evaluate the improvement against a null model.
  
  There are quite a few proteins with repeated DMS/AS measurements. In our experience these correlate from moderately to very highly. Including multiple highly correlated studies could lead to pseudo-replication and bias the model performance results. Please present a version of the results where the repeats are averaged first to test whether that bias exists.
  
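One way to read the request for a randomization test against a null model is a paired sign-flip permutation test on per-study performance scores. The sketch below is an illustration only, not the authors' method: the function name, the score lists and the one-sided alternative are all assumptions made for the example.

```python
import random

def paired_permutation_pvalue(baseline, boosted, n_perm=10000, seed=0):
    """One-sided sign-flip permutation test on paired score differences.

    baseline, boosted: per-study scores (e.g. correlations) for the model
    without and with AS data, in the same study order. Hypothetical inputs.
    """
    if len(baseline) != len(boosted):
        raise ValueError("score lists must be paired and equally long")
    diffs = [b - a for a, b in zip(baseline, boosted)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is arbitrary.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value would suggest the mean improvement is unlikely under random sign assignment; with only a handful of studies the smallest attainable p-value is limited by the number of possible sign patterns.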
Minor Comments [suggestions only; no analyses required from us]
  
  A short discussion about the number of available alanine scans, particularly for proteins without DMS results, would help put the work in context. For example, it would be good to know how many proteins would benefit from improved de-novo predictions (e.g. no DMS data) and how many could have improved imputation (incomplete DMS data). Similarly, the rate and cost of DMS data generation is important to understand the utility of their results. I think a short discussion of how useful models of this sort are in practice now and in future would be helpful to the reader. This seems most natural as part of the end of the discussion, but could also fit in the introduction.
  
  Figure 2 is missing a y-axis label. We also softly suggest a log-scale axis, to not obscure the degree to which some proteins have more residues covered and the proportion of residues covered by AS.
  
  Figure 3 includes DMS/AS study pairs with at least three alanine substitutions to compare - we think this is a low cut-off, particularly with the regularisation applied. I think something like 10+ would be more informative.
  
  I think their cross-validation scheme leaves out an entire protein at a time, as opposed to one study each iteration. I agree this is the better way to do it. However, I initially read it as the latter, which would lead to leakage between train/validation data since the same residue would be included in both if a protein had multiple datasets. It might be useful to be more explicit to prevent other readers doing the same.
  
  L231 In the discussion they mention fitting a model only using studies with a minimum DMS/AS correlation. This occurred to me as well while reading the relevant part of the results. Is there a good reason not to do this? It doesn't seem like a large amount of work and conceptually seems a good way to assess a model that says what a DMS might look like if it had the same selection criteria as a given AS.
  
  L154 Similarly, a correlation cut-off as well as choosing the most correlated study seems like it would be a fairer comparison in figure 5. Just because an AS is the most correlated doesn't necessarily mean it is well correlated.
  
  It would be interesting to see if the improvement results in figure 7 correlate with substitution matrices (e.g. Blosum) or DMS variant fitness correlations (e.g. correlation between A and C, A and D, etc.). Intuitively it feels like they should.
  
  It would be nice to label panels in figure 7. It also seems notable that predicting alanine substitutions is not the most improved - a brief comment on why would be interesting.
  
  The AS model adds 2x20 parameters to the model for encoding, which is a lot if CCR5 is held out, as there are only a few hundred total independent residues evaluated. While the performance on held out proteins is a good standard, it would be interesting to evaluate the increase from a model selection perspective (BIC/AIC or similar) if possible.
  
  L217 The statement doesn't seem logical to me - if such advanced imputation methods were available, surely they would be better used to impute all substitutions than to just model alanine and then use linear regression to model the rest?
  
  L331-332 The formula used for regularising Spearman's rho makes sense, and can likely be interpreted as a regularizing prior, but we found it hard to understand its provenance and meaning from the reference. A sentence on its content (not just describing that it shrinks estimates) and a more specific reference would be useful for interested readers like ourselves.
  
  L364 It says correlation results were dropped when only one residue was available, whereas in figure legends it says results with fewer than three residues were dropped. Notwithstanding thinking three is maybe too low a cutoff, these should be consistent or clarified slightly if I've misunderstood the meaning.
  
  It would be nice to have a bit more comment on the purpose of the final supplementary section (Replacing AS data with DMS scores of alanine substitutions) - if you have DMS alanine results it seems likely you will have the other measurements anyway.
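The distinction the reviewer draws between leaving out one protein and leaving out one study can be made concrete with a small sketch. This is a minimal illustration under assumed data, not the authors' pipeline: the record fields, study labels and function name are hypothetical, and only the protein names are borrowed from the review.

```python
from collections import defaultdict

def leave_one_protein_out(records):
    """Yield (protein, train, held_out) with every study of one protein held out."""
    by_protein = defaultdict(list)
    for rec in records:
        by_protein[rec["protein"]].append(rec)
    for protein in sorted(by_protein):
        held_out = by_protein[protein]
        # No record of the held-out protein may remain in the training set,
        # otherwise the same residue could appear on both sides of the split.
        train = [rec for rec in records if rec["protein"] != protein]
        yield protein, train, held_out

# Toy records: two studies of the same protein share a residue position.
records = [
    {"protein": "PTEN", "study": "dms_a", "position": 10},
    {"protein": "PTEN", "study": "dms_b", "position": 10},
    {"protein": "CCR5", "study": "dms_c", "position": 42},
]

for protein, train, held_out in leave_one_protein_out(records):
    leaked = any(rec["protein"] == protein for rec in train)
    print(protein, len(train), len(held_out), leaked)
```

Splitting by study instead would put `dms_a` in training while `dms_b` is validated, so position 10 of the same protein would be seen on both sides, which is the leakage described above.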
2. GigaScience 17 Oct 2023
  
  in GigaScience
  
  Abstract
  
  Evaluating the impact of amino acid variants has been a critical challenge for studying protein function and interpreting genomic data. High-throughput experimental methods like deep mutational scanning (DMS) can measure the effect of large numbers of variants in a target protein, but because DMS studies have not been performed on all proteins, researchers also model DMS data computationally to estimate variant impacts by predictors. In this study, we extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results. To evaluate our model, we collected 146 AS datasets, mapping to 54 DMS datasets across 22 distinct proteins. We show that improved model performance depends on the compatibility of the DMS and AS assays, and the scale of improvement is closely related to the correlation between DMS and AS results.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad073 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Joseph Ng
  
  This manuscript explored whether low-throughput alanine scanning (AS) experimental data could complement deep mutational scanning (DMS) to classify the impact of amino acid substitutions in a range of protein systems. The analysis partially confirms this hypothesis, in that it only applies when the functional readouts measured in the two assays are compatible with one another. In my opinion this is an insight that should be highlighted in a publication, and I therefore believe this manuscript deserves to be published. I would just like the authors to clarify and further explore the points below in their manuscript before I recommend acceptance:
  
  In my opinion the most important bit of data curation is the classification of DMS/AS pairs as high/medium/low etc. compatible, and this is the key to the authors' insight that assay compatibility is an important determinant of whether signals in the two datasets can be cross-matched for analysis. The criteria behind this classification are listed in Figure S2, but I feel the wording needs to be more specific. For example, in Figure S2, the authors wrote 'Both assays select for similar protein properties and under similar conditions' - what exactly does this mean? What do the authors consider to be 'similar protein properties'? I could not find a more detailed explanation of this in the Methods section. The authors gave reasons in the spreadsheet in Supp. Table 1 for the labels they give to each pair of assays, but I'm still not exactly sure what they consider to be 'similar'. Is there a more specific classification scheme which is more explicit in defining these 'similarities', e.g. by defining a scoring grid explicitly listing the different levels of 'similarities' of measurable properties, e.g. both thermal stability - score of 3; thermal stability vs protein abundance - 2; thermal stability vs cell survival - 1 (or equivalent; I think the key issue is to provide the reader with a clear guide so they can readily assess the compatibility of the datasets by themselves)?
  
  I would have thought the discrepancy between the DMS and AS scores would differ across structural regions of the protein, e.g. the discrepancy would be larger in ordered regions than in disordered ones, as the protein fold would constrain the types of amino acids tolerable within the ordered segments of the protein. Is this the case in the authors' collection of datasets? If so, does the compatibility of assays modulate this discrepancy?
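The scoring grid proposed in this review could be written down as a small lookup table. The sketch below simply encodes the three example pairings and scores from the reviewer's text; the table name, function name and the choice to return `None` for undefined pairings are assumptions for illustration, not a scheme defined by the authors.

```python
# Explicit compatibility grid keyed by the (unordered) pair of assayed properties.
# Entries and scores are the examples given in the review; a real grid would need
# to be defined and justified by the authors.
COMPATIBILITY_GRID = {
    frozenset(["thermal stability"]): 3,
    frozenset(["thermal stability", "protein abundance"]): 2,
    frozenset(["thermal stability", "cell survival"]): 1,
}

def compatibility_score(dms_property, as_property, grid=COMPATIBILITY_GRID):
    """Return the grid score for a DMS/AS assay pair, or None if undefined."""
    # frozenset makes the lookup symmetric and collapses identical properties.
    return grid.get(frozenset([dms_property, as_property]))

print(compatibility_score("thermal stability", "thermal stability"))
print(compatibility_score("cell survival", "thermal stability"))
print(compatibility_score("protein abundance", "cell survival"))
```

Returning `None` for a pairing that is not in the grid keeps "not yet classified" visibly distinct from a low score, which is the kind of explicitness the comment asks for.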
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.12.14.520494v1
www.biorxiv.org www.biorxiv.org

Genome assembly of the bearded iris Iris pallida Lam

2
1. GigaScience 14 Oct 2023
  
  in GigaByte
  
  **Editor's Assessment:**
  
  Irises, on top of being popular and beautiful ornamental plants, have wider commercial interest due to the many interesting secondary metabolites present in their rhizomes that have value to the fragrance and pharmaceutical industries. Many of these have large and difficult to assemble genomes, and to fill that gap the Dalmatian Iris (Iris pallida Lam.) is sequenced here, using PacBio long-read sequencing and Bionano optical mapping to produce a giant 10 Gbp assembly with a scaffold N50 of 14.34 Mbp. The authors didn’t manage to handle the haplotigs separately or to study the ploidy, but as all of the data is available for reuse, others can explore these questions further. This reference genome should also allow researchers to study the biosynthesis of these secondary metabolites in much greater detail, opening new avenues of investigation for drug discovery and fragrance formulations.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 14 Oct 2023
  
  in GigaByte
  
  Irises are perennial plants, representing a large genus with hundreds of species. While cultivated extensively for their ornamental value, commercial interest in irises lies in the secondary metabolites present in their rhizomes. The Dalmatian Iris (Iris pallida Lam.) is an ornamental plant that also produces secondary metabolites with potential value to the fragrance and pharmaceutical industries. In addition to providing base notes for the fragrance industry, iris tissues and extracts possess anti-oxidant, anti-inflammatory, and immunomodulatory effects. However, study of these secondary metabolites has been hampered by a lack of genomic information, instead requiring difficult extraction and analysis techniques. Here, we report the genome sequence of Iris pallida Lam., generated with Pacific Biosciences long-read sequencing, resulting in a 10.04 Gbp assembly with a scaffold N50 of 14.34 Mbp and 91.8% complete BUSCOs. This reference genome will allow researchers to study the biosynthesis of these secondary metabolites in much greater detail, opening new avenues of investigation for drug discovery and fragrance formulations.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.94), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Baocai Han**
  
  Iris pallida Lam., an ornamental plant, produces secondary metabolites with potential value to the fragrance and pharmaceutical industries, while also possessing anti-oxidant, anti-inflammatory, and immunomodulatory effects. The genome assembly of this species could be helpful for investigations into drug discovery and fragrance formulations.
  
  I have a number of comments that follow:
  
  Line 10 (page 2): “resulting in a 10.04 Gbp assembly with a scaffold N50 of 14.34 Mbp”. I found that the genome size is 13.49 Gb in Table 2 and line 18 (page 7), due to differing haplotigs in the phased assembly. However, I cannot find how this problem was dealt with. I suggest purging the duplicates from the genome using the Purge_Dups pipeline. (Guan D, McCarthy SA, Wood J et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 2020; 36(9): 2896–2898.)
  
  Line 5 (page 8): Why is the number of complete and duplicated BUSCOs so high? Is it due to issues with the genome assembly or to the presence of a particularly high number of repetitive sequences in the species?
  
  There is no reference or website for many of the software tools and pipelines, e.g. the HybridScaffolding pipeline (line 22, page 5), lima (line 2, page 6) and Exonerate (line 11, page 6).
  
  I suggest uploading the genome annotation file, given that genome annotation has already been performed.
  
  **Reviewer 2. Kang Zhang**
  
  Is the language of sufficient quality?
  
  Yes. Though I found several sentences confusing: P2L8 (Is the DNA/RNA extraction particularly difficult for irises?), and P9L5 (wording).
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  Yes. With the following comments.
  
  1. P7L20. The basic stats of the subreads should be introduced before the assembling process.
  
  2. The authors should provide more methodological details about the BUSCO assessment, such as the database version, the mode (genome or protein), etc.
  
  3. I am curious about the genome size enlargement introduced by the scaffolding. Were different haplotigs (from different haplotypes) used for scaffolding, and why? I suppose that only the primary haplotigs should be used.
  
  4. Considering the high proportion of duplicated BUSCO genes, I wonder whether the iris sequenced is a polyploid or not? Please clarify this in the Background.
  
  Additional Comments: Dr. Wong and her colleagues reported a genome assembly of iris using the PacBio technology. Due to the huge genome size, the generated data volume is impressive. Although the quality of the assembly is not so satisfying, it is reasonable considering the genome size and the high heterozygosity, which is commonly found in many flowers. Overall, the methods used in this work are well described, and the data could be accessed. I have only a few minor points regarding the details of the assembling process.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.29.555454v1
Sep 2023
www.biorxiv.org www.biorxiv.org

A Database of Restriction Maps to Expand the Utility of Bacterial Artificial Chromosomes

2
1. GigaScience 25 Sep 2023
  
  in GigaByte
  
  **Editor's Assessment:**
  
  While Bacterial Artificial Chromosome (BAC) libraries were once a key resource for the Human Genome Project, over time they have been rendered relatively obsolete by long-read technologies. In the era of CRISPR-Cas systems, pairing this data with one of the many guide-RNA libraries to find targets for manipulation with CRISPR tools is bringing back the advantages of BACs for genomics. With this in mind, the authors have developed a BAC restriction map database containing the restriction maps for both uniquely placed and insert-sequenced BACs from 11 libraries, covering the recognition sequences of available restriction enzymes, alongside a set of Python functions to reconstruct the database and access it more easily (these were debugged and had improved documentation added during review). The presented data should be valuable for researchers simply using BACs, as well as those working with larger sections of the genome in terms of synthetic genes, large-scale editing, and mapping.
  
  *This evaluation refers to version 1 of the preprint*
  
  Summary
2. GigaScience 21 Sep 2023
  
  in GigaByte
  
  Abstract
  
  While Bacterial Artificial Chromosomes were once a key resource for the genomic community, they have been obviated, for sequencing purposes, by long-read technologies. Such libraries may now serve as a valuable resource for manipulating and assembling large genomic constructs. To enhance accessibility and comparison, we have developed a BAC restriction map database.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.93), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Po-Hsiang Hung**
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The dataset on the FTP server includes all the BAC sequences and the restriction enzyme recognition sites in csv files. However, I could not find the database of pairs of BACs, which have overlaps generated by restriction enzymes that linearize the BACs. The makePairs function gave me an error when I tried running it locally, so I was not able to verify what is in these datasets. Personally, I find this function to be one of the most useful features described in this manuscript.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide
  
  Yes. This manuscript contains the necessary minimal information (Submitting author, Author list, Dataset title, Dataset description, and Funding information)
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. The authors provide their code on GitHub so that researchers can download the datasets and analyze the sequences locally. However, I felt that the descriptions in the readme.md file are often insufficient to reproduce the data presented in the manuscript, especially for researchers with little to no programming experience. The detailed information needed includes examples of how to use each function, the input format, and the location of the output folder/files. I also encountered software version issues during the installation of bacmapping. Please re-test the code in a new environment and state the version of each piece of software. For instance, I found that Python version 3.11 is incompatible with this package while Python version 3.7 is compatible.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. The authors used the Bio.Restriction module from Biopython to get the digestion site information. No extra validation is conducted in this manuscript. Due to the errors I encountered in re-running the code (see details in Any Additional Overall Comments to the Author), an independent method for checking several digestion sites in some BAC clones is suggested. The suggested independent method is to do enzyme digestion on some BAC clones, or to upload some BAC sequences to other software and compare the digestion sites.
  
  In the output files that contain the digestion sites for each enzyme, some of the enzyme digestion sites are either NA or []. What is the difference between the two? If they mean the same thing (no cutting by the enzyme), bugs or other coding errors may be causing this inconsistency. Please check the code again and also verify some of them using the independent methods suggested above. Examples of this issue are the files in maps>sequenced>CEPHB. Here I list two enzymes that show different results in each file:
  
  - 3.csv: Ragl ([]), SchI (NA)
  - 6.csv: EspEI (NA), AccII ([])
  - 13.csv: EcoT22I ([]), Hsp92II (NA)
  - X.csv: PacI ([]), AcIWI (NA)
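As one possible form of the independent software check suggested here, recognition sites can be located with a plain string search that does not rely on Biopython or bacmapping. This is a minimal sketch under stated assumptions: the function name is hypothetical, the sequence is treated as linear, only the recognition sequence of PacI (TTAATTAA) is used, and the positions reported are 1-based starts of the recognition site rather than cut positions, so they would need an enzyme-specific offset before being compared with a digestion map.

```python
def find_sites(sequence, recognition_site):
    """Return 1-based start positions of every occurrence of recognition_site.

    An empty list means the sequence was searched and no site was found.
    Overlapping occurrences are reported. The sequence is treated as linear.
    """
    seq = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        # Advance by one base so overlapping sites are not skipped.
        start = seq.find(site, start + 1)
    return positions

PACI_SITE = "TTAATTAA"  # palindromic, so searching one strand is sufficient

print(find_sites("GGTTAATTAACC", PACI_SITE))
print(find_sites("GGGCCCGGGCCC", PACI_SITE))
```

In a check like this, an empty list has one unambiguous meaning (searched, no site), which is the kind of clear convention the NA versus [] question asks the authors to document.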
  
  Is the validation suitable for this type of data?
  
  No. No validation in this manuscript. See the answer above.
  
  Additional Comments: The authors make a database with enzyme digestion site information of Bac clones to help people to use the Bac clones for further usage. I think it is useful to have this information and also have the code to do further analysis locally. Thus, I think providing a very detailed user manual (or readme.md) is very important to help people use this dataset. Below I summarize the issues I encountered in running the code, along with some suggestions.
  
  Major points:
  
  (1) I tested some bacmapping functions, and I discovered that some functions are not working as intended due to typos/bugs.
  
  - The version of the software is required to help people properly install this package.
  
  - Refining the code and also providing a better user manual would be very helpful for people without a lot of coding experience. The detailed information needed includes examples of how to use each function, the input format, and the location of the output folder/files. Descriptions for some functions in the readme file are not detailed enough and often do not describe what the input needs to be. For example, getCuts() requires ‘row’ as input, but the authors never give a detailed description of what ‘row’ is in the readme file. I had to look in bacmapping.py to understand what ‘row’ is. If a function requires the variable ‘row’, show a few examples of how ‘row’ can be extracted from the proper input file.
  
  - mapPlacedClones() requires an input file (‘/home/eamon/BACPlay/longboys.csv’, line 335) that is located on the author’s local computer and is not available through GitHub.
  
  - Typo in line 814 in getMap(). Should be: name = cloneLine[‘CloneName’]
  
  - Inconsistency in output variable type in getMap() (lines 830 and 851). When local == ‘sequenced’, the output variable is a tuple, which causes issues in downstream functions such as getRestrictionMap() (line 869).
  
  (2) Add pairs of BACs into the dataset.
  
  (3) In the output file of digestion sites of each enzyme, some of the enzyme digestion sites showed NA or [ ]. Please double-check this and explain the differences.
  
  (4) Validation of the digestion map with an independent method is suggested.
  
  Minor points: (1) Adding a title to each column of sequencedStats.csv would make the table easier to understand.
  
  Re-review:
  
  The authors have addressed the majority of my points. The software installation works great after considering version control. The updated readme provides detailed information for each function and its required input variables, and the examples in the Jupyter notebook are a great help for running the code. I did, however, encounter two minor errors when I tested the Ch19_bacmapping_example.ipynb on a Mac system. Please check this and update it.
  
  (1) The .DS_Store file that is automatically generated on a Mac system in the bacmapping/Examples/Ch19_example/maps/placed folder causes an error when running bmap.mapPlacedClones(cpustouse=cpus, chunk_size=chunksize). The same problem happened when I ran bmap.mapSequencedClones(cpustouse=cpus). After I deleted .DS_Store in the folder, the code worked.
  
  Here is the error message when I ran bmap.mapSequencedClones(cpustouse=cpus):
  
  NotADirectoryError: [Errno 20] Not a directory: '/Users/user_nsame/bacmapping/Examples/Ch19_example/maps/sequenced/.DS_Store'
  
  (2) The second error is from running bmap.getRestrictionMap(name,enzyme). I got the error message 'list' object has no attribute 'item'. I was able to run this function after changing maps[enzyme].item() to maps[enzyme] in line 779 of bacmapping.py. I encountered the same error with the drawMap function. I was able to run this function after changing line 847 of bacmapping.py from rmap = maps[nenzyme].item() to rmap = maps[nenzyme].
  
  Here is the error message
  
  AttributeError Traceback (most recent call last)
  Cell In[20], line 5
        3 maps = bmap.getMaps(name)
        4 #print(maps) #this is a big dataframe of all the maps, uncomment to check it out
  ----> 5 rmap = bmap.getRestrictionMap(name,enzyme)
        6 print('Sites in ' + name + ' where ' + enzyme + ' cuts: '+ str(rmap))
        7 plt = bmap.drawMap(name, enzyme)
  
  File ~/miniconda3/envs/bacmapping/lib/python3.11/site-packages/bacmapping/bacmapping.py:779, in getRestrictionMap(name, enzyme)
      777 maps = getMaps(name)
      778 nenzyme, r = getRightIsoschizomer(enzyme)
  --> 779 return(maps[nenzyme].item())
  
  AttributeError: 'list' object has no attribute 'item'
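Both errors reported in this re-review can be guarded against defensively. The sketch below is one possible approach, not the fix the authors adopted: `list_library_dirs`, `as_site_list` and the `OneCell` stand-in are hypothetical names, and "RP11" is only an example folder name. The first helper skips stray files such as .DS_Store when walking a maps folder; the second accepts either a plain list or an object with an `.item()` method (as a single-element pandas Series has).

```python
import os
import tempfile


def list_library_dirs(root):
    """Return the sorted names of real subdirectories under root.

    Hidden entries and plain files (for example the .DS_Store file that
    macOS Finder creates) are skipped instead of being opened as folders.
    """
    return sorted(
        entry.name
        for entry in os.scandir(root)
        if entry.is_dir() and not entry.name.startswith(".")
    )


def as_site_list(value):
    """Return cut sites as a plain list, whatever container they arrive in."""
    # Objects such as a one-element pandas Series expose .item(); a plain
    # Python list does not, which is what raised the AttributeError above.
    if hasattr(value, "item"):
        value = value.item()
    return list(value)


class OneCell:
    """Stand-in for a single-element Series, used only in this demo."""

    def __init__(self, payload):
        self._payload = payload

    def item(self):
        return self._payload


# Demo on a throwaway folder that mimics a maps directory on macOS.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "CEPHB"))
    os.mkdir(os.path.join(root, "RP11"))
    open(os.path.join(root, ".DS_Store"), "w").close()
    found = list_library_dirs(root)

from_list = as_site_list([120, 4570])
from_cell = as_site_list(OneCell([120, 4570]))
```

Filtering on `is_dir()` rather than on the literal name .DS_Store also covers other stray files, at the cost of silently ignoring anything unexpected in the folder.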
  
  **Reviewer 2. Wei Dong**
  
  Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise
  
  Is the validation suitable for this type of data? I am not sure about this. This is not my specialty.
  
  Overall comments: This is a great idea, fully exploring, integrating, and utilizing existing data for new research.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.31.535162v1
www.medrxiv.org

Trumpet plots: Visualizing The Relationship Between Allele Frequency And Effect Size In Genetic Association Studies

2
1. GigaScience 25 Sep 2023
  
  in GigaByte
  
  **Editor's Assessment:**
  
  This work presents a new standardized graphical approach for visualizing genetic associations across a wide range of allele frequencies. These proposed TrumpetPlots have a distinctive trumpet shape, hence the proposed name. With the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects, this view can help to provide new and valuable insights into the genetic basis of traits and diseases, and also help prioritize efforts to discover new risk variants. The tool is provided as a novel R package and R Shiny application and to demonstrate its use the article illustrates the distribution of variant effect sizes across the allele frequency range for over 100 continuous traits available in the UK Biobank. After some problems in testing the package is now available and easy to deploy via CRAN.
  
  *This assessment refers to version 1 of this preprint. *
  
  Summary
2. GigaScience 04 Sep 2023
  
  in GigaByte
  
  Abstract
  
  Recent advances in genome-wide association study (GWAS) and sequencing studies have shown that the genetic architecture of complex diseases and traits involves a combination of rare and common genetic variants, distributed throughout the genome. One way to better understand this architecture is to visualize genetic associations across a wide range of allele frequencies. However, there is currently no standardized or consistent graphical representation for effectively illustrating these results. Here we propose a standardized approach for visualizing the effect size of risk variants across the allele frequency spectrum. The proposed plots have a distinctive trumpet shape, with the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects. These plots, which we call ‘trumpet plots’, can help to provide new and valuable insights into the genetic basis of traits and diseases, and can help prioritize efforts to discover new risk variants. To demonstrate the utility of trumpet plots in illustrating the relationship between the number of variants, their frequency, and the magnitude of their effects in shaping the genetic architecture of complex diseases and traits, we generated trumpet plots for more than one hundred traits in the UK Biobank. To facilitate their broader use, we have developed an R package ‘TrumpetPlots’ and R Shiny application, available at https://juditgg.shinyapps.io/shinytrumpets/, that allows users to explore these results and submit their own data.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.89) and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Clara Albiñana**
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No. Although there are no explicit guidelines for contribution in the manuscript or website, it is true that by placing the project on gitlab it is possible to contribute to the project / open issues.
  
  Is the code executable?
  
  No. Unfortunately, I wasn't able to install the R package. I have now opened an issue on the gitlab page so that it can hopefully get solved.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Yes. It is very common for new R packages to just use devtools for installation.
  
  Is the documentation provided clear and user friendly?
  
  Yes. The requirements for generating a trumpet plot just involve providing a set of GWAS summary statistics with column-specific names, together with the GWAS sample size. This is very common for GWAS summary statistics-based tools. I think it is fine for the R package to require re-naming the columns to fit the format, as one already needs to upload the file into R. However, I find it inconvenient to have to re-save the summary statistics file with different name-columns for the shinyapp tool. Providing e.g. column indexes alone would be much more user-friendly.
  
  Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?
  
  No. I cannot answer this question until I can install the tool.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  Not applicable. There are no existing comparable tools.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  Yes. I can see there is a toy dataset included with the R package.
  
  Additional Comments:
  
  I think the manuscript is very clear and good at making the point of the utility of the software. The proposed trumpet plots are very visually appealing and can be useful to characterise the genetic variation of diverse phenotypes. The novelty of the trumpet plots, as compared to previously proposed effect size vs. allele frequency plots, is the use of positive and negative effect sizes, making it look like a trumpet. I also appreciate the style decisions in the standard generated plots, with a nice visually-appealing color scheme and design.
  
  On the use of the software, I have focused my testing on the R package, which I was not able to install. The shinyapp is very useful for visualising the existing, pre-computed trumpet plots, but I do not find it very useful for generating user-uploaded summary statistics for the reasons I mentioned above. Another comment on the ShinyApp is that I appreciate the possibility to download the plots but it would be very useful to include the name of the visualized phenotype as the plot title, for example, to avoid confusion when downloading multiple plots.
  
  I also found an incorrect sentence in the abstract, which I think should be reversed: "The proposed plots have a distinctive trumpet shape, with the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects".
  
  **Reviewer 2. Wentian Li**
  
  Is the documentation provided clear and user friendly?
  
  No. Many aspects of Fig.1 are not explained.
  
  Overall Comments: A plot with allele frequency on the x axis and effect size (e.g. odds ratio) on the y axis is a very common display of the contribution from both common and rare alleles to genetic association. A schematic form of this plot is on practically everybody's presentation slides when introducing this topic (for an example, see Science (23 Nov 2012), vol 338(6110), pp.1016-1017). Considering how many people are already familiar with this type of plot, I feel that very little new is added in this paper: maybe only a new name ("trumpet"), and/or the power lines. The other methods contributions (log-x, one variant per LD, avoiding gene-level statistics) are rather straightforward. People without experience with "shiny" (R package) can still use ggplot2 or plot in R to get the same result. Generally speaking, I think the paper is weak, though OK as a program/package announcement.
  
  Major comments: * I think the trumpet shape (increase of "effect size" for rare variant) is probably a direct consequence of using odds-ratio as a measure of effect size. If the allele frequency in normal population is p0, that in disease population is p1, [p1/(1-p1)]/[p0/(1-p0)] ~ p1/p0 tends to be large for small p0's, simply because the denominator is small. On the other hand, if population attributable risk (p0(RR-1)/(1+p0(RR-1))) is used as the y-axis, I am uncertain what the shape of the plot would be.
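The arithmetic behind this comment can be checked with a short sketch. It evaluates the two formulas quoted by the reviewer, the allelic odds ratio and the population attributable risk, for one rare and one common allele. The frequencies and the relative risk of 3 are illustrative values chosen for this example, not numbers from the manuscript, and the function names are hypothetical.

```python
def odds_ratio(p_case, p_control):
    """Allelic odds ratio [p1/(1-p1)] / [p0/(1-p0)]."""
    return (p_case / (1 - p_case)) / (p_control / (1 - p_control))


def population_attributable_risk(p_control, relative_risk):
    """PAR = p0(RR-1) / (1 + p0(RR-1))."""
    excess = p_control * (relative_risk - 1)
    return excess / (1 + excess)


# Illustrative frequencies only (assumed for this example).
rare_or = odds_ratio(p_case=0.003, p_control=0.001)
rare_ratio = 0.003 / 0.001  # the p1/p0 approximation for a rare allele

# Same relative risk, very different allele frequencies.
rare_par = population_attributable_risk(p_control=0.001, relative_risk=3.0)
common_par = population_attributable_risk(p_control=0.3, relative_risk=3.0)

print(f"OR = {rare_or:.4f}, p1/p0 = {rare_ratio:.4f}")
print(f"PAR rare = {rare_par:.5f}, PAR common = {common_par:.3f}")
```

For the rare allele the odds ratio is almost exactly p1/p0, while its attributable risk is far smaller than that of the common allele with the same relative risk, which is why the choice of y-axis could change the shape of the plot.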
  
  A risk allele has these pieces of information:
  
  1. allele frequency,
  
  2. effect size (e.g. odds ratio),
  
  3. type-I error/p-value,
  
  4. type-II error/power.
  
  The plot in this paper shows #1 vs #2, with #4 added as an extra. In another publication with a proposal to plot genetic association results (Comp Biol. and Chem. (2014), 48:77-83 doi: 10.1016/j.compbiolchem.2013.02.003), #2 is plotted against #3, with #1 as an added extra. I'm sure using other combinations could lead to other types of plots. The authors should discuss/compare these possibilities.
  
  Minor comments: In Fig.1, the size of the dots, the brown vs cyan color, the discontinuity of scatter dots around 0.01, are not explained.
  
  Re-review:
  
  I have read the authors' response and I'm mostly satisfied. Only two minor comments:
  
  * Witte 2014 Nature Rev. Genet. article summarizes the point I tried to make well. I understand that rare variants should have a relatively higher effect from an evolutionary perspective, but since these are rare, their individual or even collective contribution to a disease in the population is still small. A casual reader may not realize this point and I think it would be helpful to cite Witte's article.
  
  * My minor comment on Fig.1 is still not addressed: there seem to be more points on the right side of the p=0.01 line than on the left side. Why this discontinuity? (The added text in the revision is about the color and size of the dots, not about this discontinuity.)

Tags

Summary

Annotators

GigaScience

URL

medrxiv.org/content/10.1101/2023.04.21.23288923v1
www.biorxiv.org

Haplogenome assembly reveals interspecific structural variation in Eucalyptus hybrids

2
1. GigaScience 04 Sep 2023
  
  in GigaScience
  
  De novo
  
  Xupo Ding:
  
  1. The CDS and protein sequences could not be extracted from the masked.fasta file with the gff3 file when verifying the accuracy of gene loci and related proteins. The extraction software is gffread in cufflinks 2.1.1. Please confirm the final assembly file that will be uploaded to GigaDB.
  
  2. Please confirm the accuracy of gene prediction, especially for the Ks calculation.
  
  3. Before repeat masking with RepeatMasker, the final sequences were scanned with LTR_retriever and the LAI index was generated in this folder. The LAI values were 20.55 and 18.06, which would classify the haplogenome assemblies as reference or gold level. Please describe the LAI values after the BUSCO completeness in the revised manuscript.
  
  4. The percentages of the two largest LTR subfamilies, Gypsy and Copia, were not presented in Supplementary Table S5.
  
  5. Two Eucalyptus genomes have been published (Nature 2014; GigaScience 2020), and neither analysed LTR insertion times in detail. The insertion times of all TEs, Gypsy and Copia would highlight this manuscript, especially as the basic data have been presented with *.list in the LTR_harvest and LTR_retriever scan.
  
  6. Were the genes specific to each haplogenome classified? Which pathways or GO terms are they enriched in?
  
  7. Some SVs may be associated with plant traits. The genes located in the regions of the different SV types should be further identified and tested for GO and KEGG enrichment.
  
  8. "Syntenic gene pairs between the E. grandis and E. urophylla haplogenomes were identified using a python version of MCScan, JCVI v1.1.18." Syntenic gene pairs in Figure 4 seemed to come only from JCVI, not from MCScan.
  
  9. The reference citations should be consistent; for example, Candotti et al. in the Genome scaffolding section should be revised.
  
  10. The language should be improved and edited by an academic editor.
2. GigaScience 04 Sep 2023
  
  in GigaScience
  
  Summary
  
  Chao Bian: This study, entitled "Haplogenome assembly reveals interspecific structural variation in Eucalyptus hybrids", reports two haplotypes from Eucalyptus grandis and E. urophylla. Both genomes are of high quality and high completeness. Nevertheless, why not directly and separately sequence Eucalyptus grandis and E. urophylla, and assemble each genome separately? In this way, the authors would not need to perform so many assembly steps to distinguish the haplogenomes.
  
  On the other hand, the authors have written a large paragraph to show the SVs and SNPs between both Eucalyptus species. However, the authors only showed the number of SVs and SNPs, but did not show any relationship between the SVs and biological characters. Could some SVs and SNPs that are involved in or impact some genes explain some biological differences between Eucalyptus grandis and Eucalyptus urophylla? In my view, only showing the number of SVs and SNPs is indeed fruitless for the wide interest of this study. Some biological stories should be reported in a genome study.
  
  Please provide new figures with higher resolution. These figures are too unclear.
  
  Please use the newer version of BUSCO, v5.2.2, and indicate the library used.
  
  What is the QUAST assessment result in this study?
  
  The English language of this paper needs to be substantially polished. Too many spelling errors and other mistakes appear in the manuscript.
  
  Some minor suggestions:
  
  * The decimal places should be uniform, such as "(567 Mb and 545 Mb) to 97.9% BUSCO completion" and "scaffold N50 of 43.82 Mb and 42.45 Mb for the E. grandis and E. urophylla haplogenomes, respectively".
  
  * In 'All scripts used in this study is available on github.', 'is' should be 'are'.
  
  * The language of this sentence should be revised: "Illumina short-reads were used for k-mer based genome size estimation was performed using Jellyfish v2.2.6 (Jellyfish, RRID:SCR_005491) [25] for 21-mers and visualised with GenomeScope v2.0"
  
  * For the scaffolding step, why did the authors remove all contigs smaller than 3 kb?
  
  * 'The predicted gene space was' should be 'The predicted gene spaces were'.
  
  * For "a contig N50 of 3.91 Mb 1." and 'was greater than 88.0% 2', what is the meaning of the final '1' and '2' in these sentences?
  
  * In the sentence 'Approximately 3.3 μg of HMW DNA from was used without', 'from' what?
  
  * "a BUSCO completeness score of at least 95.3% was obtained for contigs anchored to one of the eleven chromosomes.": for one of the eleven chromosomes? Why were contigs only anchored to one chromosome?
  
  * Revise 'markers each.,'.
  
  * "BUSCO completeness scores of 94.6% and 95.8% was obtained", 'was' should be 'were'.
  
  * "Although there is a greater number of local variants compared to SVs", 'there is' should be 'there are'.
  
  * "respectively, Supplementary Table S3)" revised to 'respectively, (Supplementary Table S3)'.
  
  * 'Mbp' revised to 'Mb'.
  
  * 'assemblies was' should be 'assemblies were'.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.17.501336v1
www.biorxiv.org

GADMA2: more efficient and flexible demographic inference from genetic data

2
1. GigaScience 04 Sep 2023
  
  in GigaScience
  
  Background
  
  Ilan Gronau: This manuscript describes updates made to GADMA, which was published two years ago. GADMA uses likelihood-based demography inference methods as likelihood-computation engines, and replaces their generic optimization technique with a more sophisticated technique based on a genetic algorithm. The version of GADMA described in this manuscript has several important added features. It supports two additional inference engines, more flexible models, additional input and output formats, and it provides better values for the hyper-parameters used by the genetic algorithm. This is indeed a substantial improvement over the original version of GADMA. The manuscript clearly describes the different added features to GADMA, and then demonstrates them with a series of analyses. These analyses establish three main things: (1) they show that the new hyper-parameters improve performance; (2) they show how GADMA can be used to compare performance of different approaches to calculate data likelihood for demography inference; (3) they showcase new features of GADMA (supporting model structure and inbreeding inference). Overall, the presentation is very clear and the results are interesting and compelling. Thus, despite being a publication about a method update, it shows substantial improvement, provides interesting new insights, and will likely lead to expansion of the user base for GADMA.
  
  The only major comment I have is about the part of the study that optimizes the hyperparameters. The hyper-parameter optimization is a very important improvement in GADMA2. The setup for this analysis is very good, with three inference engines, four data sets used for training and six diverse data sets used for testing. However, because of complications with SMAC for discrete hyperparameters, the analysis ends up considering six separate attempts. The comparison between the hyper-parameters produced by these six attempts is mostly done manually across data sets and inference engines. This somewhat defeats the purpose of the well-designed setup. Eventually, it is very difficult for the reader to assess the expected improvement of the final suggested values of hyperparameters (attempt 2) relative to the default ones. I have two comments/suggestions about this part.
  
  First, I'm wondering if there is a formal way to compare the eventual parameters of the six attempts across the four training sets. I can see why you would need to run SMAC six separate times to deal with the discrete parameters. However, why do you not use the SMAC score to compare the final settings produced by these six runs?
  
  Second, as a reader, I would like to see a single table/figure summarizing the improvement you get using whatever hyper-parameters you end up suggesting in the end compared to the default setting used in GADMA1. This should cover all the inference engines and all the data sets somehow in one coherent table/figure. Using such a table/figure, you could report improvement statistics, such as the average increase in log-likelihood, or average decrease in convergence times. These important results get lost in the many improved figures and tables.
  
  These are my main suggestions for revisions of the current version. I also have some more minor comments that the authors may wish to consider in their revised version, which I list below.
  
  Introduction:
  
  para 2: the survey of demography inference methods focuses on likelihood-based methods, but there is a substantial family of Bayesian inference methods, such as MPP, Ima, and G-PhoCS. Bayesian methods solve the parameter estimation problem by Bayesian sampling. I admit that this is somewhat tangential to what GADMA is doing, but this distinction between likelihood-based methods and Bayesian methods probably deserves a brief mention in the introduction.
  
  para 2,3: you mention a result from the original GADMA paper showing that GADMA improves on the optimization methods implemented by current demography inference methods. Readers of this paper might benefit from a brief summary of the improvement you were able to achieve using the original version of GADMA. Can you add 2-3 sentences providing the highlights of the improvement you were able to show in the first paper?
  
  para 3: The statement "GADMA separates two regular components" is not very clear. Can you rephrase to clarify?
  
  Materials and methods - Hyper-parameter optimization:
  
  I didn't fully understand what you use for the cost function in SMAC here. Seems to me like there are two criteria: accuracy and speed. You wish the final model to be as accurate as possible (high log likelihood), but you want to obtain this result with few optimization iterations. Can you briefly describe how these two objectives are addressed in your use of SMAC? It's also not completely clear how results from different engines and different data sets are incorporated into the SMAC cost. Can you provide more details about this in the supplement?
  
  para 2: "That eliminate three combinations" should be "This eliminates three combinations".
  
  para 3: "Each attempt is running" should be "Each attempt ran".
  
  para 3: "We take 200 × number of parameters as the stop criteria". Can you clarify? Does this mean that you set the number of GADMA iterations to 200 times the number of demographic model parameters? Why should it be a linear function of the number of parameters? The following text explains the justification, but
  
  Table 1: I would merge Table S2 with this one (by adding the ranges of all hyper-parameters as a first column). It's important to see the ranges when examining the different selections.
  
  Materials and methods - Performance test of GADMA2 engines:
  
  para 2: "ROS-STRUCT-NOMIG" should be "DROS-STRUCT-NOMIG". Also, "This notation could be read" - maybe replace by "This notation means" to signal that you're explaining the structure notation.
  
  Para 4 (describing comparisons for momi on Orangutan data): "ORAN-NOMIG model is compared with three …". You also consider ORAN-STRUCTNOMIG in the momi analysis, right?
  
  Results - Performance test of GADMA2 engines:
  
  Inference for the Drosophila data set under model with migration: you mention that the models with migration obtain lower likelihoods than the models without migration. You cannot directly compare likelihoods in these two models, since the likelihood surface is not identical. So, I'm not sure that the fact that you get higher likelihoods in the models without migration is a clear enough indication for model fit. The fact that the inferred migration rates are low is a good indication for that. It also seems like despite converging to models with very low migration rates, the other parameters are inferred with higher noise. For example, the size of the European bottleneck is significantly increased in these inferences compared to that of the NOMIG. So, potentially the problem here is that more time is required for these complex models to converge.
  
  Inference for the Drosophila data set under structured model (2,1): the values inferred by moments and momentsLD appear to neatly fit the true values. However, it is not straightforward to compare an exponential increase in population size to an instantaneous increase. Maybe this can be done by some time-averaged population size, or the average time until coalescence in the two models? This will allow you to quantify how well the two exponential models fit the true model with instantaneous increase.
  
  Inference for the Orangutan data set under structured model (2,1) without migration: you argue that a constant population size is inferred for Bor by moments and momi because of the restriction on population sizes after the split. You base this claim on a comparison between the log-likelihoods obtained in this model (STRUCT-NOMIG) and the standard model (NOMIG) in which you add this restriction. I didn't fully understand how you can conclude from this comparison that the constant size inferred for Bor is due to the restriction on the initial population size after the split. I think what you need to do to establish this is run the STRUCT model without this restriction and see that you get exponential decrease. Can you elaborate more on your rationale? A detailed explanation should appear in the supplement and a brief summary in the main text.
  
  Inference for the Orangutan data set with models with pulse migration: This is a nice result showing that the more pulses you include, the better the estimates become. However, your main example in the main text uses the inferred migration rates. This is a poor example, because migration rates in a pulse model cannot be compared to rates in a continuous model. If migration is spread along a longer time range, then you expect the rates to decrease. So, there is no expectation of getting the same rates. You do expect, however, to get other parameters reasonably accurate. It seems like this is done with 7 pulses, but not so much with one pulse. This should be the main focus of the discussion of these results.
  
  Results - inference of inbreeding coefficients:
  
  When you describe the results you obtained for the cabbage data set, you say "the population size for the most recent epoch in our results is underestimated (6 vs 592 individuals) for model 1 without inbreeding and overestimated (174,960,000 vs. 215,000 individuals) for model 2 with inbreeding". The usage of under/overestimated is not ideal here, because it would imply that the original dadi estimates are more correct. You should probably simply say that they are lower/higher than estimates originally obtained by dadi. Or maybe even suggest that the original estimates were over/underestimated?
  
  Supplementary materials:
  
  Page 4, para 2: "Figure ??" should be "Figure S1".
  
  Page 4, para 4: Can you clarify what you mean by "unsupervised demographic history with structure (2, 1)"?
  
  Page 22, para 2: "Compared to dadi and moments engines momentsLD provide slightly worse approximations for migration rates". I don't really see this in Supplementary Table S16. Estimates seem to be very similar in all methods. Am I missing anything? You make the same statement again in the STRUCT-MIG model (page 23).
  
  Page 22, para 4: "The best history for the ORAN-NOMIG model with restriction on population sizes is -175,106 compared to 174,309 obtained for the ORAN-STRUCT-NOMIG mod". There is a missing minus sign before the second log likelihood. You should also specify that this refers to the moments engine. Also see comment above about this result.
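One way to read the suggestion above about comparing exponential and instantaneous growth through a "time-averaged population size" is the harmonic mean of N(t) over the epoch, which is the average that governs coalescence rates. The sketch below is an assumption about what such a comparison could look like, not a method from the manuscript or from GADMA; the function names and the example sizes are hypothetical. For N(t) growing exponentially from `n_start` to `n_end`, the harmonic mean has the closed form `n_start * n_end * ln(n_end / n_start) / (n_end - n_start)`, which the code cross-checks numerically.

```python
import math


def harmonic_mean_size_exponential(n_start, n_end):
    """Closed-form harmonic mean of an exponentially changing population size."""
    if n_start == n_end:
        return float(n_start)
    return n_start * n_end * math.log(n_end / n_start) / (n_end - n_start)


def harmonic_mean_size_numeric(n_start, n_end, duration, steps=100_000):
    """Midpoint-rule estimate of duration / integral(1 / N(t) dt)."""
    rate = math.log(n_end / n_start) / duration
    dt = duration / steps
    integral = sum(
        dt / (n_start * math.exp(rate * (i + 0.5) * dt)) for i in range(steps)
    )
    return duration / integral


# Hypothetical epoch: growth from 1,000 to 10,000 over 500 generations.
closed_form = harmonic_mean_size_exponential(1_000, 10_000)
numeric = harmonic_mean_size_numeric(1_000, 10_000, duration=500)

# An instantaneous jump to 10,000 at the start of the same epoch has a
# constant size, so its harmonic mean is simply 10,000.
instantaneous = harmonic_mean_size_exponential(10_000, 10_000)
```

Comparing `closed_form` with `instantaneous` puts the two growth models on a common scale; whether that scale is the right one for judging model fit is a separate question for the authors.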
2. GigaScience 04 Sep 2023
  
  in GigaScience
  
  Abstract
  
  Ryan Gutenkunst: In this paper, the authors present GADMA 2, an update of their population genomic inference software GADMA. The authors' software serves as a driver for other population genomics software, enabling a consistent user interface and a different parameter optimization approach. GADMA 2 extends GADMA by adding two new inference engines (momi2 and momentsLD), hyperparameter optimization for the genetic algorithm, demes visualization, modeling of selection, dominance, and inbreeding, and a new method for specifying model structures. In this paper, the authors show that their optimized genetic algorithm is somewhat more effective than the original hyperparameter settings. They also compare among inference engines, finding some differences in behavior. Lastly, they compare with dadi itself in two models with inbreeding, finding parameter sets with better likelihoods than those previously published.
  
  GADMA has already found some use in the population genomics community, and GADMA 2 is a substantial update. The manuscript describes the updates in good detail and demonstrates the effectiveness of GADMA 2 on two real-world data sets. Overall, this is a strong contribution, and we have few major concerns.
  
  Major Technical Concerns:
  
  1) The authors claim to now support inference of selection and dominance. But what they support is very limited and not very biological. In particular, they currently support inferences which assume a single selection and dominance coefficient for the entire data set (as in Williamson et al. (2005) PNAS). In reality, any AFS will include sites with a variety of selection coefficients, usually summarized by a distribution of fitness effects. Since Keightley and Eyre-Walker (2007) Genetics, this has been the standard for inferring selection from the AFS. The authors should be clear about the limitations of what they have implemented.
  
  2) Figure 4 shows that optimization runs using GADMA 2 tend to find better likelihoods than bare dadi optimization runs. But the advice for using dadi or moments is to run multiple optimizations and take the best likelihood found, with some heuristic for assessing convergence. So most users would not (or at least should not) stop with the result of a single dadi optimization run. It does seem that GADMA 2 reduces the complexity of assessing convergence between multiple dadi optimization runs. But another important consideration is computational cost. (At an extreme, if each dadi run was 100 times faster than a single GADMA 2 run, then the correct comparison would be between the best of 100 dadi runs and a single GADMA 2 run.) It is not clear from the paper how the 100 GADMA 2 runs compare to the 100 dadi runs in terms of computational cost. It would be very helpful to have a table or some text describing the average computational cost (in CPU hours) of those runs.
  
  Major Writing / Presentation Concerns:
  
  1) Bottom of page 5: The authors are sharing the results of their hyperparameter optimizations from their own server, with uncertain lifetime. These results should be moved to an archival service such as Dryad.
  
  Minor Technical Concerns:
  
  1) The authors note that the DROS-MIG models had worse likelihoods than the DROS-NOMIG models. Since these are nested models, the DROS-MIG model must mathematically have a global optimum likelihood at least as good as that of the DROS-NOMIG model. It would be worth pointing out that the likelihoods they found indicate a failure of the optimization algorithms. The authors should also present the DROS-MIG model results in a supplementary table.
  
  2) The Godambe parameter uncertainties in Tables S20 and S21 are pretty extreme, sometimes 10^-13 to 10^12. This may be due to instability of the Godambe approximation versus step size. In Blischak et al. (2020) Mol Biol Evol, the authors tried several step sizes and sought consistent results between them (Tables S1-S4). We suggest the authors take that approach here.
  
  Minor Writing / Presentation Concerns:
  
  1) The authors claim that "GADMA does not require model specification". However, it seems that the GADMA "structure model" describes a different and perhaps broader way to specify demographic models, rather than completely eliminating model specification.
  
  2) The authors use the term "inference engine" for the four tools GADMA 2 builds upon. But to us, the act of inference includes parameter optimization. In this case, these tools are not being used for the inference itself, but rather to calculate the (composite) likelihood of the data. Perhaps a better term would be "likelihood calculator"?
  
  3) The authors suggest engine-specific hyperparameter optimization as a future goal. But the optimal hyperparameters are also likely to be model specific. (For example, 2- versus 4-population models might benefit from different optimization regimes.) Can the authors comment on this?
  
  Writing Nitpicks:
  
  1) Abstract: "optimization algorithms used to find model parameters sometimes turn out to be inefficient" → vague: more details on why/how they are inefficient would be helpful.
  
  2) Introduction: "Inference of complex demographic histories… in the population's past." needs citation.
  
  3) Page 2: "parameter to infer, for example, all migration" is a comma splice and should be split into two sentences.
  
  4) Supplement page 4: Figure ?? reference is broken.
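  The equal-compute comparison raised in Major Technical Concern 2, and the nested-model check from Minor Technical Concern 1, can be illustrated with a minimal Python sketch. All function names, per-run costs, and log-likelihood values below are hypothetical placeholders; they are not figures from the manuscript and not calls from dadi, moments, or GADMA 2.

```python
def runs_within_budget(budget_cpu_hours, cost_per_run_cpu_hours):
    """Number of whole optimization runs that fit in a CPU-hour budget."""
    if cost_per_run_cpu_hours <= 0:
        raise ValueError("cost per run must be positive")
    return int(budget_cpu_hours // cost_per_run_cpu_hours)


def best_within_budget(log_likelihoods, cost_per_run_cpu_hours, budget_cpu_hours):
    """Best log-likelihood among the runs affordable within the budget.

    Runs are taken in the order given; returns None if no run fits.
    """
    n = runs_within_budget(budget_cpu_hours, cost_per_run_cpu_hours)
    affordable = log_likelihoods[:n]
    return max(affordable) if affordable else None


def nested_optimization_failed(ll_restricted, ll_general, tol=1e-6):
    """True if the more general model has a worse best log-likelihood than
    the model nested inside it, which signals an optimization failure."""
    return ll_general < ll_restricted - tol


# Hypothetical numbers: a cheap optimizer (1 CPU hour per run) versus an
# expensive one (100 CPU hours per run), compared at a 100 CPU-hour budget.
cheap_runs = [-1200.5, -1180.2, -1175.9, -1190.0]
expensive_runs = [-1174.0]
budget = 100

best_cheap = best_within_budget(cheap_runs, 1, budget)
best_expensive = best_within_budget(expensive_runs, 100, budget)
print(best_cheap, best_expensive)

# A general model scoring below its nested special case is flagged.
print(nested_optimization_failed(ll_restricted=-1174.0, ll_general=-1180.0))
```

  Costs are kept as integers here on purpose: floor division with floating-point costs such as 0.01 can undercount the number of affordable runs.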
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.14.496083v4
www.biorxiv.org www.biorxiv.org

KGML-xDTD: A Knowledge Graph-based Machine Learning Framework for Drug Treatment Prediction and Mechanism Description

2
1. GigaScience 04 Sep 2023
  
  in GigaScience
  
  Background
  
  Michel Dumontier: This paper describes KGML-xDTD, a knowledge graph-based ML framework to predict and explain potential applications of drugs. The main approach is the use of graph reinforcement learning to predict drug-disease pairs and provide a knowledge-based path as a potential mechanism of action. The method is evaluated against other approaches, with various data partitioning strategies, by comparison to a manually curated database of mechanisms of action, and in two use cases. The paper is well written, easy to read, and makes a contribution to the scientific literature. Accurate prediction of drug uses remains an important and challenging problem in biomedical informatics. The novelty of the approach is to use graph reinforcement learning to achieve state-of-the-art performance for the problem, and it is also able to generate plausible paths within a knowledge graph to serve as mechanistic explanations. There are some limitations to the work that should be addressed. These include:
  
  1) The baseline models (GAT & GraphSAGE+SVM) only use a small subset of drug-disease replacements. The authors indicate that the smaller subset is necessary owing to time performance constraints. However, there is no discussion of the possible impact of the reduced subset on any aspects in relation to their method.
  
  2) The approach only evaluates 3-hop KG paths, which is 1/7 of what is available in DrugMechDB. What is the quality/performance impact of choosing longer paths? Wouldn't the number of biologically reasonable paths to explain a prediction be substantially reduced? I worry that this is cherry-picking the dataset to show good performance for the only case (3-hop) that it is capable of (while criticizing other methods as not being performant).
  
  3) The authors use RepoDB as one of their sources, and specifically use the "withdrawn" set as true negatives. However, most withdrawn tags are linked to reasons other than safety or efficacy of the clinical trial. As such, it is not clear that this set is a good true negative set.
  
  4) The authors use MyChem as a resource for drug indications/contraindications. However, MyChem is not an original source; it aggregates other resources. The authors should properly identify the source of "human curated annotations".
  
  5) I commend the authors for their evaluation, which uses a number of different train/test strategies and compares against different methods. However, as far as I can see, the train/test strategy does not adequately remove similar true drug-disease pairs from the training/test set. That is to say, there are many drugs that are approved for very similar conditions, and therefore it becomes somewhat trivial to predict these (this problem is highlighted in the 2011 PREDICT paper by Assaf Gottlieb). More work should be done here to report an accuracy based on more stringent evaluation criteria.
  
  6) It's unclear to me that the 124k diseases are real (diagnosable) diseases that drugs could be prescribed for. Inflating the number of possible (but implausible) diseases might augment the performance, but contributes nothing to medicine. Please elaborate.
  
  7) Figures 5 and 6 are difficult to read.
  
  8) It's nice to see the 2 use cases in the paper. However, the extracted subgraphs are quite different from the DrugMechDB MOA paths. So there's something to be said about the succinctness of the DrugMechDB MOA paths, which might prove to be a better training set for some explanation algorithm than one that is independently generated.
  
  Overall, this is a nice paper with an interesting approach.
2. GigaScience 04 Sep 2023
  
  in GigaScience
  
  ABSTRACT
  
  Yuansheng Liu: The paper entitled "KGML-xDTD: A Knowledge Graph-based Machine Learning Framework for Drug Treatment Prediction and Mechanism Description" proposes KGML-xDTD, a two-module, knowledge graph-based machine learning framework. The authors construct a large knowledge graph for the training of the model. The model is divided into two modules, one for drug repurposing prediction and the other for mechanism of action prediction. Both modules achieve good results compared with the existing baselines. Here are my specific points:
  
  (1) It is mentioned on page 6 that the data are classified into three categories, while other data are classified into two categories. How did you exclude the "unknown" category and adjust the results?
  
  (2) The Drug Repurposing Prediction model and the Mechanism of Action Prediction model seem to be two separately trained models. I cannot find evidence of multi-task training in the text. If the models are trained separately, which model do the evaluation metrics refer to? If they are trained together, the model section should be written more clearly.
  
  (3) The introduction only discusses existing Drug Repurposing Prediction models; it does not describe existing Mechanism of Action Prediction models.
  
  (4) The baselines seem to be SOTA Drug Repurposing Prediction models, but the best performance of the work is in Mechanism of Action Prediction.
  
  (5) The data set appears to selectively choose drug-disease pairs with intermediate paths. But if the drug and disease are not connected in the network, how does the Drug Repurposing Prediction model perform?
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.11.29.518441v2
Aug 2023
www.biorxiv.org www.biorxiv.org

Metaphor - A workflow for streamlined assembly and binning of metagenomes

2
1. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyse genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with their own set of customisable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly (combining the short input reads into longer, contiguous fragments, or contigs) and binning (clustering these contigs into individual genome bins). Both processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully-automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data, and by combining multiple binning algorithms with a bin refinement step to achieve high quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets, and the impact of available assembly and binning strategies on the final results. The workflow is freely available at https://github.com/vinisalazar/metaphor.
  
  **Reviewer 2. Po-Yu Liu **
  
  Metaphor is a workflow with high completeness for short-read-based metagenomic analysis. I look forward to its compatibility with long-read platforms (ONT and PacBio). This work is worth publishing. However, it is still a toolkit that requires bioinformatics knowledge and skills. If Metaphor could be integrated into a web-based platform, such as Galaxy or KBase, it would be more user-friendly for many more users.
2. GigaScience 21 Aug 2023
  
  in GigaScience
  
  This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad055), and the reviews are published under the same license. These are as follows.
  
  **Reviewer 1. Thomas Brüls **
  
  The authors present a Snakemake-based workflow to automate and chain the main computational ingredients (assembly and binning) of genome-centric metagenomics. They developed a technically sound tool for this purpose, and by itself it is certainly valuable to the research community and worthy of publication. However, even if the article is cast as a technical note (hence with an emphasis on the design, implementation and assessment of the tool), I feel that a more thorough discussion of both its abilities and inabilities (e.g. strain resolution, detection of low-abundance organisms, identification of virus bins, etc.) would be worthwhile for a more general audience. By the same token, a deeper discussion of some of the results obtained with their tool (see below) would be of interest and would also illustrate useful use cases. I would suggest the following modifications/additions:
  
  - The experiments with the strain madness dataset suggest that the genomes (or fragments thereof, i.e. the bins) resolved should be viewed as "species" genomes, or composite genomes possibly originating from multiple strains. If so, do the authors think this represents a hard limit to the assembly + binning approach, or could further existing tools (e.g. performing variant detection on top of cross-assembly before the binning step) be integrated or developed in the future for strain resolution (i.e. to identify strains not dominant in any sample)?
  - Relatedly, a simple summary of the number of individual strains recovered in individual bins for the strain madness experiment would be interesting.
  - Another issue that would be worth discussing, in my opinion, is the impact of genome abundance on the recovery of the corresponding bins and their quality. The platform developed by the authors appears to be well suited for this kind of analysis, and the results would be of both theoretical and practical interest. To put it simply, what is the minimal initial coverage of genomes required in order for them to be recovered in bins of a given size and quality?
  - Remark: these two issues (strain-level diversity and individual strain genome abundances) likely interact to limit bin resolution, and this could be mentioned by the authors.
  - The data presented by the authors suggest that the MetaBAT binning engine significantly outperforms the other two tools (CONCOCT and VAMB, which are both widely used); see e.g. Figure 2. What would account for that, and do the authors think this is a general observation (i.e. beyond the specific CACB setting or marine metagenome shown in Fig. 2)?
  - A bin refinement step (based on DAS Tool and dereplication) is frequently mentioned but should be described in more detail (including a precise definition of the bin quality metric used).
  
  Further, rather minor comments:
  
  - In the abstract, when mentioning "technical challenges associated with...", it would be worth mentioning that algorithmic challenges are present as well.
  - In the introduction: "It is hypothesised that pooled assembly and binning may lead to improved results when analysing communities with high genetic diversity, and to poorer results when there is a high level of intraspecies/strain-level diversity". I would assume there are many instances in the real world that present both high inter-species and high intra-species genetic diversity. What then?
  - In the future directions, the authors mention the identification of eukaryotic and viral contigs and bins, and could briefly elaborate on how this could be done properly.
  - The sentence "In summary, our assessment of ..." at the end of the manuscript appears to have a syntactic problem.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.09.527784v1
www.biorxiv.org www.biorxiv.org

Hetnet connectivity search provides rapid insights into how two biomedical entities are related

2
1. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Hetnets, short for “heterogeneous networks”, contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open source implementation of these methods in our new Python package named hetmatpy.
  
  **Reviewer 2. Paolo Provero **
  
  In this work, Himmelstein and collaborators introduce a statistically controlled way of extracting significant node pairs in heterogeneous networks (hetnets) without relying on a ground truth and related training. The method "explains" why two nodes are significantly connected by extracting the metapaths most responsible for the enrichment. This is based on computing a null distribution of the DWPC, which allows assigning a P-value to each metapath joining two nodes, and then visualizing the individual paths responsible for the enrichment. The method is novel and significant, and can in principle be applied to many hetnets, in life sciences and beyond, when a ground truth is not available or not desirable as it would introduce bias. The software tools developed appear to be readily available to other researchers.
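  
  The idea of assigning a P-value from a null distribution of DWPC values can be sketched with a simple empirical P-value over values computed on permuted networks. This is an illustrative simplification only: the manuscript's Methods refer to a gamma-hurdle distribution rather than this empirical estimate, and the function name and the numbers below are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical P-value for an observed statistic.

    Counts null draws at least as large as the observed value and applies
    the usual +1 correction so the estimate is never exactly zero.
    """
    if not null_values:
        raise ValueError("need at least one null value")
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Hypothetical DWPC values for one metapath between a fixed node pair,
# computed on four degree-preserving permutations of the network.
null_dwpcs = [0.0, 1.0, 2.0, 4.0]
observed_dwpc = 3.0

p = empirical_p_value(observed_dwpc, null_dwpcs)
print(p)
```

  With only a handful of permutations the smallest attainable P-value is coarse, which is one practical reason to fit a parametric null instead of relying on raw counts.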
  
  Major comment: If I understand correctly, given two nodes (say "Alzheimer disease" and "Circadian rhythm"), the method extracts, in a statistically controlled way, the most significant metapaths joining the two nodes, and then the individual paths responsible for the enrichment. But this is not the most obvious question a life scientist would ask the network, which would instead be something like "Which are the pathways most significantly connected to 'Alzheimer disease'?" Indeed, this type of question would be the one to ask when aiming for drug repurposing (possibly replacing "pathways" with "compounds" or "pharmacologic classes"). Based on Fig. 4A, the pathways are presented, or "suggested," in decreasing order of number of metapaths, but this is hardly a ranking by significance. Would it be possible to summarize the results in such a way as to rank the pathway nodes connected to a given disease node by significance (or, more generally, to rank the nodes of a certain type by the significance of their connection to a given node of another type)? This should be discussed.
  
  I also have several minor concerns. (1) The authors introduce and compute a null distribution of the DWPC which takes into account node degree in a statistically controlled way when evaluating the connectivity between two nodes. However, the DWPC itself does take into account node degree, as the name implies, and contains a tunable parameter that can be optimized, at least when a ground truth is available (as in Ref 39 by the same first author). I understand such tuning is not possible when, as in the present case, no ground truth is available, but the authors should make this point more clearly. (2) I find Fig. 1B a bit confusing: according to the legend, the top rows are known treatments, which should have higher than expected connectivity. However, based on the colors as explained by the legend, the bottom treatment/disease pairs seem to have higher connectivity. (3) The acronym DWPC is defined after it has been used several times. (4) The legend of Figure 2 should specify that these results apply to the nodes "Alzheimer disease" and "Circadian rhythm", although this becomes clear in Fig. 4. (5) I don't think Figure 3, representing the home page of the web site, is especially useful. (6) I found Fig. 4 confusing: the sum of the path counts for the selected metapaths in panel B is way larger than the 425 results shown in panel C. As far as I understand, no path can belong to more than one metapath, so is there some further selection here? (7) The "Frontend" section of the Methods seems a bit too detailed for the GigaScience audience.
  
  Re-review: The authors have addressed all my comments in a satisfactory way.
2. GigaScience 21 Aug 2023
  
  in GigaScience
  
  This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad047), and the reviews are published under the same license. These are as follows.
  
  **Reviewer 1. Karthik Raman **
  
  The paper is very well-written and addresses an important problem. The database appears easy to use and contains a lot of pre-computed data, which will be useful for researchers to query and generate useful insights. I only have a few minor comments, which if addressed, could further strengthen this manuscript.
  
  Minor comments: Without line and page numbers, it was a bit tricky to point out the issues.
  
  1. "One such application" in the introduction does not read well; just "one application".
  
  2. It is nice to see that DWPCs that are not retained by the database can be generated on the fly. The paragraph goes on to mention "while still allowing on-demand access to the full metrics for all metapaths with length ≤ 3". Is it also possible to generate metrics for longer paths if needed?
  
  3. Below Fig 2, there is a point about the adjusted p-value. I see that the discussion about FDR is presented later in the manuscript (and well justified), but there could be a pointer here to that section.
  
  4. Is there a possibility to include other computations like betweenness centrality and motifs also? This kind of data looks really ripe for an excellent analysis of repeated motifs etc.
  
  5. I found the Methods extremely long, and they may be a bit distracting for readers of this manuscript. I was wondering if some of these can be moved to Supplementary.
  
  6. In the section on "Details of matrix DWPC implementation", it is stated that "our matrix methods were validated". It is not clear where these validations have been discussed. Supplementary?
  
  7. In the section on "Permuted hetnets", it is not fully clear what the parameters for the XSwap algorithm were. What were the parameters, e.g. number of swaps, etc.?
  
  8. In the section on "Details of the gamma-hurdle distribution", there is perhaps a missing equation below the second statement of "The probability of a draw from the distribution is".
  
  9. The validation here, which points to an ipynb, could be put in the Supplement.
  
  10. In the section on "Prioritizing enriched metapaths for database storage", what is the logic underlying the choice of parameters? "For metapaths with length ≥ 2, we chose an adjusted p-value threshold of 5 × (n_source × n_target)^−0.3".
  
  11. Under "Visual Design", are the colours chosen "colour-blind friendly"?
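  
  The threshold quoted in the comment on "Prioritizing enriched metapaths for database storage", 5 × (n_source × n_target)^−0.3, can be evaluated directly to see how it scales. The sketch below is only a worked calculation: the function name is hypothetical, and reading n_source and n_target as the numbers of source and target nodes for a metapath is an assumption.

```python
def adjusted_p_threshold(n_source, n_target):
    """Evaluate 5 * (n_source * n_target) ** -0.3 for positive node counts."""
    if n_source <= 0 or n_target <= 0:
        raise ValueError("node counts must be positive")
    return 5 * (n_source * n_target) ** -0.3


# Hypothetical node counts, chosen only to show the trend: the threshold
# shrinks as the number of possible source-target pairs grows.
for n_source, n_target in [(1, 1), (100, 100), (1000, 1000)]:
    print(n_source, n_target, adjusted_p_threshold(n_source, n_target))
```

  For very small node sets the expression exceeds 1, so in that regime it would not filter anything; the exponent controls how quickly the cutoff tightens for larger node type pairs.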
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.05.522941v1
www.biorxiv.org www.biorxiv.org

SODAR: managing multi-omics study data and metadata

2
1. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Scientists employing omics in life science studies face challenges such as the modeling of multi-assay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multi-assay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license. Competing Interest Statement: The authors have declared no competing interest.
  
  **Reviewer 2. Philippe Rocca-Serra **
  
  The reviewer thanks the authors for their efforts in producing the submitted manuscript. The authors describe a Django-based web application designed to support data management. The tool is built to support experimental metadata capture using the ISA format in its TSV form. The tool relies on iRODS to manage data files associated with the experimental metadata. The tool offers programmatic access via an API and a clear front end.
  
  Main comments:
  
  The title: "SODAR: enabling, modeling, and managing multi-omics integration studies" could be clearer. A more concise title, such as "SODAR: standard compliant management of multi-omics studies", would deliver a better message.
  
  Page 1, Abstract: It would benefit from further refinement, as there are several repetitions. Check the 3rd sentence for English ("ranging from....to...", s/whereas/to/). Change "Scientists from diverse backgrounds also have different demands for interfacing with the data, ranging from computational users that need programmatic or command line access whereas non-computational users need graphical interfaces." to: "Scientists, with different backgrounds, ranging from computational scientists to wet-lab scientists, have different needs when it comes to data access, with programmatic interfaces being favoured by the former and graphical ones by the latter". Instead of saying "under a permissive licence", be more explicit and plainly state "under MIT licence".
  
  Page 2, Introduction: What is the difference between "data analysis and integration of data"? There is repetition/redundancy in "An example of such complex study is (Esterhuyse et al., 2015) in infection biology, which will be used as an example below."
  
  Suggestion: Use of the term "modeling": using "plan" or "planning" may be better to remove any ambiguity about the nature of the modelling (statistical modeling, data modeling). Alternatively, prefer 'representation' or 'representing'. (The term "model" is repeated many times in the following sentences.)
  
  The statement "The most comprehensive standard for describing study metadata is the ISA-Tab format ..." is probably too strong. There are more formal (UML) models such as FUGE-OM (https://doi.org/10.1038/nbt1347) or CDISC SDM & SDTM. A more understated assessment, such as "a popular standard, owing to its simplicity, is the ISA-Tab format", would be preferable.
  
  "Alternatives include...": possibly cite other options for managing such complex datasets, as seen with BIDS in neuroscience (Gorgolewski, K., Auer, T., Calhoun, V. et al. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data 3, 160044 (2016). https://doi.org/10.1038/sdata.2016.44), or why not mention the HDF5 specification.
  
  This section could be improved by refining the transitions between the different ideas presented or organising the flow. For example, by laying out the challenges of 1/ dealing with experimental metadata and 2/ dealing with digital objects produced by instruments, which have the characteristics outlined by the authors (volume, depth). Then review the technical solutions, then present the choices made by this implementation, and possibly identify the selection criteria which led to choosing one specification over another.
  
  Results:
  
  Page 4: "Non-computational users can interface with SODAR using the graphical UI, whereas computational users can use command line interfaces and REST APIs from scripts and other external software." This repeats the abstract. I would suggest rephrasing to 'humanise' 'computational users' vs 'non-computational users', and identifying the functions and roles in actual labs (bioinformaticians, data analysts, aka dry-lab scientists) vs (experimentalists, wet-lab biologists).
  
  Figure 1: Same comment (in fact confirmed by the choice of characters). A question about the diagram: is it the case that the Web UI does not talk to the server via the API, as is done in some modern development? Probably highlight there the reliance on the Django framework.
  
  Section 2.1: The first sentence needs attention; check the English ("for both serving for modeling experiments..."). Also, there are existing systems (EBI MetaboLights tools on their GitHub repo, Dataverse, FAIRDOM-SEEK, Zendro...). So the storytelling should probably first talk about the survey of the existing tools and only then bring in arguments justifying new development.
  
  Table 1: It is odd to lump together blanket statements for tools such as LIMS, ELN or 'Study Databases' without clearly stating which ones specifically have been evaluated. It seems that one could formulate a table with very different results.
  
  Question: How was selection bias controlled for? Page 5: This section should be reorganised and each explanatory statement refined to add clarity. Case in point: "Arbitrary Experiments": Does experiment equate to 'ISA.Assay'? Is it akin to a Workflow or process Sequence? Question: among the key features that such a system should have to support the work of dry/wet lab scientists, surely deposition to public repositories should be high on the list. Why is this absent? Page 6: typo: s/bioinfsormaticians/bioinformaticians/. Punctuation to be checked: missing commas make for a difficult read. Suggestion: simplify the role of 'experimentalists' in the context of SODAR: "They use the templates provided by the Data Stewards to instantiate a wet lab track and track its metadata." Question: How are data stewards trained in ISA-Tab? Access to the demo tool gives the opportunity to use and test the component. While the UI is simple and intuitive, a number of limitations in the editing functionality make usage more difficult than it needs to be. Page 7: "of course, using the REST-API of SODAR, it is possible to automate these tasks". Could the authors produce a Jupyter notebook showing how to do so? It would be a nice addition and possibly a good resource that could facilitate uptake. Section 2-3, pages 8-9-10: this section could be streamlined and condensed to really focus on the interaction between shaping a sample processing & data acquisition workflow into a template which can be used by wet lab scientists, all this while allowing markup with ontology terms. Note: the ontology terms on the demo server do not resolve properly. Question: Why choose Bioportal over other services, e.g. EBI OLS? Question: How can value-sets be constrained in SODAR? Question: ontology browser: it is unclear if the ontologies need to be loaded locally or if they are accessed via an API call to the relevant services. Can the authors clarify this point?
The demo server did not seem to allow it, or I wasn't able to. Maybe a figure showing the functionality would help? Page 11: Internal Usage Statistics. Question: it seems that the mean size of an experiment stored in SODAR is ~60 samples and about 10 files per sample. These are relatively small studies. Can the authors provide insights about the performance of the platform with large studies (several thousands of samples and above)?
  
  Methods: Question: Installation and deployment of SODAR. Why do the authors omit to mention that SODAR can be deployed via Docker? It seems useful information. Question: AltamISA. Checking the library, it seems that development has stalled. Is it a concern? Have the authors tested swapping AltamISA with ISA-API? Is it at all possible? Could it be done via an adaptor of some sort? Can AltamISA convert to ISA-JSON or another public repository compatible format to help users disseminate their results? Comment: Figure 3 should not be supplementary material but part of the main content, as it is useful in showcasing the SODAR UI and customization.
  
  Re-review: The reviewer thanks the authors for their efforts and extensive rework of the manuscript, and for delivering this software stack. Minor corrections:
  
  Page 4, 2nd paragraph, first sentence: typo -> s/approaching itusing/approaching it using/. Page 7, 2nd paragraph, suggested edit: change from: "For publication, raw and processed data and metadata are deposited in scientific catalogues, study databases and registries. An example is the BioSamples database for metadata [22]." to: "For publication, metadata and raw or processed data are deposited in scientific catalogues, study databases and registries. Examples are the BioSamples database for metadata [22] and Short Read Archive for raw sequencing data [citation needed]."
  
  Important clarifications: 1. This sentence does a disservice to the manuscript: "Our work is representative of the work typically done by core units in clinics. Clinical settings often deal with humans as their primary sample source. This implies controlled access of data, or not being allowed to share confidential data. Thus, developing support for hosting data in a public repository is not our aim. Likewise, uploading data to other public repositories has not been a priority." Two reasons: the first is that it opens the can of worms of data governance and oversight of patient-related information; I would steer clear of that in this piece. The second is that I would flip the argument around: "While deposition to public repositories was not necessarily the priority, the development of an (almost, see below) ISA compliant system provides such a capability should the data owner need it". 2. In the results section, or in the documentation, a welcome addition would be examples of templates for non-sequencing based assays. For instance, since the authors mentioned their need to support proteomics and mass-spectrometry users, it would make sense to highlight the templates available. In other words, it would help the target audience of the manuscript locate 'metadata profile definitions' (somewhat akin to ISA configurations) for specific assay types. If I have missed it in the manuscript or the GitHub repo, please ignore. 3. "Dialectic" ISA format: Several examples available from the GitHub repository generally follow the ISA-Tab specifications but also introduce a local field: "Library Name". While such a value would make sense in the official ISA specification, it is currently not supported.
This leads to the creation of a diverging format. It would be sensible to keep "Library Name" as a presentation label (for display in the UI) and substitute it with "Labeled Extract Name" when exporting outside the database to the tab format, in order to retain compatibility with other ISA parsers and the official specifications. It could be added as an output option to the AltamISA parser in case deposition to public repositories is needed (e.g. EMBL-Metabolights). This would go some way in helping 'Interoperability' and would not be too onerous a change. Of note, I was recently made aware that the ENA repository would be accepting submissions in ISA-Tab and ISA-JSON format, hence raising this point to the authors. Suggestion: clarify this in the Methods section. Also, it seems the following example is missing the 'Assay Name' and 'Raw Data File' fields: https://raw.githubusercontent.com/bihealth/sodar-paper/main/GSE96583_PBMC_Single-Cell_Demo_Project/a_PBMC_test_scRNAseq_nucleotide_sequencing.txt
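
The header substitution suggested in point 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: `EXPORT_HEADER_MAP` and `export_isatab` are hypothetical names, not part of SODAR or AltamISA, and the function only touches the first (header) row of a tab-delimited assay table.

```python
# Hypothetical mapping from a local display label to the official ISA-Tab column name.
EXPORT_HEADER_MAP = {"Library Name": "Labeled Extract Name"}


def export_isatab(table_text, header_map=EXPORT_HEADER_MAP):
    """Return the table with local column headers renamed for export.

    Only the header row is changed; data rows are passed through untouched.
    """
    header, sep, body = table_text.partition("\n")
    cells = []
    for cell in header.split("\t"):
        # Compare without surrounding whitespace or quotes, but keep them in the output.
        name = cell.strip().strip('"')
        if name in header_map:
            cell = cell.replace(name, header_map[name])
        cells.append(cell)
    return "\t".join(cells) + sep + body


if __name__ == "__main__":
    demo = "Sample Name\tLibrary Name\tRaw Data File\nS1\tlib1\treads.fastq\n"
    print(export_isatab(demo))
```

Keeping the mapping in one dictionary means the UI can continue to show "Library Name" while every export path applies the same substitution.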
2. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Scientists employing omics in life science studies face challenges such as the modeling of multi assay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multi assay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.
  
  This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad052), and the reviews have been published under the same license. These are as follows.
  
  **Reviewer 1. Xiaotao Shen **
  
  The authors developed the SODAR tool, which supports multi-omics integration studies. This is a great tool that has a user-friendly interface and supports multi-omics integration. However, I have several concerns that need to be addressed before this manuscript can be considered for publication. How does SODAR handle multi-omics data that come from different samples? For example, gut microbiome data from stool samples and proteomics data from blood samples, which may be from the same person but collected on different dates. Since SODAR supports cell editing, how does it keep the metadata and expression data consistent automatically? The authors claim that SODAR can support multi-omics integration studies. However, I could not find out how SODAR does that. Could the authors give more description of that?
  
  Re-review: The authors have addressed all my comments and concerns.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.19.504516v3
www.biorxiv.org

MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

2
1. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status of each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source and we provide a web server that implements the approach.
  
  **Reviewer 2. Jianxin Wang **
  
  In this manuscript, the authors present MuLan-Methyl, a deep-learning framework for predicting 6mA, 4mC, and 5hmC sites. They use DNA sequence and taxonomic identity as features, and implement five popular transformer-based language models in MuLan-Methyl. MuLan-Methyl is open-sourced, and a web server is also provided for users to access it. Overall, I think the methodology of MuLan-Methyl is clear and innovative, and the experiments seem comprehensive. However, I do have several concerns that I believe should be addressed before the paper is accepted by GigaScience.
  
  Major 1. One major concern is that, in my opinion, DNA methylation is dynamic. Cytosines at the same position of the DNA sequence may have different methylation status in different samples, different cells, or even different developmental stages of a cell. So, how can we predict the methylation status of a site based only on its sequence (and taxonomic identity)? -- The authors should clarify, in the Introduction or Discussion section, in which cases MuLan-Methyl (as well as other methods that use only DNA sequence) can be used to study DNA methylation. -- The authors discuss motifs in Fig. 3, but only for positive samples. What about the motif distribution in the negative samples? Am I right to understand that this method is actually for discovering motifs (or sequence structures) that are highly correlated with methylation? -- What is the performance of MuLan-Methyl without taxonomic identity? 2. The authors compared MuLan-Methyl against iDNA-ABF and iDNA-ABT, especially on the independent test set (Fig. 2E). I think the authors should clarify whether they trained the models of the three methods using the same training datasets. If not, the authors should explain the reason. 3. I'm curious about the computational efficiency of MuLan-Methyl. How many parameters are in its model? Does MuLan-Methyl have advantages over other methods in terms of computational efficiency?
  
  Minor 1. I don't understand why the references were not ordered from 1 in the main text. 2. I suggest that the authors re-organize the Introduction section. There are too many small paragraphs in this section. 3. At the end of Page 2, "The type 4mC type is present in 4 species" should be corrected.
  
  Re-review:
  
  The authors have addressed most of my concerns. However, I still have one minor concern about the computational efficiency. The authors' response is not convincing when it only says "The number of models that MuLan-Methyl need to train and test on is less than the others, thus it has better computational efficiency than other models to some extent". If possible, I strongly suggest that the authors show some data comparing how much time and resources (GPU/CPU/RAM) are needed by each method.
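
A comparison of this kind could start from something as simple as the sketch below, which records wall-clock time and peak Python-level memory for one call. It is an assumption-laden illustration: `profile` is a hypothetical helper, `tracemalloc` only sees allocations made through Python's allocator, and GPU memory or memory held by native libraries would need framework-specific tooling that is not shown here.

```python
import time
import tracemalloc


def profile(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; a real comparison would call each method's
    # training or prediction entry point on the same input data.
    result, seconds, peak = profile(sorted, list(range(100_000, 0, -1)))
    print(f"wall time: {seconds:.3f} s, peak Python memory: {peak / 1e6:.1f} MB")
```

Running each method through the same wrapper on identical inputs and hardware would give the side-by-side figures the reviewer asks for.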
2. GigaScience 21 Aug 2023
  
  in GigaScience
  
  This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad054), and the reviews have been published under the same license. These are as follows.
  
  **Reviewer 1. Yupeng Cun **
  
  Zeng et al. proposed an ensemble framework for identifying three types of DNA-methylation sites, and performed a benchmark comparison on genomic data from multiple species. This paper gives a valuable study of how ensemble transfer learners work and of predictability in different species. My suggestion is that the manuscript is acceptable with the following minor revisions: 1. Calculate a consensus ranking using the Kendall's tau rank distance method for each method in Figure 2-C. 2. The multi-head self-attention and self-attention head formulas should be redescribed following this preprint: https://arxiv.org/pdf/1706.03762.pdf 3. MLM and MuLan-Methyl are mixed up in some cases; they need to be used in a consistent way.
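
The consensus ranking in point 1 could be sketched as follows. This is a minimal illustration, not the reviewer's or the authors' procedure: it counts discordant pairs as the Kendall tau distance and picks, by brute force, the ordering with the smallest total distance to all input rankings (a Kemeny-style consensus). The method names are placeholders, and brute force is only practical for a handful of methods.

```python
from itertools import combinations, permutations


def kendall_tau_distance(rank_a, rank_b):
    """Count item pairs that the two rankings place in opposite order."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def consensus_ranking(rankings):
    """Return the ordering with minimal summed Kendall tau distance to all rankings."""
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


if __name__ == "__main__":
    # Placeholder rankings, e.g. one ranking of methods per dataset or metric.
    rankings = [
        ["method_a", "method_b", "method_c"],
        ["method_a", "method_c", "method_b"],
        ["method_b", "method_a", "method_c"],
    ]
    print(consensus_ranking(rankings))
```

Ties between candidate orderings are broken by the alphabetical order in which permutations are generated, which is worth stating if such a ranking is reported.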

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.04.522704v2
www.biorxiv.org

A new haplotype-resolved turkey genome to enable turkey genetics and genomics research

2
1. GigaScience 21 Aug 2023
  
  in GigaScience
  
  Abstract: Background: The domesticated turkey (Meleagris gallopavo) is a species of significant agricultural importance and is the second largest contributor, behind broiler chickens, to world poultry meat production. The previous genome is of draft quality and partly based on the chicken (Gallus gallus) genome. A high-quality reference genome of Meleagris gallopavo is essential for turkey genomics and genetics research and the breeding industry. Results: By adopting the trio-binning approach, we were able to assemble a high-quality chromosome-level F1 assembly and two parental haplotype assemblies, leveraging long-read technologies and genomewide chromatin interaction data (Hi-C). These assemblies cover 35 chromosomes in a single scaffold and show improved genome completeness and continuity. The three assemblies are of higher quality than the previous draft quality assembly and comparable to the current chicken assemblies (GRCg6a and GRCg7). Comparative analyses reveal a large inversion of around 19 Mbp on the Z chromosome not found in other Galliformes. Structural variation between the parent haplotypes were identified in genes involved in growth providing new target genes for breeding. Conclusions: Collectively, we present a new high quality chromosome level turkey genome, which will significantly contribute to turkey and avian genomics research and benefit the turkey breeding industry.
  
  **Reviewer 2. Luohao Xu **
  
  This manuscript by Barros et al. presents a high-quality diploid turkey genome assembly which shows significant improvement relative to the previous one. This new assembly is timely and will likely be used as the reference turkey genome, but the authors should acknowledge that the W chromosome is absent (because the F1 individual was a male?). This manuscript fits more with "Data Note" than "Research" as I see most results are descriptive and confirmatory. While the chromosomal assembly is relatively complete, I am concerned whether it still contains assembly errors (because of not being polished by long reads?) which led to fewer genes being annotated. This assembly metric needs to be taken into account if this assembly is to be used as a reference. The authors need to provide the QV value (see the VGP standard), and evaluate indel errors in coding regions. Some of the results are very brief, without details or a figure, and so are difficult to assess, for instance those SVs affecting genes. Page 4, "two most important avian agricultural species": I think duck should be the second most important poultry species? Page 5, I believe the "F1 assembly" refers to the primary assembly or collapsed assembly; please define it more clearly. Page 6, it's unclear how the 36 chromosome models are defined, particularly for small microchromosomes (29-35). According to the karyotype of turkey (2n=80), a few chromosomal models are missing. Page 6, "This captures the chromosome arms in a single contig": does it apply to all chromosomes? This is unlikely, and data is not shown. Page 6, any idea why the coverage of the two parents differs (110X vs. 137X)? Page 6, "anchored the assemblies to the F1 assembly using RagTag". This suggests the chromosomal assembly of the two haplotypes was not independent, and relied on the F1 assembly. This can potentially lead to missing structural variations between the two haplotypes (inversions, translocations).
Page 7, please show more data to support the correct assembly of the chrZ inversion, including a Hi-C heatmap and long-read alignments spanning the inversion breakpoints. Note the Z chromosome inversion has been reported in Zhang et al. 2011 (BMC Genomics), which is not cited until the Discussion. Page 8, it's possible some genes were not annotated because of the presence of indels in coding regions. The genome assembly QV value can be calculated to measure the error frequency (Rhie et al., 2021 Nature). Page 8, please provide a statistical result for the gene density comparison. Page 8, at the bottom, please cite the sources of these bird genomes. Page 9, "Gene family contractions and expansions". These analyses were a bit crude. "Orthologous groups" is not equivalent to "gene family". Page 10, the phrase "F1 and parent assemblies" is confusing. Both haploid assemblies are derived from the diploid F1. Consider changing to "paternal and maternal genomes". Also, as I commented above, both parental chromosomal assemblies are based on the same reference (Mgal_WU_HG_1.0), so the contigs were ordered and placed in the same way. This process could mask potential non-co-linear segments. For a more appropriate way to independently assemble two chromosome-level assemblies, see the marmoset diploid genome paper (Yang et al., 2021 Nature). Page 10, please use a figure to show the SV over the BLB2 gene. Page 11, again, please visualize the result on the MAN2B2, GEMIN8, RIMKLB and RALYL cases. Page 11, "Loss of function variation": I am wondering whether the variations mentioned in this part are fixed in the corresponding populations? Page 11, "Knockouts of this gene lead..": a reference is needed. Page 12, "Avian genomes are known to…": references are missing. Page 12, "Distinct genomic landscapes of turkey micro and macrochromosomes": some patterns have been described in the literature, for instance, 10.1111/nyas.13295.
Please also perform some statistical analyses to support the claims, not just a figure. Page 13, "Conserved synteny within the Galliformes clade": please cite 10.1159/000078570 and 10.1007/s00412-018-0685-6. Page 13, "it is evident that especially the Z chromosome": also observed in 10.1038/s41559-019-0850-1. Page 13, "inversion of around 19 Mbp on the turkey Z": also reported in 10.1186/1471-2164-12-447. Page 14, "tail of the chicken Z chromosome lacks synteny": also reported in 10.1038/nature09172. This means figure S11 does not provide a novel finding. Page 14, "Combining long reads and genome-wide chromatin interaction data (Hi-C) enables the capture of chromosome arms in a single contig": again, is that correct, chromosome arms in a single contig? Page 18, it's known that wtdbg2 assemblies tend to contain errors, but it looks like the authors did not use long reads for polishing, only short reads? Page 20, "The corrected reads from TrioCanu were mapped to the Triocanu assembly with Minimap2 v2.17-r941 (Minimap2, RRID:SCR_018550) [45], options -x map-pb": what was this used for? Page 20, "Duplicated sequences were removed." How was this done?
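
For readers unfamiliar with the QV value requested above, it is a Phred-scaled consensus quality: QV = -10 * log10(error rate). The sketch below only shows that conversion; `phred_qv` is a hypothetical helper, and how the errors are counted (for example from k-mer comparisons against the reads) is outside its scope.

```python
import math


def phred_qv(num_errors, num_bases):
    """Phred-scaled quality: QV = -10 * log10(errors / bases)."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    if num_errors == 0:
        # No observed errors: the estimate is unbounded.
        return math.inf
    return -10.0 * math.log10(num_errors / num_bases)


if __name__ == "__main__":
    # One error per 10,000 bases corresponds to roughly QV 40,
    # one error per 100,000 bases to roughly QV 50.
    print(round(phred_qv(1, 10_000), 2))
    print(round(phred_qv(1, 100_000), 2))
```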
  
  Re-review The manuscript has been improved. After reading the revised manuscript, I have a few more concerns.
  
  Chromosome models. I suggest the chromosome naming should follow chicken's, e.g., chr6 can be chr2a, and the microchromosomes should be named according to chicken homology. I then noticed chr32 and chr35 do not have chicken homology, which is very concerning. It is either due to novel chromosomes (very unlikely), or the sequences could be unlinked contigs. In either scenario, the chromosome models must be clarified. The authors should provide strong evidence to support the chromosome model assembly for chr32 and chr35, e.g. FISH images, a Hi-C zoom-in view (Fig. S1 shows the whole genome, where the microchromosome models are not visible), synteny with chicken (note there is a new chicken assembly ASM2420605v1) or zebra finch chromosomes; otherwise, chr32 and chr35 cannot be identified as chromosomes. Centromere and telomere. To support complete chromosome assembly, I suggest the authors provide information about the assembly of telomere and centromere sequences, e.g. the presence/absence of TTAGGG at chromosomal ends. Most Galliformes microchromosome centromeres are known to contain a 41-bp satellite (10.1139/gen-2022-0012). The authors should investigate whether such centromere satellites are present in the assembly. Data availability. It appears the Hi-C data is not available in NCBI. The raw reads must be provided. In the abstract, there is no such term as "complete scaffold"; please remove "complete". Again, I do not see the support for two chromosome models: chr32 and chr35. The chrZ inversion is highlighted in the abstract, but this is not a novel finding, so the writing is misleading. Instead, the new genome assembly only CONFIRMS this inversion. The subtitle "Lineage specific expansion and contraction of protein-coding gene families" is unrelated to the text that follows it. "a 1.47 Mbp inversion on chromosome 1": I am wondering if this is the centromere? According to the chicken chr1 centromere position, it looks like it.
In Table 5, Parent2 has a much larger size of gained copy. Please show more details, e.g. chromosomal distribution. "BLB2": is this gene associated with a parent2-specific trait? Similarly, what about TRIM36, GRIA2, MAN2B2, and LRRC41? "The inversion was supported by a normal alignment at the approximate breakpoints (Supplementary File 1: Table S7 - Figure S16) and by the HiC contact map". The writing here is unclear. Hi-C data does not show a signal for the inversion; instead, it only supports that the assembly is correct. Bellott et al 2020 should be Bellott et al 2017. "Centromeres, however, are too long to traverse reliably in most cases". I do not see any analyses on centromeres. PRJEB42643 does not contain Hi-C data.
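
The telomere check suggested above (presence/absence of TTAGGG at chromosomal ends) could be approximated with a sketch like the following. It is an illustration only: `telomere_repeat_counts` is a hypothetical helper, the 1 kb window is an arbitrary assumption, and a real analysis would read sequences from the assembly FASTA and choose a threshold for calling a telomere.

```python
import re


def telomere_repeat_counts(seq, window=1000, motif="TTAGGG"):
    """Count motif copies (either strand) in the first and last `window` bases."""
    complement = str.maketrans("ACGT", "TGCA")
    rev_motif = motif.translate(complement)[::-1]  # TTAGGG -> CCCTAA
    pattern = re.compile(f"{motif}|{rev_motif}")
    seq = seq.upper()
    # Non-overlapping matches in each terminal window.
    start_hits = len(pattern.findall(seq[:window]))
    end_hits = len(pattern.findall(seq[-window:]))
    return start_hits, end_hits


if __name__ == "__main__":
    # Toy sequence: CCCTAA repeats at the start, TTAGGG repeats at the end.
    toy = "CCCTAA" * 5 + "ACGT" * 50 + "TTAGGG" * 3
    print(telomere_repeat_counts(toy, window=30))
```

Reporting these counts per chromosome model would show at a glance which ends are capped by telomeric repeats.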
  
  Re-re-review: A new chicken genome has been published during the revision: https://www.pnas.org/doi/10.1073/pnas.2216641120. I suggest the authors revise some parts of the manuscript, e.g. L66, L78, L83-85. L103, please make it clear that only the F1 was sequenced with long reads. L117-142, those results are very interesting, but perhaps the language can be more concise. L231-236, this paragraph is not important; please either move it to the supplementary material or remove it. In general, this manuscript can be much more streamlined. L310-315, this part has also been reported by Huang et al. 2023 PNAS, so this is not a novel finding. Please either streamline or remove it. L327, ref 36 is not a "recent" finding.
2. GigaScience 21 Aug 2023
  
  in GigaScience
  
  
  This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad051), and the reviews have been published under the same license. These are as follows.
  
  **Reviewer 1. Yunyun Lv **
  
  Reviewer Comments to Author: The turkey is important for agriculture as it is the second largest contributor to world poultry meat production. This study completes a chromosome-scale genome assembly with long-read sequencing and uses the trio-binning approach to generate a haplotype-resolved turkey genome, which gives scientific significance to further genetic studies within this species. However, I feel the content of this article needs improvement. Some parts were unclear and hard to follow; I list some of them below. After substantial revisions, I will suggest publication.
  
  In abstract: The sentence "These assemblies cover 35 chromosomes in a single scaffold and show improved genome completeness and continuity" seems odd and is hard to understand directly. Please revise it and make it clear. "The three assemblies are of higher quality than the previous draft quality assembly and comparable to the current chicken assemblies (GRCg6a and GRCg7)." Please clearly indicate the parameters used for the comparison and how they demonstrate higher quality. "Structural variation between the parent haplotypes were identified in genes involved in growth providing new target genes for breeding." The theoretical context of this sentence is not clear, so I suggest adding more information to make it clear.
  
  Considering there are no statistics in the conclusion, I suggest the conclusion sentence be revised to "we contribute a new high-quality turkey genome at chromosome-level, benefiting turkey genetics and other avian genomics research as well as turkey breeding industry."
  
  In the introduction: "Most of the chromosomes are small microchromosomes, while only a few macrochromosomes are present in the karyotype." Please clearly indicate how many microchromosomes there are in turkey and chicken; "most of" is uninformative for readers. "and by current standards would be considered of draft quality". What are the current standards? Please indicate them clearly. "Ongoing efforts in producing high quality assemblies of the microchromosomes in avian genomes have been unsuccessful due to multiple causes": what do the multiple causes refer to? Or do the features of microchromosomes lead to the unsuccessful assembly mentioned above? "For instance, improved annotation of (non)-coding genes benefits the functional interpretation of genome wide association studies (GWAS), and aids in identifying targets for gene editing": why do non-coding genes (I understand the non-coding genes to refer to regulatory regions, but actually they are not real genes) provide these benefits? Why can protein-coding genes (structural genes) not fulfil these roles? "The genome assemblies of turkey (this paper) and chicken, however, are of considerably higher quality compared to other Galliforme species. This provides opportunities for an in-depth comparison between the two most important avian agricultural species." I cannot follow the logic of placing this sentence here. It should be part of the discussion, after the comparison of the turkey genome with other avian genomes. "In this study we use a relatively new technique, the trio-binning approach, to construct high quality haplotype-resolved turkey assemblies." I feel it is necessary to explain the term "trio-binning approach", as many readers will not understand what it stands for. The long-read sequencing technology it relies on also connects closely to the preceding context.
  
  In results: Have you used other assemblers to complete the genome assembly, such as Flye, NextDenovo, or MECAT2, which may perform better? Have you tried 3D-DNA for chromosome-scale assembly? In my experience it may give better results. The gene annotation should be assessed with BUSCO.
  
  In discussion: "The quality of the assemblies presented in this study confirms the value of this method in not only providing a quality assembly but also in uncovering structural genomic variation." Please indicate which quality index reflects the quality of your genome assembly. "Thanks to these recent sequencing technologies, we are able to correct a number of wrongly oriented contigs in Turkey_5.1, a phenomenon often observed in short-read based assemblies." I feel this sentence is too informal.
  
  Re-review: The author has carefully amended the work in response to my prior concerns, and the quality of the new version has greatly improved, hence it is suggested that the manuscript be accepted.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.18.504375v1
Jul 2023
www.biorxiv.org www.biorxiv.org

Mycobacterial Metabolic Model Development for Drug Target Identification

1
1. GigaScience 06 Jul 2023
  
  in GigaByte
  
  Editor’s Assessment
  
  This work has generated metabolic models for the human pathogens Mycobacterium leprae and Mycobacteroides abscessus, alongside a new computational tool that can be used to identify potential drug targets. The standardised genomic scale metabolic models have been developed using the systems biology community standards for quality control and evaluation of models. After providing more detail on reproducibility, comparative performance of the models, and reuse, these resources are now published and are available for reuse by the global scientific community via the GigaDB, Biomodels, and PatMeDB repositories.
  
  This assessment refers to version 1 of this preprint.
  
  Summary

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.31.534705v1
www.biorxiv.org www.biorxiv.org

Training Infrastructure as a Service

2
1. GigaScience 04 Jul 2023
  
  in GigaScience
  
  Background: Hands-on training, whether it is in Bioinformatics or other scientific domains, requires significant resources and knowledge to set up and run. Trainers must have access to infrastructure that can support the sudden spike in usage, with classes of 30 or more trainees simultaneously running resource-intensive tools. For efficient classes, the jobs must run quickly, without queuing delays, lest they disrupt the timetable set out for the class. Often this is achieved by running on a private server where there is no contention for the queue, and therefore no or minimal waiting time. However, this requires the teacher or trainer to have the technical knowledge to manage compute infrastructure, in addition to their didactic responsibilities. This presents significant burdens to potential training events, in terms of infrastructure cost, person-hours of preparation, technical knowledge, and available staff to manage such events.
  
  Findings: Galaxy Europe has developed Training Infrastructure as a Service (TIaaS), which we provide to the scientific community as a service built on top of the Galaxy Platform. Training event organisers request a training and Galaxy administrators can allocate private queues specifically for the training. Trainees are transparently placed in a private queue where their jobs run without delay. Trainers access the dashboard of the TIaaS Service and can remotely follow the progress of their trainees without in-person interactions.
  
  Conclusions: TIaaS on Galaxy Europe provides reusable and fast infrastructure for Galaxy training. The instructor dashboard provides visibility into class progress, making in-person trainings more efficient and remote training possible. In the past 24 months, > 110 trainings with over 3000 trainees have used this infrastructure for training, across scientific domains, all enjoying the accessibility and reproducibility of Galaxy for training the next generation of bioinformaticians. TIaaS itself is an extension to Galaxy which can be deployed by any Galaxy administrator to provide similar benefits for their users. https://galaxyproject.eu/tiaas
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Elizabeth Ryder
  
  This technical note is an informative explanation of Training-Infrastructure-as-a-Service, which is a free service available to facilitate Galaxy training sessions. The service provides an easy way for instructors to set up infrastructure for trainings, enables learners to make progress through the training without long waiting times, and includes a dashboard through which instructors can easily monitor progress of learners. The article provides data showing the large number of events and locations that have benefited from using TIaaS. Because of the utility and general applicability of TIaaS, the article will be of interest to the readers of GigaScience.
  
  Minor suggestions:
  
  In the Development section: As a practical matter, it would be useful to know the typical timeline for approval of a training session. Also, can anyone who uses Galaxy become an instructor and request this service?
  
  In the Usage section, there is a sentence that reads, 'Class sizes have ranged considerably, from the median of 25 participants (std. dev 121) to a maximum of 1500 registrants for a fully asynchronous (self-paced) course.' It's a little unusual to talk about a median and standard deviation, since medians are non-parametric measures and SDs are parametric and measured with respect to the mean. I'd suggest using the median and interquartile range instead. I think a histogram of class size distribution would be informative, similar to the event distributions in Fig. 4.
  
  Grammatical / spelling errors:
  
  I'm not sure why 'Findings' appears before 'Background' - perhaps an editing error?
  
  p. 2
  'a limiting factor for events with large number of participants' should read 'with a large number of participants'
  'by it's design' should read 'by its design'
  'which to to preference' should read 'which to preference'
  
  p. 4
  'univeristy' should read 'university'
  
  p. 5
  This sentence is hard to scan as written; I think it needs a semi-colon after 'cluster' to make sense:
  Galaxy Europe uses it with HTCondor, and job rules that allow spill over to the main cluster, new machines are brought up in an OpenStack cluster specifically for training events and destroyed afterwards.
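  The reviewer's suggestion to report the median with the interquartile range rather than the standard deviation can be illustrated with a minimal sketch. The class sizes below are hypothetical values chosen only to mimic a skewed distribution with one very large course; they are not data from the manuscript, and the function name is an assumption made for this example.

```python
import statistics


def summarize_class_sizes(sizes):
    """Summarise class sizes with robust (median, IQR) and moment-based (mean, SD) statistics."""
    # method="inclusive" treats the data as the whole population of events
    q1, q2, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
    return {
        "median": statistics.median(sizes),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "mean": statistics.mean(sizes),
        "sd": statistics.stdev(sizes),
    }


# Hypothetical class sizes (NOT from the manuscript): mostly small classes plus one
# very large self-paced course, which inflates the mean and SD but not the median or IQR.
example_sizes = [10, 15, 20, 25, 25, 30, 40, 60, 1500]
summary = summarize_class_sizes(example_sizes)

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

  In this made-up example a single outlier drives the standard deviation far above the spread of the typical classes, while the interquartile range still describes the middle half of the events, which is the reason the median is usually paired with the IQR.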
2. GigaScience 04 Jul 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Azza Ahmed**
  
  The paper is well-written and neatly reports on the development of Training-Infrastructure-as-a-Service (TIaaS), a free infrastructure resource originally developed by Galaxy Europe and the Gallantries project together with the Galaxy community. TIaaS is a step towards democratizing bioinformatics training, where infrastructure can be a major barrier, even in advanced and well-developed countries.
  
  I especially appreciate the value of this resource for instructors and students in low and middle income countries, where infrastructure limitations may be exacerbated by the limited availability of well-trained system administrators able to cater to specific training needs. It was indeed gratifying to see training events using TIaaS in such countries in the figure 3 map, especially as it is not clear that TIaaS is deployed in such countries. The utility of the resource is self-evident: 438 training events in 48 months targeting > 19000 students. Thus, overall, I congratulate the authors for the success of their project, and the community for having such a great free resource at their disposal.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.08.23.263509v1
www.biorxiv.org www.biorxiv.org

EraSOR: Erase Sample Overlap in polygenic score analyses

6
1. GigaScience 04 Jul 2023
  
  in GigaScience
  
  Background: Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increases, the risk of inter-cohort sample overlap and close relatedness increases. Ideally sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work.
  
  Results: Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R²) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small if this makes up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios and are even robust to high levels of residual genetic and environmental stratification.
  
  Conclusion: The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution given high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Samuel Lambert (revision 2)
  
  I commend the authors for doing these extra analyses focused on more real-world applications of the method and adding them to the paper. I think the discussion is better contextualised and my final recommendation is that these warnings/caveats are placed in the software documentation as well (https://choishingwan.gitlab.io/EraSOR/).
2. GigaScience 04 Jul 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Samuel Lambert (revision 1)
  
  The revised manuscript is much clearer and better illustrates when and how to use the EraSOR method. However, I still think important analyses reflecting more common use cases are missing:
  - Use of EraSOR with multi-ancestry summary statistics.
  - Use of EraSOR-corrected sumstats with other PGS-derivation methods (e.g. LDpred or PRS-CS).
  - Providing results of a real sensitivity analysis for sample overlap. I understand that you won't know the true overlap in UKB but the difference in the adjusted and unadjusted SumStats performance in the presence of known overlap would be illustrative.
  
  Adding these analyses to the real UKB section would greatly benefit the manuscript and utility of the method. Apart from that I note that, related to line 19, the impact of sample overlap was also outlined as a pitfall by Wray et al Nat Genet (2013, PMID:23774735).
3. GigaScience 04 Jul 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Samuel Lambert
  
  In this paper Choi et al. describe EraSOR, a new tool to remove the effects of sample overlap between a set of summary statistics and a target dataset. EraSOR works by running a GWAS in the target dataset and then using LD-score regression techniques to estimate the heritability, genetic correlations of the phenotypes, and number of overlapping samples to decorrelate the effect sizes. The method is thoroughly described, and the simulation scenarios are relevant and well-motivated. However, the manuscript could better describe the inputs and characteristics of the decorrelated summary statistics, focusing more on the degree of bias in effect sizes rather than p-value inflation, and the practicalities of how the tool may be used.
  
  Specific Comments:
  
  The results of Figure 1/Supp Figure 1 are highly motivating, but the p-value of the association doesn't seem like the perfect measure of inflation. Plots of the effect size of the PRS compared to its expected effect (0, based on heritability) would better illustrate this.
  
  The paper proposes a method to remove the effects of sample overlap on summary statistics, but instead mostly focuses on how overlap biases the results of PRS prediction. Additional exploration of the decorrelated summary statistics themselves is needed to illustrate the validity of the method. Specifically, how different are the EraSOR adjusted summary statistics from the true summary statistics measured without sample overlap (e.g. distribution of effect size differences); what types of variants does EraSOR fail for or overcorrect (e.g. MAF differences between the summary statistics and the target cohort)? Are the results used as-is in other analyses, or do they have to be filtered in some way?
  
  The PRS analyses in the paper all use PRSice to perform clumping+thresholding, selecting the best p-value and LD thresholds on the target datasets. This could be considered overfitting to the target data, and other derivation methods that do not require a sample to optimize hyperparameters (e.g. PRS-CS, LDpred-auto) could be used. It would be good to provide some additional analyses showing that EraSOR outputs also work with other methods of PRS derivation, and that the results are not sensitive to overfitting through hyperparameter optimization.
  
  The PRS analysis of the real phenotype data in UKB should be expanded. Currently the analysis uses summary statistics derived in UKB with varying levels of overlap; however, this does not match the real scenario that EraSOR will likely be used in (applying EraSOR to an externally sourced GWAS that is then applied to UK Biobank). The authors should perform a descriptive analysis to show that EraSOR is useful in this real-world scenario by downloading summary statistics from the GWAS Catalog (with and without inclusion of UK Biobank), applying EraSOR, and quantifying the difference in accuracy (r2) and effect size. On a related note: does the ancestry of the summary statistics have to perfectly match the target cohort? How well does EraSOR work with multi-ancestry summary statistics where the LD panel might be mismatched?
  
  The point about insufficient adjustment the authors raise on lines 336-42 is quite important. Proper signposting about the limits of the decorrelation is needed in the software description and the discussion. From this passage, are the authors suggesting that known sample overlap should be avoided and that EraSOR should only be used as a sensitivity analysis to ensure that overlap does not exist? It would be useful to get the authors' perspective on whether the evaluation of a PRS in a cohort derived using EraSOR-adjusted summary statistics can be seen as truly external to the source GWAS.
  
  The paper should be accompanied by a more detailed user guide and some test data for the EraSOR tool. Are there any diagnostic plots that are produced that could be used to inspect the data quality?
4. GigaScience 04 Jul 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Jack Pattee** (revision 1)
  
  Thank you for your detailed responses; I have no further comments.
5. GigaScience 04 Jul 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Jack Pattee**
  
  Overall, I think that this manuscript is strong and describes a well-formulated method to address a relevant problem. There are a few outstanding questions about the performance of the EraSOR method from my perspective, which I'll detail as follows.My understanding of reference [16] indicates that equation (3) of this manuscript only holds for null SNPs, i.e. if SNP g is not associated with the outcome Y. If this is the case, then this should be discussed in the manuscript. I wonder if this can partially explain the 'under-estimation' behavior we see in the application to real data in Supplementary Figure 3. In particular, I am referencing the behavior where the EraSOR correction will under-estimate the predictive accuracy of the PRS in the target data, i.e. where delta-R^2 is negative. This behavior is not seen in the simulation and warrants further investigation and discussion. While the bias appears small, for some cases delta-R^2 approaches -.025, which corresponds to an under-estimation of Pearson's r by roughly .15; this is substantial. Could it be the case that, for highly polygenic traits such as height and BMI, the null-SNP assumption is unreliable and the performance of EraSOR is degraded? Does a fundamental assumption of sparse genetic association underlie EraSOR?I recommend that the real data application play a larger role in the manuscript narrative and be moved out of the supplementary. The simulations are appreciated and helpful, but there is nuance in the analysis of real data that cannot be replicated in simulation.I believe the reference to "Supplementary Figure 2" on line 346 should actually be "Supplementary Figure 3". I believe that the axis labels in Supp Figure 3 are flipped.Lines 82 and 83 reference genetic stratification and subpopulations; I think the relevance of these concepts should be introduced more clearly and they should be defined in this context. 
EraSOR concerns the overestimation of predictive accuracy and association incurred by sample overlap between the base and target GWASs; to this reader, it's not clear what this central issue has to do with population stratification. I realize that the derivation of the LD score method is motivated heavily by correcting for stratification; however, these concepts should be introduced more clearly in this manuscript.
  
  Line 88: consider defining LD score l_j.
  
  Lines 94-96: consider outlining the mathematical consequence of the assumption that "the two outcomes and cohorts are identical." It's the case that N_1 = N_2 = N_c = N, correct?
  
  Line 109 / equation (11): My understanding is that the relevant quantity of this derivation is N_c / sqrt(N_1 N_2), which allows us to define the correct matrix C in expression (4). If this is the case, perhaps the quantity of interest should be moved to the LHS of the equation in the final line of the expression, for clarity.
  
  As discussed in the manuscript, the estimated heritability is in the denominator of the expression for N_c / sqrt(N_1 N_2). The authors correctly discuss that the method should not be applied when there is doubt as to whether the heritability is different from zero. I would take this a step further; in cases where the heritability is zero, we cannot meaningfully apply the EraSOR correction, and thus I am not sure of the utility of the 'type I error' simulations in the manuscript. Perhaps an explicit test for h^2 > 0 should be worked into the EraSOR workflow?
  
  Line 148 / expression (12): If beta has a normal distribution here, it is the case that all SNPs in the simulation are associated with the outcome Y. 
This is a somewhat unusual choice for the distribution of SNP effects in a simulation; other applications such as LDPred (Vilhjalmsson et al, AJHG 2015) and LassoSum (TSH Mak et al, Genetic Epi 2017) use a point-normal distribution for simulated SNP effects, which effectively simulates the sparsity frequently observed in nature. Is there a reference or justification for the non-sparse simulation structure here?
  
  Line 215: there may be a typo in the expression for the variance of the residual term. Is it the case that the variance of the residual depends on the variance of a covariance term? If so, I am confused as to the derivation.
  
  Line 241: 'triat' should be 'trait'.
  
  The simulation results in this paper are based on clumping and thresholding for PRS, which does not estimate joint SNP effects, i.e. account for LD. Methods such as LDPred and LassoSum do so. Is there any reason to believe the results would be different for a method such as LassoSum?
  
  I am confused by the very low Fst between the simulated Finnish and Yoruban samples in simulation. As detailed on line 385: the reported Fst is > .1, but the simulated Fst is essentially zero. This seems likely to be an undesirable simulation artefact, and potentially invalidates the simulation study (or, at least, doesn't provide evidence that EraSOR functions correctly when Fst is large, which was the ostensible motivation for this simulation). Is there no way to effectively simulate populations with a larger Fst?
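The contrast raised in the comment on line 148, between normally distributed effects for every SNP and a sparse point-normal distribution, can be illustrated with a minimal sketch. The function name, the causal fraction `p_causal`, and the heritability value are illustrative assumptions; they are not the settings used in the manuscript, in LDPred, or in LassoSum.

```python
import random

def simulate_effects(n_snps, h2, p_causal=1.0, seed=0):
    """Draw SNP effects from a point-normal distribution.

    With probability p_causal a SNP gets a N(0, h2 / (n_snps * p_causal))
    effect; otherwise its effect is exactly zero. p_causal=1.0 recovers the
    non-sparse model in which every SNP is associated with the outcome.
    """
    rng = random.Random(seed)
    sd = (h2 / (n_snps * p_causal)) ** 0.5
    return [rng.gauss(0.0, sd) if rng.random() < p_causal else 0.0
            for _ in range(n_snps)]

# Same total heritability, spread over all SNPs versus about 1% of SNPs.
dense = simulate_effects(10_000, h2=0.5, p_causal=1.0)
sparse = simulate_effects(10_000, h2=0.5, p_causal=0.01)

n_null_dense = sum(b == 0.0 for b in dense)
n_null_sparse = sum(b == 0.0 for b in sparse)
print(n_null_dense, n_null_sparse)
```

Under the dense setting there are no null SNPs at all, which is the situation the reviewer questions; under the sparse setting most SNPs are null while the expected sum of squared effects is unchanged.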
6. GigaScience 04 Jul 2023
  
  in GigaScience
  
  Background Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increases, the risk of inter-cohort sample overlap and close relatedness increases. Ideally sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work.
  
  Results Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R²) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small if this makes up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. 
A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios and are even robust to high levels of residual genetic and environmental stratification.
  
  Conclusion The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution given high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Christopher C. Chang
  
  Reviewer Comments to Author: This paper addresses a significant need that has arisen in the interaction between privacy rules and ever-larger genomic datasets, and I find the results to be very promising and clearly worth publishing. I just have a few comments on some methodological details:
  
  line 130: Have you compared the effectiveness of this algorithm with plink2 --king-cutoff?
  
  lines 145-155: If I understand this correctly, these simulated quantitative traits are still normally distributed, they just aren't standardized to mean 0 variance 1. If the intent is to "simulate phenotypes that [do] not follow the standard normal distribution", I'd expect it to be more valuable to look at e.g. the log-normal case, where an alert user might transform the phenotype to normal, but some users may fail to do so. A mixture distribution may also be worth looking at.
  
  lines 238-239: Have you considered using the "cc-residualize" option of plink2 --glm, which removes most of the computational cost of including PCs in your binary trait analysis?
  
  lines 383-387: This is interesting; there is some room for follow-up investigation here. Thanks for posting all the scripts needed for another researcher to easily reproduce this Fst=0.00639 value; this could help facilitate development of a better genotype-simulation tool.
  
  Also, some minor copyedits:
  
  line 84: "subpopulation" -> "subpopulations"
  line 342: "overlaps" -> "overlap"
  line 363: "ErasOR" -> "EraSOR"
  line 376: "different level of environmental stratifications" -> "different levels of environmental stratification"
  line 384: "population" -> "populations"
  line 402: "capture" -> "captured"
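The log-normal case raised in the comment on lines 145-155 can be sketched as follows. This is only an illustration under assumed values (a standard normal underlying trait and 20,000 samples); the variable names are hypothetical and nothing here reproduces the authors' simulation scripts.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness (third standardized moment, population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(42)
# Hypothetical normally distributed underlying trait.
liability = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
# Log-normal phenotype: what a user would observe before any transformation.
phenotype = [math.exp(x) for x in liability]
# An alert user might log-transform the phenotype back to normality.
transformed = [math.log(y) for y in phenotype]

skew_raw = skewness(phenotype)
skew_log = skewness(transformed)
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The raw phenotype is strongly right-skewed while the log-transformed one is close to symmetric, so running an analysis on both versions would show how sensitive a method is to a user skipping the transformation.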

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.12.10.472164v1
www.biorxiv.org www.biorxiv.org

The Crown Pearl V2: an improved genome assembly of the European freshwater pearl mussel Margaritifera margaritifera (Linnaeus, 1758)

1
1. GigaScience 03 Jul 2023
  
  in GigaByte
  
  Editor’s Assessment
  
  Like other mollusc species, the freshwater pearl mussel (Margaritifera margaritifera) has a challenging genome to assemble owing to its large size, heterozygosity, and repetitive sequence. The first published M. margaritifera genome was highly fragmented, but here an improved reference genome assembly was generated using PacBio CLR long reads to reduce fragmentation levels, missing and truncated genes, and chimerically assembled regions. The number of gene models predicted is a bit higher than in other molluscan genomes, but after clarification and double-checking it seems in line with some Mollusca and Bivalvia that have similar or higher numbers of gene predictions. This new genome represents a new resource to start exploring the many biological, ecological, and evolutionary features of this threatened and commercially important group of organisms.
  
  This assessment refers to version 1 of this preprint.
  
  Summary

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.11.528107v1
www.biorxiv.org www.biorxiv.org

Genome assembly of the hybrid grapevine Vitis ‘Chambourcin’

2
1. GigaScience 03 Jul 2023
  
  in GigaByte
  
  Editor’s Assessment
  
  Hybrid genomes are tricky to assemble, and few genomic resources are available for hybrid grapevines such as ‘Chambourcin’, a French-American interspecific hybrid grape grown in the eastern and midwestern United States. Here is an attempt to assemble ‘Chambourcin’ using a combination of PacBio HiFi long-reads, Bionano optical maps, and Illumina short-read sequencing technologies, producing an assembly with 26 scaffolds, an N50 length of 23.3 Mb and an estimated BUSCO completeness of 97.9% that can be used for genome comparisons, functional genomic analyses, and genome-assisted breeding research. Error correction and Pilon polishing were a challenge with this hybrid assembly, but the authors improved it after trying a few different approaches during the review process, and as they have documented what they did and are clear about the final metrics, users can assess the quality themselves.
  
  This assessment refers to version 2 of this preprint.
  
  Summary
2. GigaScience 03 Jul 2023
  
  in GigaByte
  
  Background ‘Chambourcin’ is a French-American interspecific hybrid grape variety grown in the eastern and midwestern United States and used for making wine. Currently, there are few genomic resources available for hybrid grapevines like ‘Chambourcin’.
  
  Results We assembled the genome of ‘Chambourcin’ using PacBio HiFi long-read sequencing and Bionano optical map sequencing. We produced an assembly for ‘Chambourcin’ with 27 scaffolds with an N50 length of 23.3 Mb and an estimated BUSCO completeness of 98.2%. 33,265 gene models were predicted, of which 81% (26,886) were functionally annotated using Gene Ontology and KEGG pathway analysis. We identified 16,501 common orthologs between ‘Chambourcin’ gene models, V. vinifera ‘PN40024’ 12X.v2, VCOST.v3, V. riparia ‘Manitoba 37’ and V. riparia Gloire. A total of 1,589 plant transcription factors representing 58 different gene families were identified in ‘Chambourcin’. Finally, we identified 310,963 simple sequence repeats (SSRs), repeating units of 16 base pairs in length in the ‘Chambourcin’ genome assembly.
  
  Conclusions We present the genome assembly, genome annotation, protein sequences and coding sequences reported for ‘Chambourcin’. The ‘Chambourcin’ genome assembly provides a valuable resource for genome comparisons, functional genomic analysis, and genome-assisted breeding research.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.84) and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Lingfei Shangguan ** Reviewers Comments: Grapevine is one of the most important fruit crops in the world, and ‘Chambourcin’ is a hybrid wine grape variety that represents a cross between North American and European Vitis species. The authors have sequenced the genome of ‘Chambourcin’ and obtained the repeat sequences and gene annotation information. However, the sequencing depth was too low for the grape genome, especially given its high heterozygosity. They also did not apply Illumina sequencing for sequence correction.
  
  Re-review: Although the authors have made some corrections and improvements, the genome quality is still low, and the manuscript has not improved significantly. The authors should provide the haplotype sequences, and describe the genome assembly and correction steps more clearly. Moreover, the innovation of the article is insufficient. I suggest rejection.
  
  **Reviewer 2. Pablo Carbonell-Bejerano **
  
  Are all data available and do they match the descriptions in the paper? No. Access to the raw data for the RNA-seq dataset that was used for gene predictions is not indicated
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. Any description of the RNA-seq dataset and its origin or features is fully missing. I could not find other data that would be required according to guidelines in http://gigadb.org/site/guide: - Full (not summary) BUSCO results output files (text) - readme.txt including all file names with a brief description of each - sample metadata that complies with the Genomic Standards Consortium.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  Yes. Sequencing and bioinformatic methods followed are generally sound.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. 1. Availability for the scripts used in bioinformatic analyses and data plotting is generally missing.
  
  L124. Authors describe that minimap2 was used to obtain the dotplot. However, minimap2 alone does not produce dotplots.
  
  L131. It is unclear how ‘PN40024’ 12X.v2, VCost.v3 protein annotations were used as input of BRAKER2. Do authors mean protein sequences instead? Where were these protein data retrieved from? How are proteins aligned to the assembly? Was BRAKER run from masked or unmasked assembly?
  
  Is there sufficient data validation and statistical analyses of data quality? No. 1. Validation of the original material for its true-to-typeness as the 'Chambourcin' cultivar genotype is not mentioned, nor is the number of different plants used for DNA extraction. While post-assembly validation of the Chambourcin genome assembly genotype from the mapped Chambourcin rhAmpSeq markers may be possible, such genotype validation is not mentioned in the text either.
  
  In general, the quality and the genome variation represented in the Chambourcin genome assembly produced here could have been further tested. For instance, the 2% BUSCO duplication and the 501.5 Mb primary assembly size, compared with the 481.5 Mb haploid genome size that can be inferred from the k-mer analysis presented by the authors, suggest that further duplication purging of the primary assembly is likely needed. This issue could be addressed by looking for assembly regions with reduced alignment depth when all HiFi reads are mapped to the primary assembly. Duplicated regions to be purged could also be supported by co-linear assembly segments sharing BUSCO duplicated genes. For assembly reliability assessment, the 10X, rhAmpSeq, or Illumina WGS data that are available for Chambourcin could also be used to validate genome variants represented in this Chambourcin assembly when comparing the inter-haplotype variants detected between primary and haplotig assemblies or the haplotypes with genome assemblies from other genotypes.
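The depth-based check suggested above can be sketched in a few lines. This is a minimal illustration that assumes per-window mean depths have already been computed from HiFi reads mapped to the primary assembly; the window names, the toy depth values, and the 0.25 to 0.75 thresholds are hypothetical, and dedicated purging tools apply more careful logic.

```python
import statistics

def flag_low_depth_windows(depths, low=0.25, high=0.75):
    """Return window IDs whose depth is roughly half the typical depth.

    depths: mapping of window ID -> mean read depth in that window.
    low/high: bounds, as fractions of the median depth, that define
    "roughly half" (illustrative thresholds, not tuned values).
    """
    median_depth = statistics.median(depths.values())
    return [win for win, d in depths.items()
            if low * median_depth <= d <= high * median_depth]

# Toy per-window depths; the window names are made up for illustration.
toy_depths = {
    "scaffold1:0-100kb": 28.0,
    "scaffold1:100-200kb": 27.0,
    "scaffold2:0-100kb": 14.5,   # about half depth: candidate duplicate
    "scaffold2:100-200kb": 29.0,
    "scaffold3:0-100kb": 13.0,   # about half depth: candidate duplicate
    "scaffold3:100-200kb": 30.0,
    "scaffold4:0-100kb": 28.5,
}
candidates = flag_low_depth_windows(toy_depths)

# Excess of the primary assembly over the k-mer-based haploid estimate,
# using the two sizes quoted in the review.
excess_mb = 501.5 - 481.5
excess_pct = 100 * excess_mb / 481.5
print(candidates, f"{excess_mb:.1f} Mb ({excess_pct:.1f}%)")
```

Windows at about half the median depth are candidates for retained haplotigs, since reads from the two haplotypes are split between two copies; the 20 Mb excess gives a rough sense of how much sequence such windows might account for.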
  
  Is the validation suitable for this type of data? Yes. The validation is suitable, although it might not suffice in all cases.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data? No. As described before, there is missing information at several instances, like for the origin of the RNA-seq.
  
  Additional Comments: 1. L171. Is it correct that total length of Bionano maps was as small as 962,964 bp? Or do authors mean kb instead of bp in that sentence?
  
  The mapping of Chambourcin rhAmpSeq markers could have been further exploited to phase contig haplotypes before purging haplotypes and assembly scaffolding?
  
  For the Conclusion in L254, it might be arguable whether the presented Chambourcin genome assembly is the first genome assembly of a complex interspecific hybrid or not. For instance 'Shine Muscat' might also be considered a complex inter-specific hybrid grape cultivar and its genome assembly was published: https://academic.oup.com/dnaresearch/article/29/6/dsac040/6808674 It might even be arguable whether the one presented in this publication is the first Chambourcin genome assembly as there is a 10X Genomics-based assembly available for Chambourcin: https://www.nature.com/articles/s41467-019-14280-1
  
  Re-review: Efforts to improve the accuracy of the MS and the availability of data are clear in the revised version. Authors have included descriptions of M&M procedures and information about the origin of several datasets that were missing. They also included files with commands and original results to the FTP server. In addition, they did further de-duplication of the assembly, added Illumina sequencing for assembly polishing, and included further QC stats and comparisons to another recently published hybrid grapevine genome assembly.
  
  Most revision actions were successful. However, it is not recommended to polish HiFi assemblies with Illumina reads, as in most cases it harms the consensus quality more than it improves it, which is particularly true for repetitive and highly heterozygous genomes like that of the Chambourcin grapevine cultivar. In fact, the BUSCO completeness of 97.9% after Pilon short-read polishing, compared to 98.2% in the former version, indicates that polishing with Illumina short reads is indeed harmful in this revised version. I agree with the authors that 28x depth of PacBio HiFi reads should suffice to produce a quality genome assembly without using more depth or other sequencing technologies, as they indicate in their response. I would recommend removing the Pilon polishing from the final assembly version, as it is only recommended for error-prone PacBio CLR or Nanopore assemblies. Instead, the authors could use the Illumina reads for k-mer analysis of assembly consensus quality and completeness.
  
  **Editorial Board Member adjudication: **
  
  Comment 1. How many times did you do the polishing with Pilon? This is not clear in the documents provided. It could be 1 round or many. Many would be a concern. When we run error correction on genomes, we monitor BUSCO and when it drops, roll back one iteration.
  
  Comment 2. How many sites were corrected in the polishing of the primary and haplotig assembly?
  
  Comment 3. Can you run KAT (“KAT: A K-Mer Analysis Toolkit to Quality Control NGS Datasets and Genome Assemblies.” Bioinformatics 33 (4): 574–76) to check the diploid, primary and haplotig assemblies?
  
  Comment 4. Can you align the mRNAseq and whole genome shotgun reads to diploid, primary and haplotig assemblies and report the percent mapping including the properly paired?
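The practice described in Comment 1 (monitor a quality score after each polishing round and roll back one iteration when it drops) can be sketched generically. The `polish` and `score` callables are placeholders for whatever polisher and completeness metric a pipeline wraps, and the score table below is invented purely to exercise the loop.

```python
def polish_with_rollback(assembly, polish, score, max_rounds=5):
    """Polish iteratively; stop and keep the previous round if the score drops.

    polish: callable taking an assembly and returning a polished assembly.
    score:  callable returning a quality metric (e.g. BUSCO completeness).
    Both are placeholders for whatever tools a pipeline actually wraps.
    """
    best, best_score = assembly, score(assembly)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = polish(best)
        candidate_score = score(candidate)
        history.append(candidate_score)
        if candidate_score < best_score:
            break  # roll back: keep the assembly from the previous round
        best, best_score = candidate, candidate_score
    return best, best_score, history

# Toy stand-ins: an "assembly" is just a round counter, and the score table
# is invented so that quality improves for two rounds and then declines.
toy_scores = {0: 97.5, 1: 98.0, 2: 98.2, 3: 97.9, 4: 97.0}
final, final_score, history = polish_with_rollback(
    0, polish=lambda a: a + 1, score=lambda a: toy_scores[a])
print(final, final_score, history)
```

Keeping the full score history alongside the retained assembly also answers the question of how many rounds were run, which is what the comment asks the authors to document.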

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.18.524616v2
Jun 2023
www.biorxiv.org www.biorxiv.org

FriendlyClearMap: An optimized toolkit for mouse brain mapping and analysis

2
1. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Tissue clearing is currently revolutionizing neuroanatomy by enabling organ-level imaging with cellular resolution. However, currently available tools for data analysis require a significant time investment for training and adaptation to each laboratory’s use case, which limits productivity. Here, we present FriendlyClearMap, an integrated toolset that makes ClearMap1 and ClearMap2’s CellMap pipeline easier to use, extends its functions, and provides Docker Images from which it can be run with minimal time investment. We also provide detailed tutorials for each step of the pipeline.
  
  For more precise alignment, we add a landmark-based atlas registration to ClearMap’s functions as well as include young mouse reference atlases for developmental studies. We provide alternative cell segmentation methods besides ClearMap’s threshold-based approach: Ilastik’s Pixel Classification, importing segmentations from commercial image analysis packages and even manual annotations. Finally, we integrate BrainRender, a recently released visualization tool for advanced 3D visualization of the annotated cells.
  
  As a proof-of-principle, we use FriendlyClearMap to quantify the distribution of the three main GABAergic interneuron subclasses (Parvalbumin+, Somatostatin+, and VIP+) in the mouse fore- and midbrain. For PV+ neurons, we provide an additional dataset with adolescent vs. adult PV+ neuron density, showcasing the use for developmental studies. When combined with the analysis pipeline outlined above, our toolkit improves on the state-of-the-art packages by extending their function and making them easier to deploy at scale.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad035 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Reviewer Yimin Wang **
  
  This work (FriendlyClearMap) attempts to combine several tools such as ClearMap 1/2, BrainRender, etc., and integrate certain functions into a Docker image for ease of use. The authors then demonstrate the use of FriendlyClearMap by analysing PV+, SST+ and VIP+ neurons. Some detailed comments are below:
  
  1/ P4, second paragraph, line 3, "vs." -> "versus".
  
  2/ P9, third paragraph, line 8, conflict between "lastly" and "finally"
  
  3/ P9, third paragraph, line 8, "our tool allows …".
  
  4/ This work can be regarded as a reengineering effort based on several previous toolkits in order to facilitate the workflow of registration, segmentation, analysis, and visualization. Essentially, no new technology is involved in this work and no new application is enabled by FriendlyClearMap. Therefore, in order to emphasize the unique contribution of this work, the authors could elaborate on how this tool makes biologists' work easier.
  
  5/ The results for Figure 2g are somewhat trivial. The authors might consider replacing it with a more impressive analysis.
  
  6/ The majority of the results are related to cell segmentation and counting. Quantitative plots/tables could be provided for more information. In addition, the accuracy of the results could also be discussed.
  
  7/ Last but not least, as there is no substantial novelty in the software, the authors could consider changing the focus of the manuscript from a tool paper to a resource/results paper, emphasizing new biological findings that are obtained by using FriendlyClearMap.
2. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Reviewer Chris Armit
  
  This Technical Note paper describes "FriendlyClearMap: An optimized toolkit for mouse brain mapping and analysis".
  
  Whereas the core concept of a data analysis tool to assist in spatial mapping of cleared mouse tissues is perfectly reasonable, there are multiple issues with the documentation that render this toolkit very difficult to use. I detail below some of the issues I have encountered.
  
  GitHub repository
  
  The installation instructions are missing from the following GitHub repository: https://github.com/MoritzNegwer/FriendlyClearMap-scripts
  
  The closest reference I could find to installation instructions is the following: "Please see the Appendices 1-3 of our <X_upcoming> publication for detailed instructions on how to use the pipelines. <X_protocols.io goes here>"
  
  Step-by-step installation instructions should be included in the GitHub repository. In addition, the authors should add the protocols.io links to their GitHub repository.
  
  Protocols.io
  
  The installation instructions are missing from the following protocols.io links:
  
  Run Clearmap 1 docker dx.doi.org/10.17504/protocols.io.eq2lynnkrvx9/v1
  
  Run Clearmap 2 docker dx.doi.org/10.17504/protocols.io.yxmvmn9pbg3p/v1
  
  Both of these protocols include the following instruction: "Then, download the docker container from our repository: XXX docker container goes here"
  
  In the documentation, the authors need to unambiguously refer to the specific Docker container that a user needs to install for each software tool.
  
  Test Data
  
  I could not find the test data in the form of image stacks that would be needed to test the FriendlyClearMap protocols. Figure 1 refers to 16-bit TIFF image stacks, and I presume these to be the input data that is needed for the image analysis pipelines described in the manuscript. The authors should provide details of the test imaging dataset, including links if necessary to where the image stacks data can be downloaded, in the 'Data Availability' section of the manuscript.
  
  Platform / Operating Systems
  
  In the 'Data Availability' section of the manuscript, the authors specify that the Operating Systems are "platform-independent". However, the protocols.io documents list a set of requirements for Windows and LINUX, but not for MacOS. The authors should provide installation instructions and system requirements for MacOS.
  
  I reject this manuscript on the grounds that, due to lack of appropriate documentation and installation instructions, the software tool is too difficult to use and therefore has extremely low reuse potential.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.16.528882v1
www.biorxiv.org www.biorxiv.org

A workflow reproducibility scale for automatic validation of biological interpretation results

2
1. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Background Reproducibility of data analysis workflow is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same. Therefore, it still remains a challenge to automatically evaluate the reproducibility of results.
  
  Results We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics.
  
  Conclusions Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.
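The idea of comparing biological feature values between two executions, rather than comparing output files byte by byte, can be sketched as below. The three labels, the 1% relative tolerance, and the feature names are illustrative assumptions; they are not the scale or the thresholds defined by the authors.

```python
def compare_features(original, reproduced, rel_tol=0.01):
    """Classify each shared feature as identical, similar, or different.

    original/reproduced: mapping of feature name -> numeric value.
    rel_tol: relative difference below which two values count as "similar"
    (an arbitrary illustrative threshold).
    """
    report = {}
    for name in sorted(original.keys() & reproduced.keys()):
        a, b = original[name], reproduced[name]
        if a == b:
            report[name] = "identical"
        elif abs(a - b) <= rel_tol * max(abs(a), abs(b)):
            report[name] = "similar"
        else:
            report[name] = "different"
    return report

# Hypothetical feature values extracted from two executions of one workflow.
run_a = {"total_reads": 2_000_000, "mapping_rate": 0.953, "variant_count": 41_210}
run_b = {"total_reads": 2_000_000, "mapping_rate": 0.951, "variant_count": 43_900}
report = compare_features(run_a, run_b)
print(report)
```

A per-feature report of this kind gives the graduated view the abstract describes: two runs can agree exactly on read counts, agree within tolerance on mapping rate, and still differ on variant counts.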
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad031 ) , which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Reviewer Stian Soiland-Reyes ** Hi, I am Stian Soiland-Reyes https://orcid.org/0000-0001-9842-9718 and have pledged the Open Peer Review Oath https://doi.org/10.12688/f1000research.5686.2:
  
  Principle 1: I will sign my name to my review
  
  Principle 2: I will review with integrity
  
  Principle 3: I will treat the review as a discourse with you; in particular, I will provide constructive criticism
  
  Principle 4: I will be an ambassador for the practice of open science.
  
  This review is licensed under a Creative Commons Attribution 4.0 International License.
  
  --- This article presents a method for comparing reproducibility of computational workflow runs captured as RO-Crates, by calculating a set of genomics metrics ("features") and adding these to the crate's metadata. Overall I find this a valuable contribution and worthy of publication with GigaScience, primarily as a way for users of workflow systems CWL, Nextflow, Cromwell or Snakemake to ensure reproducibility, but also for workflow engine developers who may want to build on this methodology to improve their provenance support. In general the method proposed is sound, however it does have some limitations and inherent assumptions that are not highlighted sufficiently in the current manuscript, particularly concerning the selection of features and the reproducibility of the metrics calculation itself. I have detailed this with some points below that I would like the authors to clarify in a minor revision.
  
  --- Note - the below questions from GigaScience Reviewer Guidelines mainly relate to data, but I also here interpret them for the software described.
  
  Q1: Is the rationale for collecting and analyzing the data well defined? The authors' workflow executions https://doi.org/10.5281/zenodo.7098337 are based on three third-party bioinformatics workflows. Although they are not particularly "large-scale", they are representative best-practice pipelines in this field (data sizes from 200 MB to 6 GB) and also fairly representative for scalable workflow systems (Nextflow, CWL and WDL) used by bioinformaticians.
  
  Q2: Is it clear how data was collected and curated? It is not explicit in the text why these particular workflows were selected, beyond being realistic pipelines used in research. I would suggest something like "these workflows have been selected as fairly representative and mature current best-practice for sequencing pipelines, implemented in different but typical workflow systems, and have a similar set of genomics features that we can assess for provenance comparison." The workflows have each been cited, but I would appreciate some consistency so that each workflow is cited both by its closest journal article and by its original download source (e.g. GitHub).
  
  Q3: Is it clear - and was a statement provided - on how data and analyses tools used in the study can be accessed? Yes, full availability statements have been provided both for data and software, archived on Zenodo for longevity.
  
  Q4: Are accession numbers given or links provided for data that, as a standard, should be submitted to a community approved public repository? Yes, the tools have been added to https://bio.tools/ -- I don't think it's necessary to further register the data outputs with accession numbers. RRIDs for tools can be considered at a later stage, perhaps only for Sapporo.
  
  Q5: Is the data and software available in the public domain under a Creative Commons license? Yes, the software and dataset are open source under the Apache License, version 2.0. The dataset https://doi.org/10.5281/zenodo.7098337 embeds existing workflows and data; however, this is OK as included resources such as the rnaseq Nextflow workflow have compatible licenses (MIT) or are also Apache-licensed. The manuscript has software citations for two of the workflows, but this is missing for the CWL workflow, which is only cited by manuscript (33) (also missing DOI). It is unclear if any of the workflows are registered in https://workflowhub.eu/ but that should primarily be done by their upstream authors. The RO-Crates in https://doi.org/10.5281/zenodo.7098337 don't include any licensing and attribution for the embedded workflows, and their metadata file misleadingly declares the crate license as CC0 public domain. While CC0 is appropriate for examples and the metadata file itself, the embedded MIT/Apache workflows from third parties can't legally be relicensed in this way and should have their original licenses declared. See https://www.researchobject.org/ro-crate/1.1/contextual-entities.html#licensing-access-control-and-copyright for details. I understand these RO-Crates are generated automatically by Sapporo, which does not directly understand licensing, and for documenting the test runs with Sapporo, I think these should not be modified post-execution. Pending further license support by Sapporo, perhaps a manual outer RO-Crate that aggregates these (e.g. adding a direct top-level ro-crate-metadata.json to the Zenodo entry) can provide more correct metadata as well as workflow citations. The authors could add to Discussion some consideration on (lack of) propagation of such metadata for auto-generated crates as part of workflow run provenance. 
For instance, if a workflow run was initiated from a Workflow Crate https://w3id.org/workflowhub/workflow-ro-crate/ at WorkflowHub, its license, attributions and descriptions could be carried forward to the final Workflow Run Crate provenance together with the Sapporo-calculated features.
  
  Q6: Are the data sound and well controlled? Yes, the data is sound. The testing on Mac gives null results, but the authors explain the workflows failed to execute there due to architectural differences, which is flagged as a valid concern for reproducibility. It may be worth investigating further whether this is due to misconfiguration on that particular test machine, in which case these columns should be removed.
  
  Q7: Is the interpretation (Analysis and Discussion) well balanced and supported by the data? The authors' discussion has some implicit assumptions that should be made clearer, together with their implications:
  
  - The Tonkaz tool assumes the workflow execution has already extracted the features and added them to the RO-Crate.
  - This assumes the right features have been correctly extracted by each execution.
  - Feature extraction also depends on bioinformatics tools that are subject to change/updates.
  - Newer versions of Sapporo-service, and in particular any non-Sapporo executors also making Workflow Run Crates, may have a different feature selection.
  - Being able to fairly compare two workflow runs therefore depends on careful control of the Sapporo executor versions so that they have consistent feature selection.
  - This means the proposed reproducibility metric has a potential reproducibility challenge itself.
  
  This is not to say that the approach is bad, as the feature extraction is using predictable measures such as counting sequences, rather than heuristics. This means Future Work should point out the need for guidelines on what kind of features should be selected, to ensure they are consistent and reproducible. The set of features also depends on the type of data and class of analysis. As a minimum, the RO-Crate should therefore include provenance of that feature extraction, noting the Sapporo version, and ideally the version of the tools used for that. The authors may want to consider if feature extraction should be a separate workflow (e.g. in CWL) that itself can be subject to the same reproducibility preservation measures, and therefore can also be performed post-execution as part of Tonkaz' comparison or as a curation activity when storing Workflow Run Crates.
  
  Q8: Are the methods appropriate, well described, and include sufficient details and supporting information to allow others to evaluate and replicate the work? Yes, it was very easy to replicate the Tonkaz analysis of the workflow run crate that is already provided, as the tool is also provided as a Docker container. The Docker container is provided as part of GitHub releases, and so is not at risk of Docker Hub's automatic deletion. I have not tried installing my own Sapporo service to re-execute the workflow, but detailed installation and run details are provided in the README of both Tonkaz https://github.com/sapporo-wes/tonkaz#readme and sapporo-service https://github.com/sapporo-wes/sapporo/blob/main/docs/GettingStarted.md
  
  Q9: What are the strengths and weaknesses of the methods? The method provided is strong compared to naive checksum-based comparison of workflow outputs, which has been pointed out as a challenge by previous work. The advantage of the feature extraction is that the statistics can be compared directly and any discrepancies can be displayed to the user at a digestible high level. The disadvantage is that this depends wholly on the selection of features, which must be done carefully to cover the purpose of the particular workflow and its type of data. For instance, a workflow that generates diagrams of sequence alignments could not be sufficiently tested in the suggested approach, as analyzing the diagram for correctness would require tools that may not even exist. Perhaps feature extraction should be a part of the workflow itself, so it can self-determine what is important for its analysis? The current approach is also quite sensitive to output data filenames, so changes in filename would mean features are not compared, even where such files are equivalent. This should be made more explicit in the manuscript; for instance, workflows should ensure they don't include timestamps or random identifiers in their filenames. Further work could have a deeper understanding of the workflow structure to compare outputs based on their corresponding FormalParameter in the RO-Crate.
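To illustrate the filename sensitivity raised under Q9, here is a minimal Python sketch of pairing outputs by a normalized filename before comparing numeric features within a tolerance. The data layout (filename mapped to a dictionary of feature values), the timestamp pattern and the 5% default tolerance are assumptions made for illustration; this is not Tonkaz's actual data model or comparison logic.

```python
import re

# Hypothetical pattern: strips date/time stamps such as _20230619 or _20230619-101500.
_STAMP = re.compile(r"[_-]\d{8}(?:[-_T]\d{6})?")


def normalize(name: str) -> str:
    """Remove timestamp-like fragments so equivalent outputs share one key."""
    return _STAMP.sub("", name)


def compare_features(run_a: dict, run_b: dict, tolerance: float = 0.05) -> dict:
    """Pair outputs by normalized filename and compare their numeric features.

    Each run maps filename -> {feature name: numeric value}.
    Returns normalized filename -> "same", "similar", "different" or "unmatched".
    """
    a = {normalize(k): v for k, v in run_a.items()}
    b = {normalize(k): v for k, v in run_b.items()}
    report = {}
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            report[name] = "unmatched"
            continue
        fa, fb = a[name], b[name]
        if fa == fb:
            report[name] = "same"
        elif set(fa) != set(fb):
            # Different feature sets cannot be compared value by value.
            report[name] = "different"
        else:
            # Largest relative difference across all shared features.
            worst = max(
                abs(fa[k] - fb[k]) / max(abs(fa[k]), abs(fb[k]), 1e-12)
                for k in fa
            )
            report[name] = "similar" if worst <= tolerance else "different"
    return report
```

Without the `normalize` step, `sample_20230619.bam` and `sample_20230620.bam` would both be reported as unmatched even if their features agree, which is the behaviour the review asks the manuscript to make explicit.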
  
  Q10: Have the authors followed best-practices in reporting standards? Yes, the details provided are at a sufficient level, and the authors have re-used the RO-Crate data packaging. The RO-Crates created by Sapporo-service add several terms for the metrics, which are declared in the @context according to the RO-Crate specs (see https://www.researchobject.org/ro-crate/1.1/appendix/jsonld.html#extending-ro-crate ). However, the terms point to GitHub "raw" pages, which are not particularly stable, and may change depending on Sapporo versions and GitHub's repository behaviour. I recommend changing the ad-hoc terms to PIDs such as a namespace under https://w3id.org/ or https://purl.org/ so that these terms can be stable semantic artefacts, e.g. submitting them to https://github.com/ResearchObject/ro-terms to register https://w3id.org/ro/terms/sapporo#WorkflowAttachment that can be used instead of https://raw.githubusercontent.com/sapporo-wes/sapporo-service/main/sapporo/ro-terms.csv#WorkflowAttachment or alternatively https://w3id.org/sapporo#WorkflowAttachment could be set up to redirect to the ro-terms.csv on GitHub (discussed with the authors at ELIXIR Biohackathon). In doing so you should separate the terms into two namespaces: the general Sapporo terms like "sha512", and the particular genomics feature sets including "totalReads" (e.g. https://w3id.org/datafeatures/genomics#WorkflowAttachment), as the latter are a) not Sapporo-specific and b) domain-specific. RO-Crate is developing Workflow Run profiles https://www.researchobject.org/workflow-run-crate/profiles/ and, although these have not been released at the time of my review, they are now stable, so the authors may want to check https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate to ensure "FormalParameter" entities are declared correctly in the generated RO-Crate as separate entities, linked from the "File" using "exampleOfWork".
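The two-namespace suggestion under Q10 can be sketched as a small Python helper that builds a JSON-LD `@context` extending the RO-Crate 1.1 context. The namespace IRIs below are the ones proposed in this review and are not known to be registered; the helper name `build_context` and the assignment of terms to namespaces are illustrative assumptions.

```python
import json

# Base context defined by the RO-Crate 1.1 specification.
RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"


def build_context(namespaces: dict) -> list:
    """Build an @context value from {namespace IRI: [term names]}.

    Each term is mapped to "<namespace>#<term>"; a term declared in two
    namespaces is rejected because JSON-LD keys must be unambiguous.
    """
    terms = {}
    for base, names in namespaces.items():
        for name in names:
            if name in terms:
                raise ValueError(f"term {name!r} declared in two namespaces")
            terms[name] = f"{base}#{name}"
    return [RO_CRATE_CONTEXT, terms]


# Hypothetical split: general Sapporo terms vs. domain-specific genomics features.
EXAMPLE_CONTEXT = build_context(
    {
        "https://w3id.org/ro/terms/sapporo": ["WorkflowAttachment", "sha512"],
        "https://w3id.org/datafeatures/genomics": ["totalReads"],
    }
)

if __name__ == "__main__":
    print(json.dumps({"@context": EXAMPLE_CONTEXT}, indent=2))
```

Keeping the genomics features in their own namespace would let non-Sapporo executors reuse the same term IRIs.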
  
  Q11: Can the writing, organization, tables and figures be improved? The language and readability of this article are generally very good. Light copy-editing may improve some of the sentences, e.g. reducing the use of "Thus" phrases.
  
  Q12: When revisions are requested. See suggestions from above for minor revisions:
  
  - Make explicit why these 3 workflows were selected (see Q2)
  - Make pipeline software citations consistent in manuscript (see Q2, Q5)
  - Avoid declaring CC0 within generated RO-Crate -- move this to only apply to the ro-crate-metadata.json
  - Add an outer RO-Crate metadata file to Zenodo deposit to carry the correct licenses and pipeline licenses for each of rnaseq_1st.zip, trimming.zip etc.
  - Improve discussion to better reflect limitations of the features and their own reproducibility issues (see Q7, Q9)
  - Consider improvements to the RO-Crate context (see Q10) - this may just be noted as Future Work in the manuscript rather than regenerating the crates
  
  In addition:
  
  - p2: Add citation for claim on file checksums differing depending on software versions etc., for instance https://doi.org/10.1145/3186266
  - p3: "We converted Sapporo's provenance into RO-Crate" -- re-cite (20) as this is the paragraph explaining what it is.
  - p10: Citations 7, 8 are missing authors
  - p10: Citation 15 is now published, replace with https://doi.org/10.1145/3486897
  - p0: Citations 28, 33 are missing DOI
  
  Q13: Are there any ethical or competing interests issues you would like to raise? No, the third-party pipelines selected for reproducibility testing are already published and are here represented fairly, and only used as executable methods (as intended by their original authors), which I would say do not need ethical approval.
2. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Background Reproducibility of data analysis workflow is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results are the same. Therefore, it still remains a challenge to automatically evaluate the reproducibility of results.
  
  Results We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics.
  
  Conclusions Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad031 ) , which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer Stephen R Piccolo:
  
  This manuscript describes a methodology for automating evaluation of the reproducibility of data science workflows for genomics analyses. The authors explain that reproducibility should be evaluated on a scale rather than on a binary basis. They explain concepts related to these issues and apply their methodology to real-world data. The manuscript was well written and addresses an important issue. I believe this manuscript provides new insights. I have a few minor concerns that I would appreciate being addressed:
  
  The manuscript indicates that it's not feasible to compare images automatically. However, this is actually pretty easy. For example, using the Pillow package in Python, you can calculate a percentage similarity between two image files. I'm not suggesting that the authors should do this in their study. But the text should not preclude this as a possibility.
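As a rough illustration of the image comparison mentioned above, the following Python sketch computes a percentage similarity from the mean absolute difference of pixel values. The metric itself is an assumption chosen for simplicity (the review does not specify one), and the function names are hypothetical. Only `image_similarity` needs the third-party Pillow package; the core calculation works on plain byte sequences.

```python
def pixel_similarity(a: bytes, b: bytes) -> float:
    """Percentage similarity between two equal-length sequences of 0-255 values.

    100.0 means identical; 0.0 means every value differs by the full range.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if not a:
        return 100.0
    total_diff = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * (1.0 - total_diff / (255 * len(a)))


def image_similarity(path_a: str, path_b: str) -> float:
    """Load two image files with Pillow and compare their RGB pixel data."""
    from PIL import Image  # third-party dependency, imported lazily

    with Image.open(path_a) as img_a, Image.open(path_b) as img_b:
        rgb_a, rgb_b = img_a.convert("RGB"), img_b.convert("RGB")
        if rgb_a.size != rgb_b.size:
            raise ValueError("images must have the same dimensions")
        return pixel_similarity(rgb_a.tobytes(), rgb_b.tobytes())
```

A pixel-wise score like this is sensitive to rendering details such as fonts or anti-aliasing, so a threshold would still have to be chosen per workflow.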
  
  The authors describe scenarios where the outputs might be different but these differences would be immaterial to the overall conclusions. They also describe a few scenarios where the outputs differ for biological features but that the differences are relatively small and could be considered to be acceptable. Examples include when BAM files are sorted differently. I think it would be helpful to add a bit more discussion of scenarios where differences in biological features could occur and what would cause those differences.
  
  Although a person checking the outputs can change the numeric threshold, it would be difficult to know what that threshold should be. Perhaps the authors could describe additional situation(s) where having relatively large differences would be acceptable and other situation(s) where they would not. For example, you could have a single difference in the biological feature outputs and perhaps that would make a huge difference in the interpretation in some cases. Additional discussion would be helpful.
  
  This paper focuses on automating the verification process. I think the big picture could be explained more. Who might perform this verification process in a scientific context? In what context would they do it? - Please add brief discussion about generalizing this methodology beyond Tonkaz.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.10.11.511695v2
www.biorxiv.org www.biorxiv.org

Strategies and Techniques for Quality Control and Semantic Enrichment with Multimodal Data: A Case Study in Colorectal Cancer with eHDPrep

2
1. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Background Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses.
  
  Findings We developed an R package for electronic Health Data preparation ‘eHDPrep’, demonstrated upon a multi-modal colorectal cancer dataset (n=661 patients, n=155 variables; Colo-661). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative ‘meta-variables’ according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free-text, completeness analysis and user review of modifications to the dataset.
  
  Conclusion eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to a multi-modal colorectal cancer dataset resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN [[URL will go here]].
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer Janna Hastings
  
  The manuscript describes a toolkit for the automated semantic enrichment and quality control of electronic health data using ontologies. This is a much needed utility that will add value to electronic data sharing and re-use for many different purposes including the development of machine learning for medical applications and personalised medicine. Overall the manuscript is well written and the functionality offered by the toolkit is well thought out and motivated. The internal consistency checks and the use of ontology-based information content to semantically aggregate variables into more informative meta-variables are particularly welcome functions.
  
  However, I recommend that the description of the tool functionality be clarified in some points, and the evaluation could be strengthened.
  
  Page 6-7, internal consistency:
  
  How should the user specify semantic dependencies between variable pairs? Would it not be helpful to use a standard format for this specification to enable interoperability and re-use of such specifications?
  
  Should the specification of semantic relationships between variables not be linked to the knowledge from the ontologies? Ontologies are able to represent many different types of logical relationships between classes, which make them ideal for then serving as a standard and interoperable format for specifying this type of constraint. Rules are another promising standard approach for logic-based knowledge representation.
  
  Page 11, figure 4 a: I think it would be informative for evaluating the operation of the tool if the heatmap of variable missingness after application of the tool could also be illustrated beside the current Fig 4a.
  
  Page 13, ontology preparation: The paragraph describes what the authors have done to prepare ontologies for use with the tool. Is this preparation procedure also necessary for users to follow when they use the eHDPrep tool? How can alternative ontologies be incorporated (which may be useful for other domains)?
  
  Evaluation: The biggest shortcoming of the presented manuscript is that the evaluation is limited to the application of the tool to one dataset and subsequent manual evaluation of the outcome by one group, the study authors.
  
  The results as presented are positive, but there is a significant risk that the tool performs well on this task, as assessed by these study authors, but then fails to generalise to other tasks and datasets that future users might wish to use it with. To mitigate this risk, it would be optimal if somewhat more independent methods could be found for evaluating the performance of the different aspects of the tool. One approach could be a rigorous comparison of this tool's performance against the performance of other tools that have similar functionality, e.g. comparison of the semantic aggregation function with other tools that find and recommend MICAs. An alternative approach might be to apply the tool to an additional dataset for which a group outside of the study authors would be prepared to provide an independent evaluation.
2. GigaScience 19 Jun 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer Hugo Leroux
  
  This well-written paper describes techniques for semantically-enriching clinical data pertaining to colorectal cancer diagnosis. It describes an R-based tool, eHDPrep, to extract the data, which is subsequently cleaned, actioned for missing and erroneous values, encoded and enriched semantically using SNOMED CT and the GO, and ultimately exported after having undergone some QC. The paper is well-written and the methods really well-explained, for which the authors should be commended. I only have a few comments for the authors:
  
  It is not clear to me how, in the discussion on page 14, the authors have dealt with the issue of representing negative findings and missing values, as described within their enrichment outcomes section.
  
  In the "Ontology Preparation" section, the authors describe how they have taken the SNOMED CT terminology and performed some transformations to OWL and conversion to CSV format before mapping the Colo-661 variables to it. They don't, however, discuss the challenges that such an approach entails. The authors might consider perusing this article (https://doi.org/10.1186/s13326-018-0191-z), which addresses many of the challenges relating to ontology matching.
  
  Please insert an additional ")" when stating the "Equations", e.g. page 6: "... zero entropy [27] (Equation (1)) ...", and also on page 13.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.09.07.506953v1
www.biorxiv.org www.biorxiv.org

TF-Prioritizer: a java pipeline to prioritize condition-specific transcription factors

3
1. GigaScience 19 Jun 2023
  
  in GigaScience
  
  Background Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results.
  
  Results We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences.
  
  Conclusion TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer Kaixuan Luo
  
  This paper develops a novel pipeline, TF-Prioritizer, to prioritize condition-specific TFs through integrative analysis of histone modification (HM) ChIP-seq and RNA-seq data. The pipeline integrates multiple computational tools: it calculates TF binding site affinities and links candidate binding sites to genes using TRAP and TEPIC. It uses DYNAMITE, a sparse logistic regression classifier, to infer TFs related to differential gene expression between conditions. It computes an aggregated score, the "TF-TG score", to score TFs from multiple types of evidence, and obtains a prioritized list of TFs from all histone modifications using a discounted cumulative gain ranking approach. It also provides additional functionality and a web interface to visualize the results.
  
  Overall, the pipeline could be very useful for biologists with a user-friendly web application to automate the entire process from data preprocessing to statistical analysis and obtain interactive reports to gain novel biological insights. However, more systematic evaluations are needed to demonstrate the benefits of this pipeline.
  
  Major comments:
  
  In the computation of an aggregated score "TF-TG score", it uses a multiplicative function to combine differential expression (absolute log2FC), TF-Gene scores computed from TEPIC, and the total coefficients computed from DYNAMITE. One concern about this approach is that it may miss some TFs with support from only one or two types of evidence. In Fig 5, we see diffTF identifies a lot more TFs than TF-Prioritizer. I don't think we can conclude that diffTF is less specific than TF-Prioritizer simply based on the number of TFs prioritized. Could some of the TFs identified only by diffTF be important but missed by TF-Prioritizer? I would like to see more detailed analysis comparing the lists of TFs identified by diffTF and TF-Prioritizer. Other evidence or metrics in addition to the number of prioritized TFs would be helpful to evaluate the plausibility of the prioritized lists of TFs.
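The concern about multiplicative combination can be made concrete with a small Python sketch. The evidence values are invented for illustration, and the additive alternative is shown only for contrast; neither function is TF-Prioritizer's actual implementation.

```python
from math import prod


def multiplicative_score(evidence: list) -> float:
    """Product of evidence values: any single zero removes the TF entirely."""
    return prod(evidence)


def additive_score(evidence: list) -> float:
    """Mean of evidence values: a TF with partial support keeps a non-zero score."""
    return sum(evidence) / len(evidence)


# Hypothetical TF with a strong expression change (|log2FC| = 2.0) and a good
# TF-Gene score (0.8), but a regression coefficient of 0.0.
partial_support = [2.0, 0.8, 0.0]
```

With these values the product is 0.0 while the mean stays above zero, which is why a TF supported by only one or two evidence types can drop out of a multiplicatively scored ranking.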
  
  It is hard to interpret and evaluate the contribution of the evidence for prioritized TFs. Figure 6b is helpful, but it is unclear how the users would be able to evaluate the contribution of the components. Does the software run each of the combination separately and outputs a list of prioritized TFs under each combination?
  
  The TEPIC2 paper has already developed a very comprehensive pipeline, including TF affinity calculation by TRAP and computation of TF gene scores by TEPIC, as well as logistic regression to identify TFs between conditions by DYNAMITE, and it is already well parallelized. The authors should clearly list the novel contributions from this work. It would be helpful to have a table comparing the functionalities and technical features between TF-Prioritizer and TEPIC2.
  
  The software takes histone modification ChIP-seq and RNA-seq data as input. It would significantly broaden the use of the software if it supported DNase-seq and/or ATAC-seq, which are widely used. If this software can take ATAC-seq or DNase-seq data as input, it is important to include those data types and provide some examples to illustrate the usage and performance.
  
  The software combines multiple histone modification ChIP-seq datasets using a discounted cumulative gain ranking approach. However, different types of histone modifications have different epigenomic functions and different combinations indicate different chromatin states. Some TFs may be only enriched in a small subset of histone modifications (already discussed by the authors) and may be missed by the simple discounted cumulative gain ranking approach. The authors should provide prioritized TFs from each histone modification ChIP-seq dataset, and evaluate which TFs were prioritized by all the combined datasets, and which TFs by only one dataset. Also, some ChIP-seq datasets may be of poor quality. Does the software provide other options to rank the TFs from different epigenomic datasets? e.g. set different weights for different epigenomic datasets, etc.
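The question about per-dataset weights can be sketched in Python as a generic discounted-gain rank aggregation, in which each dataset contributes `weight / log2(rank + 1)` to a TF's total. This is an illustrative assumption about how such weighting could look, not the ranking code used by TF-Prioritizer; the function name and input layout are hypothetical.

```python
from math import log2
from typing import Optional


def aggregate_rankings(rankings: dict, weights: Optional[dict] = None) -> list:
    """Combine per-dataset TF rankings into one list, best TF first.

    rankings: dataset name -> list of TF names ordered best first.
    weights:  optional dataset name -> weight (default 1.0 per dataset).
    Ties in the aggregated score are broken alphabetically.
    """
    scores = {}
    for dataset, ranked_tfs in rankings.items():
        w = 1.0 if weights is None else weights.get(dataset, 1.0)
        for position, tf in enumerate(ranked_tfs, start=1):
            # Discounted gain: top ranks contribute most, lower ranks decay.
            scores[tf] = scores.get(tf, 0.0) + w / log2(position + 1)
    return sorted(scores, key=lambda tf: (-scores[tf], tf))
```

Under equal weights, a TF ranked first in only one dataset is outranked by TFs that appear in several; raising that dataset's weight moves it up, which is the kind of control the comment above asks about.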
  
  The authors conducted co-occurrence analysis based on the overlapping of peaks. It is unclear if the method calculates some statistical measure (e.g. p-value) for the significance of co-occurrence. Also, since the TRAP model generates a quantitative measure of TF binding affinity, I am curious to see if the quantitative TF binding affinities are also correlated for those co-occurring binding sites.
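One common way to attach a p-value to peak co-occurrence is a hypergeometric tail test. The Python sketch below assumes the genome has been reduced to a fixed number of candidate regions and that binding events are independent under the null; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
from math import comb


def cooccurrence_pvalue(n_regions: int, n_a: int, n_b: int, n_both: int) -> float:
    """Upper-tail hypergeometric p-value for the overlap of two TFs' peaks.

    n_regions: total candidate regions considered
    n_a, n_b:  regions bound by TF A and by TF B
    n_both:    regions bound by both
    Returns P(overlap >= n_both) if B's regions were drawn at random.
    """
    upper = min(n_a, n_b)
    total = comb(n_regions, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_regions - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total
```

For example, with 10 regions, 5 bound by each TF and all 5 shared, the p-value is 1/252, because only one of the 252 possible draws gives a complete overlap.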
  
  Minor comments:
  
  1. In Figure 1, it would be helpful to highlight which steps were already implemented in existing tools (and label the tools used), and which steps are novel in this study.
  2. H3K4me3 data seems to be missing in the L10 time point. How does the method handle missing data?
  3. It is unclear how the Pol2 ChIP-seq data was used in this study. Was it included in the model or only in the downstream analysis?
  4. It is hard to interpret the browser tracks of the TF predictions ("Predicted xxx") in Figure 3 and 4. Please add more details about those tracks.
  5. Figure 6: the authors should provide more details to help understand this figure, especially panel b. The figure legend is too short.
2. GigaScience 19 Jun 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Roza Berhanu Lemma
  
  In this manuscript, Hoffmann and Trummer et al. report a new automated pipeline that utilizes existing methods, namely (1) DESeq2 to perform differential gene expression analysis between sample groups, (2) TEPIC, a method that links CREs to genes using the biophysical model TRAP, and (3) DYNAMITE, which provides an aggregate score for TF-target genes that determines the contribution of TFs to condition-specific changes between sample groups. Finally, the pipeline uses a Mann-Whitney U test to prioritize TFs by comparing a background distribution with a ChIP-seq-specific TF distribution, which allows the identification of TFs with roles in condition-specific gene regulation. Their pipeline allows large-scale processing of data and returns a feature-rich and user-friendly interactive report.
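  The prioritization step described above rests on a Mann-Whitney U test between a TF-specific score distribution and a background distribution. The following is a minimal, standard-library sketch of a one-sided test using the normal approximation; it is not TF-Prioritizer's implementation, and the function names and example scores are hypothetical, chosen only for illustration.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample, background):
    """One-sided test that `sample` tends to exceed `background`.

    Returns (U, p), where p comes from the normal approximation with
    tie and continuity corrections (reasonable for moderate sample sizes).
    """
    n1, n2 = len(sample), len(background)
    combined = list(sample) + list(background)
    ranks = rank_with_ties(combined)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence either way
        return u, 1.0

    z = (u - n1 * n2 / 2 - 0.5) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return u, p


# Hypothetical scores for one TF versus a background of all TFs.
tf_scores = [0.90, 0.80, 0.85, 0.95]
background_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
u_stat, p_value = mann_whitney_u(tf_scores, background_scores)
```

  In practice, a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred, since it also offers an exact p-value for small samples.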
  
  The authors demonstrated how to use TF-Prioritizer using public datasets from a mouse mammary gland development study and performed independent validation using datasets from ChIP-Atlas. They were able to capture both known TFs with previously reported roles in mammary gland development/lactation and new TFs that may have a role in these processes. The work is very well thought out and executed, but to raise its quality further, the authors should address the following points.
  
  Major:
  
  Although their validation nicely portrays the potential application of their pipeline in answering biological questions, my concern is that this could be an isolated case. Therefore, the authors should test their pipeline using another example dataset to convince their readers. One suggestion is to run TF-Prioritizer on one of the deeply profiled cell lines (e.g. K562, MCF-7, etc.) to investigate TF prioritization, e.g. during differentiation (change of cell fate), and see whether lineage-determining TFs are prioritized in such cases. This may highlight the versatility and robustness of TF-Prioritizer. This is also important because your readers are not (certainly not all of them) from the mammary gland development field. As such, dedicating a large portion of your discussion to this process is too much. Capturing more than one specific developmental process would do the paper a great favor by highlighting the different ways TF-Prioritizer can be used, which in turn may attract more users to your pipeline.
  
  I have an issue with how the 'Results and Discussion' section is organized. The authors dedicated a separate subtopic to each TF they prioritized and reviewed the literature on its role in mammary gland development and lactation. My recommendation is to instead have one subtopic and discuss these TFs paragraph by paragraph in a concise manner. A more concrete way to reorganize this would be to separate them into two subtopics: (1) known TFs with a role in mammary gland development/lactation and (2) novel TFs with a predicted role in mammary gland development/lactation. To make this reorganization easier, cut down the details of what you observe in the figures (e.g. p16, lines 22-27 and p17, lines 1-3), discuss the main message, and put the detailed text about the figures in the figure captions.

  3. All figures and tables should have more information in the caption, including those in the 'Supplementary Material'.

  Minor:

  1. p7 line 9: how often does one find these combinations of data types (modalities) in the different conditions, cell types or models being studied? Could some of the HMs be replaced with other data modalities, e.g. ATAC-seq, DHS data or data from other chromatin profiling methods? Could the pipeline be adapted to incorporate CUT&Tag/CUT&RUN, or is it specific to ChIP-seq data only? The authors should discuss whether this is possible or not.

  2. p13 line 3: the authors discuss that "ChIP-Atlas provides more than 362,121 datasets for six model organisms…". Could TF-Prioritizer be easily adapted to other databases/resources that ChIP-Atlas does not cover (e.g. for other organisms) and that the community might be interested in?

  3. p14 line 2: "... expressed gene for this analysis but focus on affinities only". Why this is the case is not argued or discussed. This and other parameter choices would be best discussed under a separate subtopic to inform future readers/users of TF-Prioritizer.
  
  Figures should be cited in chronological order. Adjust the text or reorder the figures
  
  When the authors discuss the evaluation of the prioritized TFs in separate sections, they often start with "In Figure Xa) …" and "Figure Yc) shows that …", etc, such kind of texts best fit as Figure captions instead of in the 'Results and Discussion'.
  
  p21 line 16, "We predicted that several Rho GTPase-associated genes are regulated by the predicted TFs". This sentence sounds a bit circular; you may rephrase it as follows: 'We propose that our predicted TFs regulate several Rho GTPase-associated genes'.

  7. Figures 3 and 4 have the same general message/purpose and look redundant. This is reflected in the phrases '...(black arrows) as they are already known to be crucial in either mammary gland development or lactation.' and 'In the heatmaps, we can observe a clear separation of these target genes between the time points X and Y…'. I suggest the authors choose one of them as a main figure and place the other in the Supplementary Material.
  
  In the captions of Figs. 3 and 4, the authors should indicate what the black boxes represent. One can guess what they are from the main text, but the captions would profit from a more detailed explanation. You should at least describe some of the things that need to be highlighted in the figures to guide your readers.
3. GigaScience 19 Jun 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Xiaowo Wang

  Markus et al. developed a new pipeline, TF-Prioritizer, to discover potential cell- or tissue-specific transcription factors (TFs) from ChIP-seq data of histone modifications and RNA-seq data. TF-Prioritizer is mainly based on the framework of the state-of-the-art method TEPIC to model TFs regulating genes. The authors extend TEPIC by integrating more information, such as differential gene expression using DESeq2, and by linking TF binding in cis-regulatory elements to gene expression using DYNAMITE. They also designed a new statistical method to rank the TFs across different cell types or across time-series samples. The authors also provide some cases to validate the pipeline. The pipeline is useful in biomedical research. The manuscript is well written and provides enough details. Addressing or further considering the following issues may benefit readers.

  1. TF-Prioritizer requires ChIP-seq of histone modifications (HMs) as input. It may support different types of HMs. Users may want to know how to choose a proper set of HMs. The authors should evaluate some cases to show TF-Prioritizer's performance with different input HMs.

  2. ATAC-seq is more widely available for different kinds of cells and tissues. It seems TF-Prioritizer could also be applied to ATAC-seq peaks. Why does TF-Prioritizer not support ATAC-seq now?

  3. On page 11, there may be some mistakes in the definitions of BG(m) and FG(t,m). Should t \in TF(m) be moved from BG(m) to FG(t,m)?

  4. The software is hard to install without a sudo/root account. It would be better to provide a Docker image that is ready for users to run the software.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.10.19.512881v3
May 2023
www.biorxiv.org www.biorxiv.org

The Crown Pearl V2: an improved genome assembly of the European freshwater pearl mussel Margaritifera margaritifera (Linnaeus, 1758)

1
1. GigaScience 18 May 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.81), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Jin Sun **
  
  Gomes-dos-Santos et al. have upgraded the genome of the freshwater mussel Margaritifera margaritifera using long-read sequencing. Overall, this version is dramatically improved compared to the former one, with an increased N50 value and BUSCO score and a decreased number of contigs. Considering the important economic value of M. margaritifera and the high quality of the assembly, I must congratulate the authors on this. However, in contrast to the high-quality assembly, I am a bit concerned about the genome annotation. To me, the number of predicted gene models is somewhat high compared with other molluscan genomes. This is also reflected in the low proportion of gene models that can be annotated by Swissprot, GO, etc. I suspect that the high number of gene models could be a consequence of applying only ab initio evidence in the current study. More sophisticated approaches, such as EVM or MAKER, should be used to see whether the number of gene models can be reduced without sacrificing the BUSCO scores of the gene models.
  
  Line 76, The official name shall be “Oxford Nanopore Technology (ONT)”.
  
  Fig. 1, it is interesting to see the wide distribution of M. margaritifera. I am a bit interested to know whether there are any genetic differentiations between the European population and the North American population.
  
  **Reviewer 2. Rebekah L. Rogers **
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Y. All methods seem standard and high quality for a genome release.
  
  If the authors could add a table comparing with other Unio genomes, that might be helpful: gene numbers, BUSCO scores, N50s, and other relevant stats. It will help readers see the value of this more contiguous genome.
  - V. ellipsiformis (Renaut et al.)
  - M. nervosa
  - P. streckersoni
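  Since N50 is the headline contiguity statistic in both reviews, a short sketch of how it is commonly computed may help when assembling such a comparison table. This assumes the usual definition (the largest contig length L such that contigs of length at least L cover half of the total assembly); the function name and the example lengths are illustrative only and are not taken from the manuscript.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids float rounding
            return length


# Toy assembly: total length 54, half is 27, reached by 10 + 9 + 8.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)
```

  Because N50 depends on the total assembly size, values are most comparable between assemblies of similar total length.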

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.11.528107v1
www.biorxiv.org www.biorxiv.org

Mycobacterial Metabolic Model Development for Drug Target Identification

1
1. GigaScience 06 May 2023
  
  in GigaByte
  
  ABSTRACT
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.80), and has published the reviews under the same license. These are as follows.
  
  **Reviewer 1. Grace Mugumbate **
  
  Please add additional comments on language quality to clarify if needed
  
  Yes. First-person reporting has been used, with the word "We" used extensively.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. There is need to specify the type, size, standardisation and curation of the data that was used, especially when additional data was obtained from different databases.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  Yes. Sources of data are indicated in the paper, however the size of the data sets and type of data is not clear.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. There is need to give more detail in the methods for reproducibility.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. Validation was performed; however, no statistical analysis was mentioned.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. More detail is needed on data retrieval to allow reuse of the dataset.
  
  Additional Comments:
  
  The authors presented their work entitled 'Mycobacterial Metabolic Model Development for Drug Target Identification'. This is very innovative work that led to the generation of M. leprae and M. abscessus models, important tools for drug target identification. Target identification for a number of infectious diseases provides information for structure-based molecular modification of new and alternative diseases. The target-specific compounds will help reduce side effects, among other things. Generation of the models by the authors is commendable.
  
  There are a few corrections: 1) Under Abstract, line 4: Please note that Mycobacterium tuberculosis is not a disease but the bacterium that causes the disease tuberculosis. 2) Methods, GEM reconstruction, curation and simulation: (i) Line 2: Name the "other organism specific databases". (ii) Give a brief description of COBRApy and GLPK even if the source has been given. 3) The Methods section needs to be more informative to allow for reproducibility.
  
  **Reviewer 2. Nagasuma Chandra **
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. It would be useful if the authors could comment on how the models vary between the two species and with respect to M. tuberculosis. Specifically, a note on how the authors deal with alternate enzymes and whether they included enzymes specific to each species, would be helpful.
  
  Is the validation suitable for this type of data?
  
  Yes. A figure depicting the overall capability of the models would be useful
  
  Additional Comments:
  
  Genome-scale metabolic models are useful to the community as they can be used to address a variety of questions. It would be useful if the authors could include a section on the comparative performance of the models and link it to the known metabolic capability of these microbes.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.31.534705v1
www.biorxiv.org www.biorxiv.org

An accessible infrastructure for artificial intelligence using a docker-based Jupyterlab in Galaxy

2
1. GigaScience 02 May 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad028), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Philippe Boileau
  
  This manuscript introduces the new docker-based JupyterLab framework in Galaxy, describing its core components and demonstrating its use in the reproduction of two analyses. The proposed framework is also thoroughly compared to competitors, like Google’s Colab and Amazon’s SageMaker. This tool is bound to have an impact on the life sciences: it democratizes computational analyses and facilitates reproducibility. I thank the authors for their important work. However, I think that this technical note should be reviewed for grammatical errors and faulty punctuation. I’ve identified some such issues in the comments below but wasn’t able to address all of them. Included in the comments are other remarks which, if addressed, could strengthen some key takeaways.

  • The first sentence of the abstract states that AI programs require “powerful compute infrastructure” when applied to large datasets. I think readers would like to know how you qualify an infrastructure as “powerful”. A brief definition could be included in the second sentence instead of repeating “. . . hosted on a powerful infrastructure . . . ”.
  • Is it “JupyterLab” or “jupyterlab notebook”? The Project Jupyter site seems to use the former. Based on the documentation, JupyterLab is a web-based user interface that can open Jupyter notebooks (.ipynb files).
  • The statement “Artificial intelligence (AI) approaches such as machine learning (ML) and deep learning (DL) . . . ” implies that ML and DL are distinct aspects of AI. This distinction is insinuated throughout the rest of the text. Isn’t DL a subset of ML? I suggest replacing “ML and DL algorithms” by “ML algorithms” and specifying “DL algorithms” only as needed.
  • I believe there’s a missing comma between “ecosystems” and “enabling” in the first sentence of the Docker container section.
  • Consider reformatting “A container runs . . . of the running software.” to “A container runs an isolated environment with minimal interactions between it and the host OS. Running software in a container is more secure.”
  • Related to the suggestion above: Can you explain why this increased security is necessary? An example might help emphasize the importance of a secure container.
  • I think “Docker container inherits . . . ” should be “The Docker container inherits . . . ”. Same goes for “Docker container is decoupled . . . ”.
  • Consider reformatting “Moreover, it can easily be extended by installing suitable packages only by adding their appropriate package names in its dockerfile.” to “Moreover, the Docker container is easily extended: additional software packages can be installed by adding their names to the dockerfile.”
  • Consider replacing “some of the popular ones are” by “including”.
  • I believe there’s an unneeded comma between “. . . platform for both” and “rapid prototyping. . . ”.
  • I believe that there’s a missing word in the last sentence of the Features of jupyterlab and notebook infrastructure section: “. . . an H5 file.”
  • “google” and “amazon” should be capitalized.
  • Consider removing “and non-ideal” from the Related infrastructure section.
  • I believe the comma in “. . . but they come at a price, . . . ” should be replaced by a colon.
  • I believe there’s a missing comma between “. . . free of charge” and “similar to colab . . . ”.
  • Why is sharing a session’s resources across multiple notebooks more useful than operating each notebook in a separate session? Isn’t the latter preferable when a notebook causes a session to crash?
  • “deep learning” in the Implementation section should be replaced by “DL” for consistency.
  • I think that readers would find a link to your tool on Galaxy Europe useful: https://usegalaxy.eu/root?tool_id=interactive_tool_ml_jupyter_notebook. The same is true for your tutorial: I think readers would find a URL in the text more easily than in the references. However, the tool failed to execute on usegalaxy.eu with the following error message: “This tool is restricted to authorized users”. I was unable to follow the tutorial. Was this a one-off issue with the Galaxy servers?
2. GigaScience 02 May 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer 2: Milot Mirdita
  
  Kumar et al. present a Docker-based integration of Jupyter Notebooks in the Galaxy workflow system that can utilize GPUs. This notebook is also available in the Galaxy Europe instance.
  
  I was able to create a Galaxy Europe account, find the newly introduced Galaxy tool and submit a job. However, it remained stuck with the message "This job is waiting to run" and the job info "Stopped" for multiple hours. I was able to download the docker image and run it on a local server with multiple Nvidia GPUs. This resulted in a running Jupyter Lab, however running the GPU based examples resulted in driver mismatch errors/warnings (pynvml.nvml.NVMLError_LibRmVersionMismatch: RM has detected an NVML/RM version mismatch; kernel version 470.141.3 does not match DSO version 515.65.1 -- cannot find working devices in this configuration). Thus, the examples ran on CPU only. I did not try to resolve this issue and only repeated some examples.
  
  The authors show two use-cases for the GPU Jupyter Docker and provide a step-by-step tutorial for usage on Galaxy Europe. Shipping machine learning applications that utilize GPUs as Jupyter Notebooks has become popular recently and supporting these through well-known and freely accessible Galaxy servers, such as Galaxy Europe, would be of clear benefit to users. Additionally, it would be very valuable for method developers like me to easily deploy GPU-based methods to Galaxy servers.
  
  Major:
  - As mentioned before, I had issues getting a running Jupyter Lab on the Galaxy Europe server. Is this due to a limited number of GPUs or was this due to an error?
  - Our ColabFold Multiple Sequence Alignment server currently processes about 10-20k MSAs per day. We do not know how many of these are running on Google Colab or on users' local machines. However, a substantial number of predictions are running inside Google Colab. The authors claim that Google Colab's and Kaggle's resources are scarce. However, generally, users (with either free or pro accounts) are given an instance nearly immediately on Colab. I recognize that it is extremely difficult to compete with these commercial platform providers. However, providing a long-term, freely available and securely funded platform with ML accelerators would be extremely beneficial for the whole community. I would like to see a discussion on what GPU resources are currently available to users of Galaxy Europe (and the whole Galaxy Project) and what plans exist to expand these in the future.
  - The size of the docker container (compressed ~10GB, uncompressed ~22GB) seems difficult to sustain. Both keeping up an up-to-date Docker image and ensuring the availability of older images for reproducibility looks difficult to me, especially with such fast moving dependencies such as machine learning frameworks. How do the authors plan to deal with this issue?
  
  Minor:
  - Please highlight the tutorial (https://training.galaxyproject.org/training-material/topics/statistics/tutorials/gpu_jupyter_lab/tutorial.html) on GitHub and inside the container readme (home_page.ipynb). It is very easy to overlook. I also nearly overlooked the example notebook repository (https://github.com/anuprulez/gpu_jupyterlab_ct_image_segmentation). I found it confusing that I could not find the two shown example use-cases inside the Docker container. I only later figured out that I have to clone the example repository into the running container.
  - The manuscript highlights various workflow methods (elyra, kubeflow, airflow), however it needs clarification on how the Galaxy workflow integration works. I saw that it is possible to give input of another Galaxy output to the tool. I would appreciate a tutorial on how to make the GPU Jupyter Docker into part of a Galaxy workflow with multiple tools running. I think the above mentioned tutorials can be expanded to show how the output can be given to the next tool.
  - Docker Hub has introduced many business-model changes such as deleting container images that are rarely used, which poses a challenge for reproducibility. I know that Dr Grüning is involved in the Biocontainers project. I would recommend investigating if it is possible to combine these efforts to make this GPU container and derived containers long term available.
  - The Docker container is explicitly running as a root user, while the manuscript highlights the security benefits of Docker. The cited report by Baset et al. highlights the security benefits and the many security challenges that Docker containers pose. I suggest checking what security best practices for Docker containers are possible to implement, while still allowing GPUs to be exposed to users.
  - I recommend revising the manuscript for conciseness, with an additional focus on capitalization of words.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.07.08.499333v1
www.biorxiv.org www.biorxiv.org

DivBrowse – interactive visualization and exploratory data analysis of variant call matrices

2
1. GigaScience 02 May 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad025), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Weilong Guo, PhD
  
  Patrick König and colleagues have built a web application for the interactive query, visualization and analysis of genomic diversity data, supporting population structure analysis on specific genetic elements, and data export. The application can also be easily used as a plugin for existing web applications. According to its documentation, this application can be easily installed from pip, Docker and conda, which would be useful for population genomic studies. There are still several concerns about this manuscript.
  
  Major concerns:
  
  As for the SNP visualization function, only a very limited number of SNPs can be displayed on the webpage, without functions such as "zoom in" or "zoom out" (it is suggested to add these or similar functions). Although the application can export almost all the SNP sites of a whole VCF file, it is far from user-friendly. It is suggested to add a chromosome track showing the genomic window being queried, allowing the cursor to select or adjust the genomic region (UCSC-browser style), which is necessary for an intuitive user experience.

  The BLAST function could serve as a useful entry point. But what is the starting position of the query sequence when it is mapped on the minus strand? The authors should explain this more clearly on the website.

  The authors mentioned that their application converts the input VCF file into Zarr format. Thus, more performance evaluation should be reported to show the advantages of this strategy (rather than using the VCF file directly).

  The authors should also compare their application with other similar existing web applications, such as CanvasDB, Gigwa, SNiPlay and SnpHub, to highlight its advantages and improvements.
  
  Minor concerns:
  
  The analysis functions are still insufficient. Commonly used analysis tools or methods, such as haplotype analysis, STRUCTURE analysis, distribution of nucleotide diversity and selection sweep analysis, are also suggested to be supported.
  
  Ref. 22 is not completed.
2. GigaScience 02 May 2023
  
  in GigaScience
  
  Background
  
  Reviewer 2: Armin Scheben
  
  The authors present the web app DivBrowse for visualizing genomic variant data. Their code is publicly available, and their web app is well-documented and provides several demonstration implementations for human, mouse and barley. The manuscript is well-written and concisely covers the key features of DivBrowse and summarizes the implementation of the software.
  
  I was able to test the demonstration website and was impressed with how smoothly everything ran and was set up. Due to time constraints, I was not able to test the installation and set up of DivBrowse but the documentation looks sufficient to allow easy set up by experts. Overall, I think this is a useful contribution to the community. One key issue I believe the authors should address, however, is that the manuscript presents DivBrowse in a vacuum, not providing much mention of or comparison with existing software with overlapping functionality. Below I provide some further details to illustrate my point and how it might be addressed, as well as listing several other minor comments.
  
  Main comment
  
  The authors rightly indicate in their introduction that the growing amounts of genomic data generated require robust solutions for visualization and exploration that do not require use of the command-line. But the authors fail to mention that there exists a considerable ecosystem of software that already does this. Moreover, some of the software available offers substantially expanded features compared to DivBrowse.
  
  To help readers better decide when DivBrowse might be the right choice for their needs compared to other options, the authors could cite existing software and provide some comparison. My knowledge of all available software is not exhaustive, but Wang et al. 2020 (https://doi.org/10.1093/gigascience/giaa060) in their publication of SnpHub provide a comparison table including SnpHub itself and Jbrowse. I would consider both of these tools for exploration and visualization of SNPs and additional data, similar to DivBrowse. Jbrowse is relatively widely used and considerably more feature-rich. The standalone offline tool TASSEL (https://academic.oup.com/bioinformatics/article/23/19/2633/185151) also offers many options for visualisation and exploration and analysis of VCF data offline. There may also be other tools I am not aware of, and readers would likely benefit from some brief overview of the landscape and the pros and cons of each piece of software and what differentiates DivBrowse.
  
  Minor comments
  
  The authors can consider the minor comments below as 'take it or leave it' comments. I do not think it is essential to address these, but in my view they may enhance the manuscript.
  
  1) In the discussion, the authors point out the efficiency and low latency of DivBrowse, however this is not quantified in the manuscript. If it were technically feasible without substantial effort, it might be useful to quantify in some way just how efficient DivBrowse can be, especially if this could be one of the stand-out features of DivBrowse.
  
  2) The authors use divergence Bezier curves to increase the amount of variant calls that can be visualized. This is helpful and a useful default. However, invariant sites can also be of considerable evolutionary and breeding/medicinal interest. When collapsing invariant sites, they become indistinguishable from unmapped regions. This is a fundamental issue and many VCF files may not encode information on invariant sites, so it may not be possible to develop robust functionality that allows users to also show invariant sites optionally. Still, this point may be worth briefly mentioning in the discussion, if the authors agree it is noteworthy.
  
  3) One advantage of visualization of relatively raw data like SNPs is that it can reveal patterns that are less obvious in other types of data exploration. To fully take advantage of this, tools like Jbrowse allow export of the browser window in SVG format, allowing users to incorporate images into high-resolution figures. I don't expect the authors to necessarily implement this feature for this review, but it may be worth adding it to the list of potential enhancements that could be implemented based on user demand.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.09.22.509016v2
www.biorxiv.org

The Regulatory Mendelian Mutation score for GRCh38

4
1. GigaScience 02 May 2023
  
  in GigaScience
  
  Motivation
  
  Reviewer 2: Mulin Jun Li
  
  In this manuscript, the authors update their previous ReMM score to the GRCh38 human genome build and support a convenient and fast data source. The authors then use several examples to demonstrate the usability of the resource. It is original to point out the differences between prioritization tools across genome builds. However, we have the following concerns and comments:
  
  Major: 1. How were variants with missing values in the test datasets handled when comparing the new ReMM with other tools? The authors mention that ExPecto annotated only half of the million negative variants. 2. Although CADD used the same negative training dataset, it is not suitable to compare it on the ReMM training dataset. How do those tools perform on independent test datasets? 3. The authors presume that the new genome build will give better performance. Is there evidence to support this view, such as the distribution of features or training data in the different genome builds? 4. Other similar existing tools can prioritize disease-causal noncoding variants, such as regBase-PAT, NCBoost, ncER, etc. Can the authors compare the new version of ReMM with these tools?
2. GigaScience 02 May 2023
  
  in GigaScience
  
  Motivation
  
  Reviewer 3: Wyeth Wasserman
  
  SYNOPSIS The manuscript describes an updated release of the ReMM regulatory variant mutation scoring system. The paper presents the performance of an updated version of the system and describes how it was applied to the most current release of the reference human genome.
  
  OVERALL PERSPECTIVE This is a valuable resource for the community of researchers and clinicians working on the interpretation of genetic variants in the human genome. The work appears to be thoughtfully done and appropriate assessments have been provided. The use of the random forest models to weigh the contributions of features was particularly noted for the insights it provided into how features contribute to prediction. My biggest concerns are stylistic, which falls outside the scientific quality of the work. I provide these comments for the authors to consider and do not expect that my stylistic preferences will be uniformly accepted. A fair amount of justification of the manuscript focuses on the value of having a release for version 38 of the human genome, pointing to the field as not having done so broadly. I think this is misguided, as by the time people are reading the manuscript such points will have lost relevance. I suggest a focus on the science be given, as there is no need to justify things based on where other resources have progressed in releasing their version 38 updates. Points below include language/text clarifications that can be assessed by the authors. Writing styles differ, so stylistic comments should be optional.
  
  MAJOR POINTS None. Well done and clearly presented.
  
  MINOR POINTS 1. The word "various" is vague and often shows up when people are too busy to provide an accurate statement. Starting the manuscript with it makes a bad impression on this reader. You do not have to change it, but I thought you might appreciate knowing this impression. You could delete it with no harm to the sentence. (Not to get carried away, but the next sentence starting with "some" heightens the impression of 'hand waving'.) 2. I think I understand ", we apply cytogenic band-aware cross-validation using ten folds" but I encourage the authors to provide clearer wording for this point. 3. I would allow the reader to make their own judgement of performance. So please remove "excellent" from "we achieve an excellent performance" 4. "Rather than using ReMM scores for ranking, some users need to specify score thresholds" is confusing. I would change 'need to' to 'choose to' 5. "with lots of false positives" is a bit informal. I suggest "with a high false positive rate" 6. I am confused by "from three genomic regions (genic content and not overlapping with assembly gap changes) " as the brackets include two items, not three. 7. "maybe due to better mapping" - "maybe" should be "may be" 8. I think the language like "seems to be the only tool directly trained on training data and features derived from GRCh38." Is not particularly valuable long term. This is a useful contribution, but many tools are being updated to 38 and by the time this appears and is read, such statements decline in relevance. I would focus on providing this valuable resource, and not try to justify it based on a transient perception of where the field stands in updating versions. 9. "It is worth noting that in the context of extremely unbalanced data…" - you do note it. So I would change the wording to "In the context of extremely unbalanced data…"
3. GigaScience 02 May 2023
  
  in GigaScience
  
  The Regulatory Mendelian Mutation score for GRCh38
  
  Reviewer 1: Yan Guo
  
  In the abstract "Some methods and annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software and pipelines was slow." Not sure what the author is referring to by some methods, this could be a grammar problem.
  
  "Restricting variants to non-coding only removes a small proportion of variants", what is the proportion? Also, I don't understand the need to remove coding variants; shouldn't your model also work with coding variants?
  
  The method the author used is based on a previous publication. However, there is still the need to give the detail of the method in this manuscript. There is a lot of missing information. For example, what is the outcome, whether a position is deleterious? How is the probability for deleteriousness calculated?
  
  by a few specific variants. Thus, the overall Mendelian disease-related variants should be low. I am guessing that's why 406 hand-curated variants were used in the previous version of ReMM. If my assumption is correct, there shouldn't be a lot of variants for Mendelian disease. How many variants are found to be positive in the entire genome?
  
  In the online application, the results are limited to 500; the rest cannot be seen or downloaded. It would be better to allow the user to download the entire results.
  
  The authors performed a comparison with other tools and generated ROC curves, which depend on knowing the true positives. There is no description of the dataset that was used for the comparison. Did the authors make sure that the training variants were not used for the comparison?
4. GigaScience 02 May 2023
  
  in GigaScience
  
  Motivation
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad024), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.03.14.484240v2
www.biorxiv.org

The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics

2
1. GigaScience 02 May 2023
  
  in GigaScience
  
  Background
  
  Reviewer 2: Ben Woodcroft
  
  Cornet et al have generated a collection of NextFlow pipelines which provide a pipeline to analyse data associated with genome or raw sequencing data of microbial organisms and protists. The methodology appears sound and reproducible. My main concern with the manuscript is that it is not well described in the abstract, introduction or GitHub repository. It isn't clear whether the analyses are specific for genomics questions arising from culture collections, or if it is more broadly applicable. There is also no discussion about other pipelines which achieve similar things e.g. ATLAS https://metagenome-atlas.github.io/
  
  I also had a number of minor concerns, detailed below.
  
  A number of grammatical errors detected, these should be fixed. Parts of the manuscript are also slightly too informal e.g. "This confirms the interest of using ORPER to spot interesting SSU rRNA sequences"
  It would be helpful if the GitHub front page could provide a concise description of what the software aims to achieve, to make its use more understandable.
  106: "as it happened" grammatical error
  "Assembly.nf" Commonly assembly is a separate process to binning, but here binning has been included. Perhaps a clearer name might be Genome-recovery.nf ?
  124: "Researchers interested in a better understanding of these tools can read the recent review on the detection of genomic contamination made by Cornet et al. [15]." While not inappropriate, this is perhaps too much self-citation. Why is contamination assessed but not completeness?
  129: "annotation of bacterial proteins is automatic" Automatic in what sense? Annotation also refers to describing the function of the protein usually, but here the meaning appears to be restricted to ORF calling. I found this somewhat confusing. Also "in the different GEN-ERA workflows" is unclear - does this mean that prodigal is run as part of the Assembly.nf workflow for instance?
  143: "Orthology.nf automatically provides the core genes, shared by all the organisms in unicopy" what is meant by "all organisms" here?
  145: "The OGs of proteins can be further enriched" what does "enriched" mean?
  163: GTDB.nf is described in the "Other workflows" section, when it is phylogeny-related.
  172: "it was technically not possible to include Mantis in a container" I am curious as to why this was the case? I do not have any specific insight or ability to judge the accuracy of this statement, just curious. Inclusion of a sentence describing the difficulties might help other workflow developers and/or the Mantis developers.
  190: "Gloeobacterales are the most basal order of the Cyanobacteria phylum" This statement is somewhat controversial, because the GTDB has defined the Melainobacteria as being a part of the Cyanobacteria phylum based on RED values. I would suggest removing "the most basal" or making it clear that cyanobacteria refers to photosynthetic cyanobacteria rather than the phylum.
  189: The methods for this section are not described in the methods section. They are only briefly described in the Findings section. A clearer link to these methods should be made from the maintext and methods.
  212: Showed -> show.
  215: "estimate the sequencing level of the order" it isn't clear what meaning this has.
  224: "Our results demonstrate the absence of one metabolic pathway" There are many metabolic pathways, presumably it is missing more than one.
  233: "examples of the practical usage of the GEN-ERA toolbox are available in Supplemental File 1." this does not make it clear that this refers to the methods for this specific example.
2. GigaScience 02 May 2023
  
  in GigaScience
  
  Background Microbial culture collections play a key role in taxonomy by studying the diversity of their accessions and providing well characterized strains to the scientific community for fundamental and applied research.
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad022), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Shakuntala Baichoo
  
  Paper Title: The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics

  The GEN-ERA toolbox provides a number of containerized workflows to researchers (without any specific training in bioinformatics) to study the diversity of well-characterized strains for fundamental and applied research. More specifically, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to phylogenetic tree reconstruction. It additionally provides workflows for average nucleotide identity comparisons and metabolic modeling. The supplementary file provides details of how to run the whole workflow (through 10 steps), found in the GEN-ERA toolbox on basal, for an empirical dataset of early emerging cyanobacteria. It provides an up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria. The github repo located at https://github.com/Lcornet/GENERA also provides more details about the GEN-ERA toolsuite. Though in the manuscript it is mentioned that the call to Mantis could not be included in the Singularity call, on the github repo they have indicated that Mantis is now installed in a singularity container for the Metabolic workflow (install is no longer necessary). The tool has been tested on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria and the time taken as well as the results of the run are documented in the supplementary file. The authors claim that the toolsuite can be used to study the diversity of microorganisms, including bacteria and fungi. From the github repo, it is clear that a number of publications in high-impact journals have already resulted from the development of the GEN-ERA.
  
  1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? This study aims at describing a toolbox, named GEN-ERA, and the methods section defines the various steps of the toolsuite. Looking at the supplementary file and the github, it is easy to follow the manuscript. The versions of the programs used in the case study are provided in the forms of nextflow scripts.
  
  2) Are the conclusions adequately supported by the data shown? The results of running the toolsuite on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria, at each step, as well as the time taken to download the files and to run each step, are convincing that it works fine, at least for Cyanobacteria. But this is found in the Supplementary Material. There should be a section on Discussion and Conclusion in the main text.
  
  3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? The use of English language is adequate and concise and can be understood clearly by researchers interested in studying the diversity of micro-organisms.
  
  4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? The statistics involved in the phylogenetic analyses are integrated in the existing programs. Hence I am not able to assess the statistics.
  
  5) Final Comments The proposed toolbox/toolsuite described in this manuscript is very relevant and worth a read for researchers interested in studying the diversity of microorganisms, including bacteria and fungi, especially as it helps to facilitate their life through the use of well-defined containerized NextFlow workflows.
  
  I strongly believe that there should be a section on the Discussion of the results of running the toolbox for the case study and a Conclusion in the main manuscript. This will help readers in understanding the importance of the toolbox better.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.10.20.513017v1
Apr 2023
academic.oup.com

The Global Atlas of Bamboo and Rattan (GABR) Phase II: new resources for sustainable development

1
1. GigaScience 12 Apr 2023
  
  in Gigascience Annotations
  
  was launched in 2017
  
  See the announcement https://doi.org/10.1093/gigascience/gix046

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac113/6780306
www.biorxiv.org

Contamination detection and microbiome exploration with GRIMER

2
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Background
  
  Reviewer2-Raphael Eisenhofer
  
  Piro and Renard introduce GRIMER, a tool that automates microbiome-related analyses and creates rich, offline-supported report that can be shared with collaborators or hosted online. I think that they gave a great summary of the problem of contamination in the microbiome field, and clearly explain the gap that their software fills. They exhibit GRIMER on previously published datasets, which are available to view online. Overall, I'm very impressed with the dashboard: it looks great, is easy to explore datasets, and highly portable. I can certainly see myself using GRIMER on some of my future datasets, and I have no doubt that it can be a valuable tool for others in the field. I do however think that the documentation and usability of the tool can be improved, and I give some suggestions below. Addressing these issues will, in my opinion, lead to a wider adoption of the tool by researchers in the field.

  Usability:

  I managed to test GRIMER on a 16S amplicon dataset, but given the sparsity of the documentation, this took me a little longer than expected (in addition to quite a few steps), and I think that there are improvements that could be made to make it easier for people to use GRIMER from formats that people commonly generate.

  For example, QIIME2 is perhaps the most used 16S amplicon analysis pipeline, so the ability to import directly from .qza files (e.g. table.qza, taxonomy.qza) would give GRIMER much greater reach. If this is beyond the scope to incorporate within the GRIMER codebase, at least provide the exact code needed in the documentation for people to export their .qza files to files compatible with GRIMER.

  Likewise from phyloseq, a commonly used R package for microbiome analyses. Could some documentation/code be added about how best to export phyloseq objects to a format that GRIMER can handle?

  I mostly analyse shotgun metagenomic datasets (genome-resolved), and I foresee more users using these types of data in the future.
  Therefore, the ability to parse gtdb-tk outputs directly would be very helpful. Perhaps have a flag --gtdb that parses the 'gtdbtk.bac120.summary.tsv' and 'gtdbtk.ar53.summary.tsv' files.

  Following on from this, CoverM (https://github.com/wwood/CoverM) is quite commonly used for generating final MAG count tables (.tsv), so the ability to import them directly would be a really nice quality-of-life addition, and something that would not require much coding to accomplish.

  I believe that these adjustments will make the tool far more accessible for everyday users and increase the adoption of GRIMER by the wider community.

  For the actual report, if possible, I would like the ability to export ASVs/features/MAGs from the report that the user thinks are contaminants. This could be a list that the user could copy/paste, or the direct export of a .txt/.tsv. Perhaps the user could tick a box next to the ASVs/features/MAGs to save them to a list/viewer? The reason for this is that the logical next step I see after using GRIMER is to go back to your dataset and filter out the putative contaminant ASVs/features/MAGs. Being able to produce such a list will make subsequent filtering by the user easier.

  I couldn't get decontam to work with my dataset, here was the error:
  raise KeyError(f"None of [{key}] are in the [{axis_name}]")
  KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan],\n dtype='float64')] are in the [index]"
  I can post this as an issue on the repo if you'd like.

  Regarding the specification of negative and positive controls in the config.yaml, would it be possible for this to be implemented from the executable? For example, there could be a flag '--control-column' that specifies the column in the user's metadata file.
  '--control-column control' would parse the 'control' metadata column, and for cases where there are values 'negative', 'positive' assign them automatically. This is just a suggestion that could make it a bit easier for users to set control samples, rather than having to create a new .txt file and change the config.yml.

  Dependencies:

  When installing via conda, I ran into the following error:
  ImportError: cannot import name 'PearsonRConstantInputWarning' from 'scipy.stats'
  It seems that this can't be imported from later versions of scipy, but I managed to fix it by forcing scipy=1.8.1. You should be able to force this version in the conda recipe.

  Minor grammar:

  Line 16: replace 'perform' with 'performs'
  Line 50: 'found in the [9]'
  Line 56: replace 'as technicians body' with 'microbes from laboratory technicians'
  Line 60: I would remove the 'environmental' adjective here, as contamination affects all low-biomass samples.
  Line 63: I would use 'samples' in place of 'environments' here. You may also consider suggesting that some samples may even contain no microbial DNA. E.g. replace 'low amounts of' with 'little to no'.
  Line 64: Replace 'ideal scenario for an exogenous contaminants' with 'an ideal scenario for exogenous contaminants'.
  Line 72: perhaps consider referencing decontam here.
  Line 79: replace 'due to increase in costs' with 'due to the increase in cost associated with their inclusion'.
  Line 81: Consider referencing first author's last name, e.g. 'Moreover, XXX et al. [45] reported…'
  Line 88: remove 'outcomes'
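
  The '--control-column' suggestion in the usability comments above could be sketched roughly as follows. This is only an illustration under stated assumptions, not GRIMER's actual interface: the function name `controls_from_metadata`, the `sample_id` column name, and the tab-separated metadata layout are all hypothetical. The sketch reads one metadata column and groups sample IDs by whether they are labelled 'negative' or 'positive'.

```python
import csv
import io
from collections import defaultdict


def controls_from_metadata(handle, control_column, sample_column="sample_id"):
    """Group sample IDs by control label found in one metadata column.

    Hypothetical helper: the column names and labels are assumptions
    made for this illustration, not part of GRIMER.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Normalise the label so 'Positive' and 'positive ' are treated alike.
        label = (row.get(control_column) or "").strip().lower()
        if label in ("negative", "positive"):
            groups[label].append(row[sample_column])
    return dict(groups)


# Small in-memory metadata table standing in for the user's metadata file.
metadata = io.StringIO(
    "sample_id\tcontrol\n"
    "S1\t\n"
    "S2\tnegative\n"
    "S3\tPositive\n"
    "S4\tnegative\n"
)
controls = controls_from_metadata(metadata, "control")
print(controls)
```

  Samples with an empty or unrecognised value in the chosen column are simply left out of both groups, so ordinary samples need no extra annotation.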
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1-Gavin M Douglas
  
  Piro and Renard present GRIMER, which is a bioinformatics tool for summarizing microbiome taxonomic data in various ways, with the main purpose of identifying putatively contaminant taxa. The authors convincingly argue that there is great value in looking at several different aspects of a dataset when determining which taxa are potential contaminants. I think this tool could potentially be very useful for the field, but I think at the moment there are several places where users might be confused and perhaps be overwhelmed without more documentation.

  The main point of confusion I'm concerned about is regarding the "common contaminants". It's not convincing that you can just classify a taxon as a contaminant regardless of what environment is being profiled. Also, under this approach, if a taxon is identified once as a contaminant in an earlier study, would it then be classified as a contaminant in all datasets processed by GRIMER? This would mean that a lot of high-abundance taxa in certain environments would be wrongly thrown out. For instance, you can imagine high-abundance taxa on the human skin might be more likely to be contaminants during sequencing preparation, but of course many researchers are very interested in profiling the skin microbiome. I think the authors realize this, but I'm concerned that typical users may not appreciate this point. I think explicit discussion of this point in the discussion is needed and also an example of how this might look in practice (e.g., if skin microbiome samples were input to GRIMER, as part of a larger tutorial that could be online [see next point], would help avoid this mistake).

  The authors do a great job of walking through some results in the text, but more documentation is needed for the reports. The authors should include a basic tutorial that provides example input files and then walks through each individual tab. This could be done all through text with screenshots of the GRIMER, or perhaps with a video tutorial.
  In addition, for someone just opening the example reports, I'm sure they will be wondering what data was produced by GRIMER (e.g., they might wrongly think GRIMER did the taxonomic classification) and what data was needed as input.

  The authors should expand on how the correlation step is used to identify contaminants. There is great interest in identifying clusters of co-occurring taxa, so identifying a cluster of 9 genera in Figure 5 doesn't seem like evidence of contamination to me. Perhaps it is when considered with other lines of evidence though, but this should be made clearer. Currently this legend implies that it alone points to reagent-derived contamination.

  The figure text needs to be increased in size. Using more panels split across additional rows and removing unnecessary info (e.g., not all control categories need to be shown in Figure 1) would make these figures easier to interpret. I realize that you were hoping to use the raw GRIMER figures, but based on the current display items it does not seem like they are publication ready.

  The acronym WGS generally refers to "whole genome sequencing" (i.e., for single isolate organisms) not "whole metagenome sequencing". The standard acronym for the latter case would be "MGS", for "metagenomics". Also, the term "shotgun metagenomics sequencing" is most commonly used in this context, I've never come across "whole metagenome sequencing" before. Either way, "WGS" will mislead casual readers with the current usage, so this should be changed on your website and in the manuscript.

  The taxa parsing capabilities sound like they will save a lot of tedious, manual data mapping!
  Just checking - how does it perform with new taxa names / typos?

  Text edits

  L11 - "are challenging task" should be "is challenging"
  L12 - can remove "by design"
  L12 - "helping to" should be "to help"
  L13 - "can potentially be a source" I think should be "that could reflect"
  L14 - "evidences" should be "evidence"
  L13 + L14 - Unclear what is meant by "external evidences, aggregation of methods and data and common contaminant" - should be clarified
  L15 - "that perform" should be "that performs"
  L17 - "towards contamination detection" should be something like "to help detect contamination"
  L41 - "hypothesis" should be "hypotheses"
  L42/43 - "analysis can hardly be fully" should be something like "the required analysis is difficult to fully…"
  L56 - "technicians body" should be "a technician's body"
  L60 - "strongly affects environmental" should be "especially environmental," (note comma)
  L64 - "ideal scenario for an" should be "an ideal scenario for"
  L67 - "not to bias measurements and not to" should be reworded, possibly as: "to not bias measurements and to ensure that bias is not propagated into databases"
  L75 - "were proposed. They are " should be "have been proposed. These are"
  L77 - "among others" should be ", and others" (note comma)
  L79 - "increase in costs" should be "the required increase in costs"
  L88 - add "a" before focus
  L90, L196, L265, and elsewhere - "evidences" should be "evidence"
  L99, L104, L117, and possibly elsewhere - "analysis" should be "analyses" (when plural)
  L106 - "each samples/compositions" should be "each sample/composition"
  L110 - add "a" before taxonomy database and "the" before "DNA concentration"
  L132 - "specially" should be "especially"
  L134 - remove "a" before "the"
  L151 - add "of" after "thousands"
  L182 - "is" should be "are"
  L196 - "evidences" should be "evidence".
  And rather than "Evidences towards" it would be correct to say "Evidence for" or "Evidence supporting"
  L208 - add "the" before "overall"
  L246/247 - "generated several studies and investigations" should be something like "motivated several investigations"
  L248 - should be something like "from the maternal and fetal sides"
  L279 - remove "a"
  L280 - Add "the" before "Jet"
  L284 - capitalize "Qiita" and re-word "Pick closedreference OTUs with 97% annotated with greengenes taxonomy"
  L293 - Should be "Furthermore" rather than "Further"
  L295 - I think it should be "with low and high human exposure, respectively"? Or do you mean they both have highly variable exposure?
  L297 - "could be a also an" should be "could be driven by an"
  L300 - "against" should be "and"
  L304 - "correlated genus" should be "correlated genera" (and in other cases, such as in the Fig 5 and 6 legends, where "genus" should be plural version, i.e., "genera")
  L305 - "Such pattern" should be "Such a pattern"
  L307 - Should be "groups" rather than "organisms groups", or just "genera" as I believe each is a genus
  L313 - Remove "a"
  Fig 5 legend: "point" should be "points"
  Fig 6 legend: "taxa is abundant" should be "This taxon is abundant" and "inversely correlate" should be "inversely correlated". "a contamination evidence" should be "potential contamination"

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.06.22.449360v3
www.biorxiv.org

The Australasian dingo archetype: De novo chromosome-length genome assembly, DNA methylome, and cranial morphology

4
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  there
  
  Reviewer4-Madeleine Geiger
  
  This well written study integrates different approaches and methodologies to tackle the still obscure nature and origin of the dingo and its sub-populations by thoroughly characterising and comparing an "archetype" dingo specimen. I have read and commented on the abstract and the introduction, as well as the morphology related parts of the methods, the results and the discussion. The methods of morphological comparison, as well as their description and the reporting of the results are sound. However, in some sections it is difficult to comprehend the results and their interpretations, as well as the significance and nature of the suggested "archetype" specimen Cooinda. I therefore made some suggestions for additions and edits to the text and the figures, which hopefully help to increase comprehensibility and consistence of the text (see my comments below). I could not check and comment on the raw data because the links to the supplement given in the manuscript (figshare) do not work. Sorry if I'm stating the obvious here, but to be able to access the raw data is particularly important if the described dingo should act as a reference archetype. L. 74: Add «of the dingo» after "ecotypes": "[…] compare the Alpine and Desert ecotypes of the dingo […]". Otherwise it's not really clear what this is about. L. 91: It's unclear to me what you mean by "this female". I would suggest to exchange this expression with the previously used name of the animal. L. 94 ff.: The conclusions do not really fit to the rest of the abstract, specifically the aims as stated in the beginning. What I read from the "Background" section is that this work is about defining a "dingo archetype" via different approaches (genetic and morphological). The conclusion, however, is centred around the individual Cooinda. I would suggest to open up this section, to also make conclusions concerning the previously stated aims of the paper. L. 105 ff. and L. 369 and L. 508: A very nice opening! 
However, I feel that there is a somewhat misleading interpretation of the domestication process as a discrete trichotomy: wild > tamed > domesticated, when in fact domestication is a continuum with various stages in between the two extremes of the "wild" and the "intensively bred". There are various forms - even today - of "half-domesticated" populations, such as e.g., many of the Asian domestic bovids, or the reindeer. Thus, I would strongly argue that the dingo - although special due to the almost complete lack of human influence on its evolution in the last millennia - is not the only link between the "wild" and the "domesticated". See e.g.: Vigne, Jean-Denis. "The origins of animal domestication and husbandry: a major change in the history of humanity and the biosphere." Comptes rendus biologies 334.3 (2011): 171-181 L. 117: How do you define "large carnivore"? And: Are dogs more numerous than cats? I don't know the tallies overall, but in many parts of the world domestic cats are more frequent than dogs. L. 120 - 121: I think this sentence does not contribute to the manuscript and I would suggest to delete it. I also think that these are not the usual characteristics to discern the wolf from other canids. 123 - 125: I do not understand this distinction. In my opinion, the dingo could well be both, a tamed intermediate between wolf and domestic dog AND a feral canid. If I understand the current view of dingo evolution correctly, the dingo most probably constitutes an early domestic stage of the dog, which became feral. L. 150: I do not understand the reference to Figure 1 at this point. If you want to keep the figure reference at this place, I would recommend to extend the legend in order to be more descriptive about the significance of this individual dingo. Also: Is the question mark on purpose? Intro and Results in general: Cooinda is central for the research question and the paper. 
However, I do not really understand her position and significance right away from the text. Maybe this is just a matter of sequence of the paragraphs (some information is given at the beginning of the methods section at the end of the manuscript), but I think it would be crucial to introduce and explain Cooinda and her role (as kind of a reference "archetype") for the aims thoroughly already early on, preferably in the Introduction. This would e.g. also include: why of all the dingoes in Australia is Cooinda an appropriate choice to function as the "archetype". Further, it would be helpful to maybe have a figure showing the geographical distribution of the compared populations (alpine and desert, as well as Cooinda's origin) to better understand the setting. L. 320 ff. and Figure 5: Would it be possible to add a visualisation of the shape changes described in the text into the figure? It is otherwise impossible to evaluate these shape changes. L. 328 - 345: It would be interesting to pursue the variation along PC2 further: Do you maybe have information from the raw data if specimens of both the alpine and the desert group that were found to have particularly low or high values for PC2 are especially young and female, or old and male? In other words, do you find evidence in the dataset that there is an actual age and/or sex gradient along PC2? And what age was Cooinda when she died? L. 347: As also pointed out below, it would be important to note somewhere if these two specimens died at about the same time and/or were similarly treated (because of brain shrinkage in specimens that were frozen or otherwise fixed for a long time). L. 472: I would suggest to rewrite as: "Cooinda's brain was 20% larger than that of a similarly sized domestic dog […]". Further, I do not agree with the rest of the statement in this sentence. 
One of the hallmark characteristics of domestication is brain size reduction, which might be the result of selection for tameness (which you also describe later on). However, selection for tameness (an evolutionary process within a population) is not the same as taming (on the level of the individual). I would therefore suggest to re-write this sentence. Further and in general concerning the brain size part of this study: It would greatly increase the significance of this part of the work if you would compare the dingo brain size not only to one domestic dog, but set it into a larger context. There are plenty of published references for wolf, domestic dog, and dingo brain size estimates and it would be enlightening to compare your findings with those. Of course, there are methodological issues, but maybe a meaningful comparison is possible for some of them. For this I could recommend this review article: Balcarcel, A. M., et al. "The mammalian brain under domestication: Discovering patterns after a century of old and new analyses." Journal of Experimental Zoology Part B: Molecular and Developmental Evolution (2021). L. 483: Many of the surviving populations of re-introduced (i.e., feral) domestics were part of a fauna that did not correspond to the one of their wild relatives, but was somehow characterised by reduced predation or competition. This was certainly the case for the dingo (few other large predators in Australia) and for some island populations. Maybe you should double-check if this is really the case for the provided examples, but maybe it would be better to write that brain size reduction persists in feral populations at least under certain circumstances. L. 527: Why is it important that the reference dingo is a female? Please explain. L. 535 ff.: Please explain the significance of these special characteristics. Why and how are they special and important for the current study? 
Also: I'm not a native speaker, but I have the impression that some of the sentences in this section are a bit unusual. Please double-check the grammar. L. 739: What do you mean by "below" in the brackets? L. 741: Is this the right figure reference? I do not find this figure. Do you mean supplementary Figure 9a? 744 - 745: Could you briefly explain in one sentence the nature and number etc. of landmarks used in this reference study? (For those who cannot check the referenced work.) This would be quite important to be able to interpret the results. L. 744: Delete "earlier". L. 755: Could you briefly explain here if these were freshly dead specimens, or if they were already older (e.g. frozen, stored in a liquid etc.) please? This has some implications on brain morphology and size. L. 784 ff.: The figshare-links don't work. L. 884: I would suggest to re-write the sentence like this: "This was required because the brain was removed immediately after death, which caused some damage to the braincase." Supplementary Figure 9c: It's hard to match the reds of the convex hulls with the reds of the legend. Would it be possible to write down the names right next to the corresponding convex hulls? L. 895: Position remains the same relative to which other analysis? Maybe make a reference to text and/or figure (I guess Fig. 5) here.
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  functional
  
  Reviewer3-Sven Winter
  
  The manuscript " The Australasian dingo archetype: De novo chromosome-length genome assembly, DNA methylome, and cranial morphology" does describe a de novo genome assembly of the Alpine dingo based on PacBio, ONT, 10X Genomics Chromium, BioNano, and Hi-C. Furthermore, it describes cranial morphometrics and methylation patterns to describe an Alpine dingo "archetype". The methods used seemed overall sound, yet, the writing was often confusing and unclear, so it was difficult to understand what was done and why. The writing, in general, is my biggest criticism of the manuscript, so much so that I was wondering if the authors, by accident, uploaded an earlier version of it. Throughout the manuscript, multiple writing styles and skill levels are evident, and I am sorry to say that it seemed as if the manuscript was copied together from different sources written by the different coauthors rather than a coherent manuscript. Some of the Figure Captions have superscript numbers to highlight individuals and at the same time proper labels that make them obsolete. I first thought they were remnants of footnotes in a previous version. Unfortunately, the methods section, for me one of the most important sections, needs serious improvements. I am not a big fan of having the methods at the end of the manuscript (I know that is how GIGAScience likes it), especially when some methodology is mentioned in the results in a way that you must look up the details in the methods section to understand it. Unfortunately, that is the case with this manuscript. As there are so many paragraphs in this manuscript that need some improvements, I can only focus on some of them in this review but encourage the authors to have a careful look at the whole manuscript before resubmission, as in the current state, I would not recommend it for publication. 
Detailed Comments: Abstract: The abstract is overall too long and needs to be much more concise, e.g., the discussion on taxonomic designation (L75-78) should be part of the discussion section but not the background paragraph of the abstract. L91 "this female" which female? Introduction: I am missing a short review of the taxonomy of the dingo, mostly with respect to the dog or wolf. I know you do not want to draw taxonomic conclusions from this study, but a short review of what others proposed would be helpful. Also, even though everybody knows what a dog is, scientific taxon names are a requirement in scientific writing and should be added at the first mention of any taxon (e.g., dog, gray wolf, dingo, etc.). L113: intermediate in what sense? Morphological, behavioral, ecologically? L123: Please explain in more detail why a type specimen is needed, especially, as I would argue, that a population-level genomic, morphological and behavioral study would be better to answer if dingoes are feral dogs or an intermediate form instead of a single individual type specimen. L139: reevaluation L142: this can be more concise, e.g., "Zhang et al. (17) found evidence for a separation of Australian dingoes into a northwestern group and a southeastern group, clustering with New Guinea Singing dogs" L150: remove "?" L151-154: This belongs in the discussion. L153: "… being characterized. However, we suggest …." Results: L161: Please consider changing it to chromosome-scale or chromosome-level genome assembly, which is much more common. L162-170: This whole paragraph is a short summary of the methods and does not include a single result. I know it sometimes is nice to recap the methods but in this case I do not see the need for it. Or at least it can be shortened even more to something like: "The final assembly after hybrid long-read assembly, polishing, and scaffolding has a total length of 2,398,209,015 bp …." 
L163: I'll highlight it again in the methods, but Supplementary Figure 1 shows 18 PacBio SMRT cells were used, but the methods say 15. L164: please be more precise: were they PacBio CCS or CLR reads? L167: Please provide Supplementary Figure 2 with a better contrast allowing us to see the high and low contact density on the centre of the scaffold squares. L172: ungapped is not a term I would use; instead, I prefer to refer to the assembly as scaffolded or scaffold-level if it is in scaffolds with gaps and contig-level if the scaffolds are split up into contigs for statistics or analyses. In this case, I would only state the total scaffolded length and maybe the amount of N's or gaps. Also, the second sentence would be better combined with the first, e.g., "The final assembly had a total length of 2,398,209,015 bp in 477 scaffolds and a scaffold and contig N50 of 64.8 Mb and 23.1 Mb, respectively." L174: What does full-length mean? L175: please reference the dog genome properly with the accession number and reference if available. L176: Please rewrite. There is something not quite right with the bracket and the following remaining sentence. L178: Carnivora_odb10 L182: Please check the manuscript and the supplementary data for consistent spelling of Cooinda (or Cooindah). L184: what does "were full-length by BUSCOMP" mean? Please give more details here on which basis this is determined and what it means that the two other genomes had a few more. Also, I am not sure if you have to repeat the list with canine assemblies if you have them properly listed in the methods. Again, that's why I prefer to have the methods before the results. Table1: Again "ungapped" does not sound right. Please consider changing it to be clearer. As a general side note, when you want to compare two assemblies of different assembly length, it would be better to compare NG50 instead of N50. 
I doubt that in this case, with only a 40-50Mb difference, it would change the results much but consider adding NG50 values. Number of gaps is also not very clear, as the gaps can be of different sizes and can be of a determined length or a standard number of N's as a placeholder for a gap of unknown length. L198: "to align Alpine dingo long reads to the Desert dingo assembly" seems not to fit here. Please check the sentence structure and rephrase. L199: "These plots show low variation on the X chromosome" More context is needed. Low compared to what? Why are the results only so briefly mentioned after multiple lines of "methods"? This is an issue I see throughout the results. There are barely any results and mostly method summaries. Figure 2: Explain what the plot shows. I am, in general, not a big fan of these multi-layer Circos plots as each individual plot is way too small to show much. However in this case the lower amount of SVs on the X chr is visible enough, but the caption needs more details. L211: Why list a reference for something that is a result of this study? Supplementary Fig. 5: Each chromosome is too small and the resolution too low to see details of the SVs. L217: So why is that important to mention? If there is no further reason I would remove it, it does not add to the story. L226: "In addition, however, we also found …" → change to "In addition, we found… " or "We also found …" L227: Consider joining the two sentences: "We also found two structural events on Chromosome 26 (SFig. 6) containing mostly short genes…" L227: What does perfectly conserved mean? L228-229: Why not show it? L230-232: This can be more concise and easier to read, for example: "The Alpine and Desert dingo both have a single copy pancreatic amylase gene (AMY2B) on Chr 6. However, only the copy in the Desert dingo includes a 6.4kb long LINE." I am not sure why reference 10 is cited twice here in the results. Is this a result already known before? 
If so, this belongs in the discussion. L233: Again, the whole section is a short methodological summary, and there are absolutely no results. Figure 3: The figure caption needs to be rephrased completely. Not sure how this ended up here, but "NOTE: A and C as well as B and D are similar plots. However, A and B use SNVs while C and D use indels." really does not belong in a proper caption, especially as each plot is listed before stating if it is based on SNVs or indels. Bootstrapping usually does not need to be explained in a figure caption. Instead, it would be more important to mention what type of phylogenetic tree it is and on how many SNVs it is based. There is also no scale on the trees, does that mean these are pure cladograms? For B and D, please explain what an ordination analysis is and change the axis labels to something meaningful. Labeling the x and y axis "Axis 1" and "Axis 2" is absolutely pointless. I am quite surprised that this passed the final ok from all co-authors. L260: please use Desert dingo and not Sandy. L263-265: Not sure why this is important here if it is not discussed later. Figure 4: Again, the figure caption needs a complete rewrite. For example, L281 "dingo Sandy is in this clade" is very unclear and confusing; "In this figure," → remove!; What are the superscript numbers for? Please remove them, they look like they belong to some footnotes from an earlier manuscript version, which are now missing. Instead, important info such as the type of network, the meaning of the small lines, and the scale are missing. Methylome: L295: I would remove "the" before MethylSeek and a period is missing before UMRs. L297-299: I think here it would be very nice to not just mention that there are other studies but give some examples and comparisons. "These analyses" could either refer to the MethylSeek analyses or the analyses of reference 55, please rephrase to be clearer. 
Also, it is unclear what previously reported numbers mean, again give more details ("… in line with previously reported numbers of promoters and enhancers in, e.g., humans (promoters xxxx, enhancers xxxx), mouse (xxxx), and rat (xxxx).") L301: what does "we lifted over the former UMRs to the latter genome" Please rephrase, it is very unclear to me what you mean. L302: why was average DNA methylation calculated for UMRs. Should they not be unmethylated by definition? L306: Why have a sentence about that a gene is highly conserved but not perfect and then give the percentage of identity instead of just stating that it is 99.8% identical? I have now mentioned quite a few examples where the manuscript could be much more concise. I cannot list them all but would encourage you to read through the manuscript again and make it more concise. Morphology: My knowledge of morphological analyses is limited, but, despite the unfamiliar terminology, this section of the manuscript is easy to read and focuses in a more concise way on the actual results. My only suggestion would be to label Supplementary figure 9a with the different morphological features mentioned or adding an additional schematic to the supplementary, so non-morphologists can easier follow. L353-354: I would suggest adding the sizes after you mention the individual to avoid repeating dingo and dog brain. For example: "… the dingo brain (75.25cm3) was 20% larger than the dog brain (59.53 cm3) (Figure 5B)." Figure 5: Please add an explanation of what the polygons in 5A represent. Also, consider changing the labels in 5B. I assume LHS and RHS are short for the left-hand side and the right-hand side. This, for me, is usually used to describe positions in unlabeled figures. I would suggest changing it to Cooinda dingo (CD) and domestic dog (DD). Discussion L375: It is not clear why Cooinda should be considered the archetype at the beginning of the discussion. 
I would place this in the conclusions and base it on the results and the discussion. L394-395: Please rephrase and shorten, e.g.,: "There is a single copy of AMY2B in both dingo genomes; however, they differ by a 6.3 kb retrotransposon insertion present in the Desert dingo." L394-405: I would like to see a more in-depth discussion on the differences between wolf, dingo, and dog. If there is no LINE in the wolf but both in the dingo and the dogs, when did the transposition happen? Could be two independent events in the dog and dingo lineages or one in the ancestral lineage. Are the LINEs in dog and dingo at the same position in the gene region? Could it be the same insertion that was reduced in length in the dog lineage, and what does that mean for the evolution of dogs and dingoes? L431/432: please use Alpine and Desert dingo instead of the individuals' names. L471-473: Not sure if a single sample (Cooinda) is sufficient to come to this conclusion, also how does it compare to the wolf? She could just have been a dingo with an exceptionally large brain. I think a more in-depth discussion is needed. Methods: Overall, the methods need to be more concise but at the same time clear and complete. L530-531: Why is solving the puzzle-box experiment important? Does that not suggest an exceptionally intelligent dingo if she was the only one, and could that not potentially explain the large brain size? How does brain size and intelligence or the potential to solve the puzzle-box correlate? L532: her brothers L533: What is the importance of the ginger color? As it is stated here, it is a bit out of context. Why is it important? L535-542: Why is this detailed report on her appearance of importance? I am often missing logical connections in the manuscript. L541: I would usually not expect to read such a statement with an emotional connotation in a scientific manuscript. L545ff: When were the samples taken? What type of samples were taken? How were they preserved? 
As it is stated that fresh blood was used, I assume Cooinda was still alive at that point. Are there any sampling and ethics permits to be mentioned? L552-56. This whole section about the pulse-field electrophoresis can be much shorter without losing any information, e.g., "Molecular integrity was assessed by pulse-field gel-electrophoresis using the PippinPulse (Sage Science) with a 0.75% KBB gel, Invitrogen 1kb Extension DNA ladder (cat ….) and 150 ng of DNA on the 9hr 10-48kb (80V) program." L556: What libraries? You have not explained how the libraries were prepared. CLR or CCS? L557: Which Sequel platform was used? Sequel I, II or, IIe? L558: remove the hours of movies, that does not matter unless you used a custom sequencing program. L559: AS 15 SMRT cells were used, I assume the sequencing was performed on the Sequel I. L561: I usually avoid starting a section or paragraph with "for". Please consider rephrasing as you start most paragraphs that way. L564: 119 ng of library, especially for long DNA-molecules, seems very low for a decent ONT run. I have mostly used the MinION, and I would usually only load a library with so little DNA if I only needed a few reads. I am just curious how well that worked on the larger PromethION flow cell. L573: In some sections, this manuscript reads like an early draft that was accidentally submitted. "User Guide, manual part number CG00043 Rev B." Please rephrase. L575: For me personally, it does not matter where Qubit measurements were taken, but if you include that info, please try not to repeat it as you did in L578. L576-577: Please shorten the two sentences about sequencing to one, e.g., "Sequencing was performed in 150bp paired-end sequencing mode on a single lane on the Illumina HiSeq X Ten platform with a version 2 patterned flowcell." L581-582: Does reference 8 use the same protocol version? If so, I would remove the brackets. If not, Is there no version number of the protocol available? 
L594-601: Why is this a mixture of insufficiently described methods and results? Please give additional information about trimming and assembly using canu. Why mention the number of sequences, bubbles, and unassembled sequences in the methods? L597: How were the reads aligned to the assembly? I have not used Arrow but if the pipeline uses mapping tools, please mention them. L598-601: These are results and should not be part of the methods. L614: what does finishing mean? In the literature, it is more common to write "manual curation of" or scaffolds were "manually curated" or "manually edited". L615ff: Again, these are results. I would not place them in the methods. L621: Was gap-closing performed only once? Were PacBio and ONT reads combined in one iteration of gap-closing? Maybe PBJelly suggests only using it once, but in my experience, gap-closing can be performed iteratively to further improve the contiguity. L622-623: Again, results in the methods. L634: Why is it important when the chromosome mapping was completed? You did not specify when the sequencing was performed or when the samples were taken. L635: Please add the accession number and, if available, the reference. L644: Circos is a tool for plotting data in a circular plot. How were the SNVs and indels identified? L647: X chromosome L652: I usually use GeMoMa for homology-based gene prediction. I would like to see a short description of the method rather than just linking to a previous publication. L658: "processes that produce differences" is not very precise, please give some more info here. I would usually remove indels from phylogenetic datasets due to the uncertainty of their mutational history. How were they coded and how were they analysed? L659: What is WA distance? Reference? L660: The Glazko et al. reference is quite out of context here. Better phylogenetic properties than what? Why use distance-based phylogeny? How many SNVs were used? 
How were they filtered? L662: Maximum parsimony is not frequently used anymore for phylogenetic reconstruction. Why not use a Maximum-likelihood or Bayesian approach? L664: It is mentioned that the wolf should be the outgroup, but the dataset itself is not mentioned in the methods. List all samples that were included in the phylogeny. If the dingo is assumed to be an intermediate between wolf and dog, why did you not use a different canine as outgroup to avoid bias? L665: include version and the URL of the tool if there is no paper to cite. I cannot judge the methods for Methylation and Morphology, as this is not my expertise, but these method sections read very well and seem clear to me. Availability of supporting data: There are quite a few broken sentences and misplaced periods in this section. Please check the text again and make sure that the links to your datasets are functioning. Overall, the presented data are interesting and a valuable resource, but the manuscript itself needs some major improvements to make the interesting results available to the reader in an easier-to-follow and more understandable form. It is obvious that multiple authors with different expertise worked on different sections of the manuscript. The challenge during the revision is to bring it together into a concise and easy-to-read manuscript with a consistent writing style. I hope that my comments, questions, and suggestions can help in that process. Please take my writing suggestions as such, feel free to adjust and change it in a different way as long as the result is more reader-friendly and more concise.
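The Table 1 remark above, that NG50 is the fairer statistic when two assemblies differ in total length, can be illustrated with a minimal sketch. The function name `nx50`, the toy scaffold lengths, and the genome-size value are hypothetical and are not taken from the manuscript; only the standard definitions of N50 and NG50 are assumed.

```python
def nx50(lengths, total=None):
    """Length L such that sequences of length >= L cover at least half of `total`.

    With total=None the assembly's own length is used (N50).
    Passing an estimated genome size as `total` gives NG50.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the sequences cover less than half of `total`


# Hypothetical toy scaffold lengths, for illustration only.
scaffolds = [80, 70, 50, 40, 30, 20]
n50 = nx50(scaffolds)               # denominator: assembly length (290)
ng50 = nx50(scaffolds, total=400)   # denominator: assumed genome size (400)
print(n50, ng50)
```

Because NG50 uses the same denominator for every assembly of the same species, it does not reward an assembly simply for being shorter, which is the point of the reviewer's suggestion.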
3. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Background
  
  Reviewer2-Jack Tseng
  
  My evaluation of the manuscript was restricted to the geometric morphometrics (GM) section. The authors seem to have followed a standard GM procedure in their analysis of cranial shape differences among dingo skull samples. My only suggestion is that additional detail be provided in the landmark data collection for the GM analyses: -Reference 58 was cited as the source of the landmarks used in this study, but no other details are provided. A list of landmarks that forms the basis of the geometric morphometric analyses should be presented in order for the reader to fully interpret the PCA plots.
4. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1-Andreas Chavez
  
  The article "The Australasian dingo archetype: De novo chromosome-length genome assembly, DNA methylome, and cranial morphology" is a well-written and interesting study examining the evolutionary relationships between the focal species, the dingo, and related canids (both domestic and wild). This study uses an impressive amount of state-of-the-art genomic data and resources to produce a (chromosome-length) de novo assembly of the dingo genome. The approaches for assembling the genome are all adequate. The comparisons of chromosomal structural variation and methylation patterns with another dingo ecotype and other canids show interesting patterns of divergence that are potentially important regions of adaptive differences. I have two major concerns and some minor concerns for this paper. Major concerns: I believe this paper would be stronger if it contained analytical methods that addressed admixture between dingos and domestic dogs more explicitly. The authors state that admixture between dingos and domestic dogs (Line 453) is one of a few hypotheses that may explain phenotypic differences between the two dingo ecotypes. To evaluate this hypothesis with the genomic data, they rely primarily on phylogenetic analyses to explore the evolutionary relationships between the dingo, wolf, and domestic dog lineages. They show that the dingo lineages are outside of the domestic dog clade and that wolves are outside of the dog/dingo clade. Although it is probably true that dingos are a unique evolutionary lineage, phylogenetic analyses are not the strongest tool for assessing admixture and the contribution of genomic variation from different ancestral source populations. I would recommend using methods that would test the admixture hypothesis more explicitly. D-statistic tests (ABBA BABA test) and related tests would seem appropriate for this kind of data and sampling scheme. I also have concerns about the interpretations of brain size differences between dogs, dingos, and wolves. 
Although I am intrigued by the idea that domestication may have driven reductions in brain size and shape variation, I find it hard to not consider natural selection pressures in the case of dingos and wolves. The best scenario for testing the domestication-driven hypothesis would be if dogs, dingos, and wolves evolved in a common environment and domestication practices were the most notable differences between them. However, given that wolves and dingos in Australia evolved with different prey and habitats on different continents, it seems hard to me to not consider environmental adaptations as another important factor in the evolution of brain-size and shape variation. Minor concerns: Introduction: It took me a while reading deeper into the manuscript to understand what was meant by the name Cooinda. For a while, I thought it was the name for a dingo subspecies or ecotype. I would suggest including a brief section in the introduction stating that the genomic and morphological data in this study are based on a single individual named Cooinda and that there are questions about its ancestry and placement as one of the dingo ecotypes. Line 172: ")," should be replaced with ")." Line 371: I don't think it is necessary to say "The passing of Cooinda the dingo" Line 463: more is needed to finish the point of "will illuminate" Line 494-497: The role of venomous animals as barriers to gene flow is conceptually not clear and is not supported by the citations from what I can readily tell. Line 540: dewclaws? Line 541: "Regrettably" isn't necessary to include
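The D-statistic (ABBA-BABA test) recommended in the major concerns above can be sketched as follows. This is a minimal illustration of the standard counting formula D = (nABBA - nBABA) / (nABBA + nBABA) for four taxa (P1, P2, P3, outgroup). The function name, the toy allele patterns, and any mapping of P1, P2, P3 to dog, dingo, or wolf samples are assumptions for illustration, not the authors' analysis.

```python
def d_statistic(sites):
    """Patterson's D from per-site alleles ordered (P1, P2, P3, outgroup).

    The outgroup allele is treated as ancestral ("A" in ABBA/BABA notation).
    ABBA: P2 and P3 share the derived allele, P1 is ancestral.
    BABA: P1 and P3 share the derived allele, P2 is ancestral.
    Returns None when no informative site is present.
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p1 == out and p2 == p3 and p2 != out:
            abba += 1
        elif p2 == out and p1 == p3 and p1 != out:
            baba += 1
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)


# Hypothetical toy alignment columns, for illustration only.
toy_sites = [
    ("A", "G", "G", "A"),  # ABBA
    ("A", "G", "G", "A"),  # ABBA
    ("A", "G", "G", "A"),  # ABBA
    ("G", "A", "G", "A"),  # BABA
    ("A", "A", "G", "A"),  # uninformative
    ("A", "A", "A", "A"),  # invariant
]
print(d_statistic(toy_sites))
```

A D value near zero is expected without gene flow between P3 and either P1 or P2; in practice significance is usually assessed with a block jackknife over genomic windows, which this sketch omits.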

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.26.525801v1
www.biorxiv.org www.biorxiv.org

Cell type-specific interpretation of noncoding variants using deep learning-based methods

3
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  n hum
  
  Reviewer3-: Borbala Mifsud
  
  Gigascience - Cell type-specific interpretation of noncoding variants using deep learning-based methods. Sindeeva et al. have developed DeepCT, a convolutional neural network-based model that predicts sequence and cell type-specific epigenetic profiles from available epigenetic data. The novelty of the approach is that the model can learn unmeasured epigenetic profiles in a given cell type, if there is another cell type that has the target feature measured and shares one or more other epigenetic data types with the cell type it aims to predict in. The authors demonstrated that the framework works well and that the model learns both sequence context and cell type-specificity and used the model to predict which de novo variants, identified in the Simons Simplex Collection, have the highest effect in any of the cell types they studied. Focusing on one variant with high predicted effect in glial cells they suggested a mechanism, whereby the variant in a putative enhancer element within the SMG6 gene reduces FOS binding, which might affect SMG6 expression in these cells. I have a few minor comments to clarify the applicability of this model. Minor comments: 1. In Figure 2D the authors showed that adult heart and fetal heart cell state representations cluster together even though they did not share the measured epigenetic features. This is an interesting observation, however one of them had ATAC-seq data while the other had DNase-seq data which are highly correlated. It would be good to know how much this can be generalized to other cases. What is the level of correlation between two epigenetic features that is required for correct clustering of the cell states between two cell types that do not share epigenetic features? 2. In both Figure 2C and in Supplementary Figure 1, the 2D visualizations of the cell state representations show that some cell types cluster well together while others do not cluster at all. 
Even those that cluster well, like "Digestive", "Kidney" or "Muscle" cells have many cell types that do not cluster with the others. Apart from biological differences, could this be also reflective of cell types with lower quality epigenetic tracks? How much does the quality of the tracks affect the model? 3. Figure 3E shows that there are some points where the accuracy of the model is much higher when leaving out certain epigenetic tracks from the training of the model. Is that also related to quality of those data or is there a specific epigenetic feature where the model consistently shows higher accuracy when the feature is left out? 4. The authors used 1000bp for representation of the sequence, but the target sequence that is checked for overlap of the epigenetic features is only 200bp. Does the model learn from the additional 800bp? 5. For the cell state tail the chosen emb_length was 32. Based on Supplementary Figure 1, I assume this is due to the number of cell type groups expected, but it would be good to include the rationale in the methods. 6. For the GO term enrichment what background was used? I would expect that the nearest genes of all de novo variants found in autism cases would show enrichment for similar GO terms. 7. Pg.11 last line should be "FOS transcription factor binding" instead of "grinding".
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Interpretation
  
  Reviewer2-Yuwen Liu
  
  The manuscript entitled "Cell type-specific interpretation of noncoding variants using deep learning-based methods" interprets non-coding genomic variants by integrating single-cell epigenetic profiles with a convolutional neural network. The authors found that the CNN can capture cell type-specific properties and generate a biologically meaningful cell state representation by embedding the cell in the latent space. In general, the architecture of the convolutional neural network is novel and, to a certain extent, the model may be helpful for improving our understanding of genomic non-coding variant effects at the single-cell level. Major comments: 1. In Figure 1C the authors intended to quantify how often unmeasured epigenetic marks can be inferred from available profiles. It is true that epigenetic marks are correlated and sometimes co-located in the genome (Ernst and Kellis 2015). However, the connected graph is not strong and solid evidence for quantifying the predictive ability of the epigenetic marks. They should provide other compelling evidence or undertake more analysis. 2. The authors used an empirical p-value threshold to detect the peak positions along the genome. The definition of the peaks for each epigenetic mark is crucial for the whole study. At the least, they should plot the distribution of p-values and explain in detail why they chose an empirical p-value threshold of 4.4. Furthermore, the test results should be corrected for false positives. 3. Some epigenetic marks cover broad modified regions of the genome, and the 150 bp DNA sequence may not contain all the sequence determinants for such a broad peak. That may be why the prediction performance is poor for most of the epigenetic marks. 4. In Figure 3D and Supplementary Figure 2, the majority of epigenetic marks showed very poor prediction performance. 
The authors should discuss the potential biological reasons that lead to this result and perform some analyses to rule out these confounding factors. 5. The authors should scrutinize their data because they also use some epigenetic profiles from heterogeneous tissues, which are composed of different cell types. These heterogeneous profiles may weaken the predictive power of the convolutional neural network model and impair the interpretability of the model. 6. The authors only used SSC data to showcase their predictive power in pinpointing potential causal non-coding variants of ASD. I suggest using GWAS data from a wide variety of complex traits and diseases to generate a more thorough evaluation of the specificity of their prediction. Furthermore, the authors used predictions leveraging signals from 794 cell types in predicting non-coding causal variants for ASD. Including a large number of ASD-irrelevant cell types would likely bring strong noise and make the results hard to interpret. I suggest the authors mask the epigenetic marks of ASD-relevant cell types (treating these cells as if they do not have available epigenetic data), and then use epigenetic marks from other cell types to predict non-coding variants with high impact on epigenetic marks in ASD-relevant cells. Then use this new prediction to rerun Fig 4A and 4B. Achieving good performance with this new analysis would better demonstrate the core advantage of their new model, i.e., predicting cell type-specific non-coding effects using epigenetic information from other cell types. Minor comments: 1. The authors defined peaks as 150 bp genomic intervals; however, they use a 200 bp DNA sequence as the center when preparing the data for the CNN input. 2. The resolution of the figures should be greatly improved.
3. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1-Fangfang Yan
  
  In this manuscript, Sindeeva and colleagues describe a novel neural network-based algorithm, DeepCT, to cluster epigenetically similar cell types and infer unmeasured epigenetic features, which can then be used to interpret non-coding variants. The manuscript is well structured and well written, and it is potentially interesting to a broad readership. Yet the algorithm itself, as presented in the manuscript, lacks rigor and thoroughness. Major points: 1. Lack of comparison with competing methods. 2. As the authors state themselves in the results and discussion, the performance of DeepCT for some features is very low, such as H3K9 and H4K20 monomethylation. Could the authors add more discussion and explanation of this almost zero average precision? 3. The authors say "statistically higher" or "outperforms" in many statements but report no statistical test results. For example, on page 8, the authors write: "This analysis confirmed that average cosine similarity for embeddings representing cell types from the same tissue was significantly higher than for embeddings of randomly selected cell types". On page 9, "we note that this baseline has performance metrics substantially higher than expected in random (baseline AP=0.417)." 4. On page 8, the authors write "we show co-localization of muscle cells, as well as co-localization of digestive cells (Fig. 2C)". However, Figure 2C does not look convincing. Minor points: 1. Providing high-resolution, vector-friendly figures would help a lot. I can barely see the content of the figures in the current version. 2. A Jupyter notebook tutorial in the GitHub repo would help users to apply DeepCT quickly.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.12.31.474623v1
www.biorxiv.org www.biorxiv.org

ViReMa: A Virus Recombination Mapper of Next-Generation Sequencing data characterizes diverse recombinant viral nucleic acids

2
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Genetic recombinat
  
  Reviewer2-Fadi G Alnaji
  
  In this work, Sotcheff et al. provide a comprehensive and nicely written report about using the algorithm Virus Recombination Mapper (ViReMa) to identify and characterize different kinds of recombination events in different viruses. ViReMa was first reported - by the same group - in a separate paper (Routh et al, NAR, 2013) as a Python-based algorithm that, by accounting for the high-diversity nature of virus populations, can efficiently detect a wide range of virus recombination junctions within virus-derived Next-Generation Sequencing (NGS) datasets. In this paper, the authors describe a couple of important updates to the original algorithm that enable ViReMa to cope with the new technological advances in NGS, including the read length and the significant increase in NGS library size and NGS-based experiments. Notably, the authors implemented a powerful validation approach by challenging the algorithm with different types of NGS-based data containing various types of junctions from different viruses to highlight the contextual computational and biological connotations. Overall, the paper used a robust analysis method and sufficient controls to clearly demonstrate the capacity of ViReMa to detect different types of recombinant molecules in different NGS datasets and viruses with high sensitivity and specificity. I only have very few minor comments. Minor comments: 1) Since Fig 2E shows the gradual effect of the permissibility imposed by the error-density values, transforming the tables into figures, e.g. bar or scatter plots, could make the effect easier to see. 2) At lines 500-501, the authors found that the majority of reads mapped directly to the virus genome. Looking at the aligned read number, this dataset seems fairly large; I was wondering if the newly added function --Chunk could come into play in this scenario to speed up the analysis? 
If that is the case, then mentioning it might be valuable. 3) At line 478, the authors stated: "The 'Reads' columns describe the number of reads at each particular nucleotide position"; is this the average read number? 4) Typos at line 206 "red", and at line 397 "(NL4-3)".
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1-Diogo Pratas
  
  This article describes a pipeline (coded in Python) to detect and analyze recombination events of viral genomes using short-read FASTQ data. The paper presents some level of work accomplished by the authors. Usually, these types of articles hide numerous hours of coding and experimentation. Moreover, the authors present actual accomplishments that typically are unique architectural designs and important alternative ways to the area, including several results. However, many points require attention, namely: 1) This pipeline expects exactly a specific virus. Hence, it uses a specific reference. However, this reference might not be the most representative because of the recombination events. Although it may be appropriate for smaller recombination events, detecting large-scale recombinations may face substantial difficulties. Moreover, since it is not prepared to deal with more significant variations (without de-novo support), it is exclusively for targeted support. Therefore, the article could be more descriptive about this specificity. 2) The article states that the improvement is also inspired by the read length increase that NGS is bringing. Also, the reported depth coverages are very high. So, why not use de-novo assembly? For example, the de-novo assembly can be used to create scaffolds that can generate a reference sequence to be used afterwards by the aligners. Please comment on this. 3) About the use of artificial poly(A) tails to allow the mapper to align the reads, what happens when the read size is smaller than the k-mer hash of the aligner? Usually, repetitive A-sequence content appears in almost all samples because such sequences have lower entropy and a higher probability of being generated. Wouldn't this create ambiguity, especially when there are very high-depth coverages? Please comment on this matter. 4) What is the minimum read size allowed to be considered a valid read for downstream analyses? 
Are the reads collapsed (in the case of paired-end reads) or considered split? Although less probable, the trimming is fundamental for excluding "events" generated at the tips of the reads that very rarely overlap, depending on the nucleotide distribution. 5) Are the reads clipped above a particular depth coverage? This feature is especially critical in repetitive viral content, such as hairpins or poly(A) tails - removing mountains that become the most significant factor in sequence depth coverage. 6) Have some of these viruses been enriched for targeted capture? Please provide this information in the manuscript. In some parts of the article, the coverage depth is very high: 300'000 - is this 300000? The simulated data used this coverage, which may not be entirely similar to reality. Also, allowing lower depth coverage helps to understand how the pipeline behaves. Moreover, some aligners may have problems in older versions with these depth values. 7) It was unclear which types of duplications were flagged and if the pipeline covers them. 8) How does the pipeline deal with contaminants? 9) This article states that the pipeline works for viral sequences. However, the tests used do not include large genomes. What about larger genomes? Some larger genomes contain repetitive content that provides additional reconstruction challenges. Therefore, the benchmark could have an example of this nature. 10) While looking for recombination events, especially fusions with the host, what are the differences between sequenced viral integrations and fusion events at the analysis level? How do we distinguish both using this pipeline? Please comment on this. 11) The authors state that the pipeline provides accurate results. 
Regarding the calculation of accuracy values, there are several good practices recommended by many experts in the field: a) https://www.sciencedirect.com/science/article/pii/S1386653220304339 b) https://www.sciencedirect.com/science/article/pii/S1386653221000792 12) A discussion of existing pipelines in the area could guide the reader to other, sometimes complementary, solutions. See, for example: a) ASPIRE: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08649-8 b) TRACESPipe: https://academic.oup.com/gigascience/article/9/8/giaa086/5894824 c) V-pipe: https://academic.oup.com/bioinformatics/article/37/12/1673/6104816 13) Line 113: "in range a of plant" - please correct. 14) Line 120-121: Please rephrase. 15) There are several acronyms; perhaps an abbreviation list would improve the reading of the article. 16) Line 394: ART is defined as "antiretroviral treated," but this acronym overlaps with the ART simulator. Perhaps, in this case, adding another letter or changing it would remove the ambiguity. 17) Line 753-754: Reference 27 is missing at least the title, journal, and year. 18) Please consider adding ViReMa to Bioconda. 19) I tried to clone the repository from SourceForge, and it came out empty. I had to download the package manually. I faced some problems, perhaps because it was not easy to follow. Possibly, users may face the same difficulties, which may be an obstacle to using the software. Please consider providing an elementary example for running ViReMa (including a tiny read sample and reference along with the code and command description, and how to run the GUI). Please consider using GitHub in future.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.03.12.484090v1
www.biorxiv.org www.biorxiv.org

Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim

2
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Nanopore seq
  
  Reviewer2-Hadrien Gourlé
  
  Thank you for a great piece of software and a great article. A few minor points: l166-168: the clause about structural variants is unclear to me, and perhaps to the reader. Please consider rephrasing. l209-210: I understand that the number of mapped reads is inadequate for abundance estimation of ONT data, but k-mers should not suffer from the same problem, should they? The number of k-mers matching a genomic region (or a genome) will scale appropriately with read length. I therefore have a hard time understanding why k-mers are presented as problematic in the first sentence of the Abundance Estimation paragraph. l379: What do you mean by "pronounced"? Please consider rephrasing. l438: geomes -> genomes. Figure 2: panel 2 should have the same theme as the other subplots of the figure. Software comments: - I'd like to be able to run the examples present in the documentation out of the box: please add a link and instructions on how to download and unpack the zymo community. - Speaking of the zymo community, why do Campylobacter and S. cerevisiae have a different path than the other genomes in the examples? - If you plan not to update the pre-trained error models to a more recent version of scipy, please pin scipy 0.22.1 in the bioconda recipe, so that users can use pre-trained models out of the box. - Please make a new release of the software including pull request !67. - In a future version of Nanosim, I urge you to consider gathering all scripts into subcommands (i.e. nanosim simulate [--params] instead of simulate.py [--params]). I realise this is a big breaking change, but it is good practice and avoids polluting a user's PATH with many scripts. This change is in my opinion not required for the paper to be published, but something I'd like you to consider for a future release.
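The subcommand layout suggested in the last software comment could look roughly like the sketch below. It uses the sub-parser support in Python's standard `argparse` module. The program name `nanosim`, the `simulate` and `train` subcommands, and their options are illustrative assumptions, not the actual NanoSim interface.

```python
import argparse


def build_parser():
    """Build one entry point with a sub-parser per former standalone script."""
    parser = argparse.ArgumentParser(prog="nanosim")  # hypothetical program name
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical replacement for a standalone simulate.py script.
    simulate = subparsers.add_parser("simulate", help="simulate reads")
    simulate.add_argument("--number", type=int, default=1000,
                          help="number of reads to simulate (assumed option)")

    # Hypothetical replacement for a standalone training script.
    train = subparsers.add_parser("train", help="train an error model")
    train.add_argument("--reference", required=True,
                       help="path to a reference FASTA (assumed option)")
    return parser


def describe(argv):
    """Parse argv and report which subcommand would be dispatched."""
    args = build_parser().parse_args(argv)
    if args.command == "simulate":
        return f"simulate {args.number} reads"
    return f"train on {args.reference}"


print(describe(["simulate", "--number", "5"]))  # prints "simulate 5 reads"
```

With this layout only a single executable needs to be installed on the user's PATH, and each former script becomes a named subcommand with its own help text.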
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  ABSTRACT
  
  Reviewer1-Andre Rodrigues-Soares
  
  This manuscript is of very good quality - as such, my review is quite limited. I would like to congratulate the authors for the development of the software and its comprehensive and extensive benchmarking. On l. 64 - Given the range of samples that can currently be sequenced using Nanopore sequencing and the recent focus on short reads as opposed to the previous highlight given to long reads, this statement is out of date. I would recommend describing instead the current range of sizes that can be sequenced via Nanopore. After reading the manuscript, I am, however, left to wonder how the authors would approach the different error rates, namely those of different types of flowcell (R9 vs R10). It can be assumed that error rates of R9 sequences might indeed average 10% as stated in the manuscript, but with the advent of R10 flowcells (more recently R10.4.1) and the respective updated chemistries, the error rates have decreased significantly. One specific error rate of 10% as stated in the manuscript can't be assumed for the sequencing technology as a whole anymore. While I don't think this should be central to the development of the tool, I think this should be addressed in the manuscript in some way. I would also have liked to see distributions of PHRED quality scores in the simulated reads in the analyses conducted in the manuscript. Although the assembly and genome recovery statistics, namely in Figure 4, indicate these should have the expected distributions, I would have liked to understand how quality scores are distributed in the generated reads. If the two issues above are addressed in the manuscript, I will be happy to recommend its publication. I have no further comments to add, as the manuscript covers all other factors I would think could be worrying regarding a tool simulating Nanopore metagenomic reads.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.11.19.469328v1
www.biorxiv.org www.biorxiv.org

FAIR Data Station for Lightweight Metadata Management & Validation of Omics Studies

2
1. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Background
  
  Reviewer2-Sveinung Gundersen
  
  The paper describes the FAIR Data Station, which is a lightweight application written in Java that facilitates FAIR-by-design by allowing the collection of structured metadata from the first phase of a project. To this end, the authors have applied and extended the ISA metadata framework to form a core data structure wherein attributes from a library of 40 frequently used minimal information checklists can be placed. The FAIR Data Station contains tools for generating and validating Excel metadata files, as well as for conversion to RDF format and to a European Nucleotide Archive (ENA)-compatible XML metadata file for submission. General comments: The FAIR Data Station (FAIR-DS) seems to be a useful application to help life science researchers to collect and structure metadata according to the FAIR principles. The software is based on core community standards, ontologies and checklists. As for deposition databases, the software currently seems to only integrate with ENA, which, on the other hand, is a central deposition database. The three main contributions of FAIR-DS are, to my mind, A) the metadata schema that has been carefully constructed by the authors, B) the validation functionality of metadata against said schema, and C) functionality for conversion of validated metadata into RDF and deposition formats. There are, however, some architectural choices and technical limitations in the implementation that I have issues with and which make me uncertain whether the software shows enough "innovation in the approach, implementation, or have added benefits", as mentioned in the "Instructions for Authors" (https://academic.oup.com/gigascience/pages/technical_note). I would therefore invite the authors to address the following issues: 1. The authors state that "the FAIR-DS uses an extended version of the original three-tier Investigation, Study, Assay (ISA) metadata framework [https://isa-tools.org]". 
This leads the reader to think that the software applies the full ISA Abstract Model (https://isa-specs.readthedocs.io/en/latest/isamodel.html), which is not correct. Only the top-level objects and a few attributes are retained. It is also not clear why the authors have found it necessary to add additional, custom object types, such as "Observation unit", explained as "the "object" from which the measurements are taken". The ISA model includes an attribute "source material" which seems to overlap. The authors have also added "sample" as a top-level object, even though there is already a "sample" attribute in the ISA model. It is unclear to me what is improved by adding new object types and whether any such improvements will outweigh the obvious drawbacks that come with not following a community standard for the metadata schema. 2. The FAIR-DS makes use of Excel files as an intermediate format for collection of user metadata. While the feature set of Excel and its familiarity for most users are good arguments for its adoption, I miss a discussion of the fact that a commercial product is included in the core architecture of the system. FAIR principle I1 states that: "(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation". As Excel is only an intermediate metadata format, while RDF is used for the final output, the FAIR-DS does not directly break principle I1; however, I think the choice of a commercial file format does not follow the "spirit" of FAIR. I see no reason why CSV could not be included as an alternative to Excel, and the authors could recommend an Open Source application as an alternative for users who wish their entire software suite to remain in the Open Source domain. 3. The metadata schema is not represented in a standard schema format, such as JSON Schema, Frictionless table schema, or similar. 
Using a shared format for representing the metadata schema makes it possible to make use of general validation libraries (such as the ELIXIR Biovalidator: https://doi.org/10.1093/bioinformatics/btac195). Shared schema formats also allow for reuse of the schema in other contexts/software. In FAIR-DS, the metadata schema seems to be primarily represented in an implicit way in the Java source code that generates the Excel files as a secondary representation of the schema. Even though the FAIR principles might not directly include a recommendation to share the metadata schema in a FAIR way, one can argue that this falls under R1.3: "(Meta)data meet domain-relevant community standards". It would in any case be in "the spirit of FAIR". 4. As a consequence of issue 3, the validation functionality is also specified implicitly in the Java source code and does not seem to reuse much external validation functionality. I particularly miss validation of ontology terms against the relevant ontologies, as well as more stringent validation of PMIDs, DOIs etc., preferably using CURIEs instead of URLs. All of these data types only seem to be validated as general strings, which is of limited use. Users might for instance introduce spelling variants for ontology term labels without this being detected by the validator. 5. Due to the hard-coded nature of the metadata schema, the validator and the conversion functionality, I suspect the authors might not have designed the system flexibly enough to allow for easy updates based on updates in the external dependencies, i.e. the minimal information checklists, ontologies, or deposition schemas. For instance, EMBL-EBI, who are hosting ENA, are moving towards requiring the submission of sample data/metadata to BioSamples prior to submitting the metadata to ENA, which might have consequences for the checklist requirements. Also, ontologies in particular are known to be updated regularly. 6. 
I am not convinced that the authors have done a careful enough search of the literature to list relevant software solutions for comparison. For instance, the FAIRDOM Seek solution (https://doi.org/10.1186/s12918-015-0174-y) is not cited directly, although the functionality seems to be highly overlapping. 7. The manuscript would benefit from careful proofreading of the language and grammar. When addressing these issues, I would urge the authors to better demonstrate "innovation in the approach, implementation, or ... added benefits".
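The stricter identifier and ontology-label checks requested in issue 4 could be approximated as in the following sketch. It is not the FAIR-DS validator: the `doi:` and `pmid:` patterns are simplified assumptions, `12345678` is a placeholder identifier, and the set of ontology labels is a toy stand-in for labels loaded from a real ontology.

```python
import re

# Simplified, assumed patterns for the local part of each CURIE prefix.
CURIE_PATTERNS = {
    "doi": re.compile(r"10\.\d{4,9}/\S+"),
    "pmid": re.compile(r"[1-9]\d*"),
}


def validate_curie(value):
    """Return True if value looks like prefix:local_id with a known prefix."""
    prefix, sep, local_id = value.partition(":")
    pattern = CURIE_PATTERNS.get(prefix.lower())
    if not sep or pattern is None:
        return False
    return pattern.fullmatch(local_id) is not None


def validate_term_label(label, ontology_labels):
    """Accept only labels that exactly match a known ontology label."""
    return label in ontology_labels


toy_labels = {"soil", "sea water"}  # hypothetical stand-in for ontology labels
print(validate_curie("doi:10.1093/bioinformatics/btac195"))  # CURIE form
print(validate_curie("https://doi.org/10.1093/bioinformatics/btac195"))  # URL form
print(validate_term_label("Soil", toy_labels))  # spelling variant
```

Here the URL form and the capitalised spelling variant are rejected, whereas a check that treats every value as a general string would accept both.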
2. GigaScience 10 Apr 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1-Dominique Batista: Overall a strong paper that creates a new bridge between the ISA model and the FAIR principles. A few points should be addressed: - page 2: "As one Investigation can have several research lines, each Study layer has a unique identifier ...": how do you generate these identifiers and control their uniqueness, persistency and stability? Are these identifiers resolvable? "As an extension to the original three-tier ISA-model in between Study and Assay two additional layers of information were added Observation unit and Sample": would you clarify what problems were addressed? More generally speaking, does the FAIR-DS integrate with existing implementations of the ISA model? Did you consider a conversion and submission to external systems such as the ones mentioned in the conclusion? The text for figure 1 is good, but the corresponding text in the core of the document is hard to read and understand. "Model specific attributes are optionally selected by the user": Does this mean users can add extra fields on top of the provided packages, or that they have to select fields within the given package? - page 3: "In addition, we included regular expressions obtained from the ENA checklist, such as "(0|((0\.)|([1-9][0-9]\.?))[0-9]*)([Ee][+-]?[0-9]+)? (g|mL|mg|ng)" for sample volume or weight for DNA extraction": good point. Is there a mechanism for users to add new regexes?
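To see what the quoted checklist pattern accepts, it can be tried with Python's standard `re` module, as in the sketch below. The pattern is transcribed from the quotation above, with `\.` taken to be an escaped decimal point; this reading is an assumption. The helper name `is_valid_amount` is hypothetical.

```python
import re

# Pattern transcribed from the quotation; `\.` is read as an escaped decimal point.
AMOUNT_PATTERN = re.compile(
    r"(0|((0\.)|([1-9][0-9]\.?))[0-9]*)([Ee][+-]?[0-9]+)? (g|mL|mg|ng)"
)


def is_valid_amount(value):
    """Return True if the whole string matches the quoted pattern."""
    return AMOUNT_PATTERN.fullmatch(value) is not None


for candidate in ["0.5 mg", "12.5 ng", "25E-1 mL", "12.5 kg", "12.5mg"]:
    print(candidate, is_valid_amount(candidate))
```

Under this reading, values with an unlisted unit or without the separating space are rejected. The pattern as quoted also rejects single-digit values such as `5 mg`, which may mean that a quantifier was lost in the quotation.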

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.03.502622v1
Mar 2023
www.biorxiv.org www.biorxiv.org

Genome assembly of the deep-sea coral Lophelia pertusa

1
1. GigaScience 29 Mar 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.78), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Takeshi Takeuchi
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  Scaffolding with the Chicago and Hi-C libraries did not significantly improve the assembly. In general, Hi-C scaffolding can produce a chromosome-scale assembly. I would suggest that the authors describe the quality of the Chicago and Hi-C sequence data. For example, the mapping rates of the Chicago/Hi-C reads to the assembly should be informative.
  
  Reviewer 2. Yang Zhou
  
  This is a fascinating study on the assembly of the first deep-sea scleractinian coral, Lophelia pertusa. The manuscript is well-written and easy to follow. I have gone through your manuscript and would like you to address the following concerns/comments before publication.
  
  Line 47: Does "1.2 454 pyrosequencing reads" mean 1.2 Gb of 454 pyrosequencing reads? Line 51-52: Please add some references. Line 72: As far as I know, the DNA extraction process of stony corals is affected by calcium carbonate skeletons. How did you deal with this problem during the DNA extraction process? References: Please double-check the references for errors: italics for species names, capitalization of journal titles, and so on.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.27.530183v1
www.biorxiv.org www.biorxiv.org

The first genome assembly of the amphibian nematode parasite (Aplectana chamaeleonis)

1
1. GigaScience 29 Mar 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.79), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Xuanmin Guang
  
  Han et al. carried out a genome assembly of Aplectana chamaeleonis, analysed the genome’s repeat content and annotated the genome. They described the function of the gene set and performed a PSMC analysis. The genome is a key resource for research, but there are many mistakes in the manuscript. I suggest the authors revise the manuscript carefully; the grammar and content should be reorganized. Some suggestions are listed below:
  
  In the context part, the first two sentences lack logical continuity; please change them.
  
  The authors didn’t mention which sequencing platform they used in the context; I think this should be added.
  
  The average sequence length in the table is 496 kbp, but the authors give it as 496 Mbp; this is a mistake.
  
  In table 1, why aren’t there any gaps in the scaffold genome?
  
  The authors said that “This suggests that the significant expansion of repeating elements is an important manifestation of species differences”. It is unreasonable to reach this conclusion based only on your genome repeat analysis.
  
  In the text they claim that 12887 genes were functionally annotated; how many genes were annotated in total? Please add this to the manuscript.
  
  Too many decimal places have been used in Table 2.
  
  Re-review: The authors have revised the paper to address the concerns in my report, and the paper can be accepted now.
  
  Reviewer 2. Jianbin Wang
  
  In this manuscript, Hou et al. present a genome assembly for Aplectana chamaeleonis, a parasitic nematode that infects amphibians. They report a genome of ~1 Gb, most of which is composed of repetitive elements. This genome draft is significant as it is the first assembled for this or any Cosmocercidae species. It may provide insights into the evolution of the nematodes – if it is thoroughly compared to other nematode genomes. It may also allow for better species identification than previous morphological methods. While the conclusions on genome size and composition described in the paper appear sound, there are many questions that go unanswered. The reasoning behind why this research was undertaken is not clear. What is the ecological or agricultural and economic impact of the species? How would the genome provide a better understanding of this species? More specific information is also needed to better understand the genome. How many chromosomes does this species have? Is there any cytology to help answer this question? Any notion of sex chromosome vs. autosome? This genome is much bigger than most of the assembled parasitic nematode genomes. The authors did not make any effort to explain what might contribute to this. Could the large size be due to contamination in the samples used? Judging from the images, it is not clear to me how clean the sample was for the genomic DNA extraction. Overall, there is a lack of in-depth data analysis and comparison between this genome and many other available nematode genomes. About the overall presentation and organization of the manuscript, the context is often lacking from results. How do these results compare to related species? How does figure 4/the demographic history fit into this story? A round of general proofreading needs to be done for grammar, punctuation, capitalization, italics, etc. – see below for some specific examples. 
In the Abstract, the repeat content in the Ascaris genome is 72.45%, and the total length is more than 742 Mb. The math does not add up (1.1 Gb x 72.45% = 797 Mb). Or do you mean the Aplectana genome? Should say total length of repeats. Why is this the “Ascaris” genome? Ascaris is a parasite that infects pigs and humans. Some sentences need addressing/clarification: Page 1. “and their diversity is also very high, many of which are above the national second-level protected animals” – what is the significance of this/how are these ideas related? Page 2. “Through the characteristics of the genome sequence, it shows that the genome is a highly continuous genome” – needs to be more specific, with metrics and data. Page 4. “In addition, the enrichment of A. chamaeleonis genes in all metabolic pathways was found in twelve metabolic pathways.” – not sure what you are trying to say about all or 12 pathways. Figure 1. – Images need scalebars. In A, what is the mat of material? For A, crop out the area around the worm and enlarge the worm image. In B, the worm is dark/shows little contrast or detail. In C, label which image is the head and which is the tail (or specify left vs. right in the legend text). The images in B and C look like they were taken using a cell phone pointed at a computer monitor – are there higher quality images? Table 1. – Why is the data in all four columns exactly the same? What is the difference between each column? This appears to be a mistake made when preparing the table. Very sloppy and unfortunate! Table 2 – Significant figures on the %s? Is the “other” category needed (same for Fig 2C)? Table 3 – Check text spacing (e.g. % in genome). Figure 3 – I recommend redoing the spacing of the figures and increasing the size of the text in each part of this figure. Need to refer to parts of figures in the body text (Fig 3a vs. 3b vs. 3c). Can 3b be sorted from the largest number of genes to the smallest? Figure 4 is not referenced in the body text. Consider merging Fig 4 with Fig 3. 
Figure 4 is lacking a description in the legend – what are the grey lines, and what is the definition of LGM? The x-axis scale and orientation are unintuitive – is the present on the left and the past on the right? The past should be on the left. Methods: “Genomic DNA was purification for Long-reads libraries preparation” – should say “purified”. What is the meaning of “The generation we used was 0.17” – what generation is this? Also, “the mutation rate was 9×10-9” needs units. The sentence “we used the pairwise sequentially Markovian coalescent (PSMC) model to estimate the effective population size of A. chamaeleonis within last million years.” should be moved to the section immediately after its current location.
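
The repeat-length arithmetic questioned in the Abstract comment above can be checked with a short sketch. It uses only the three figures quoted in the review (1.1 Gb, 72.45%, 742 Mb) and assumes 1 Gb = 1000 Mb; the variable names are illustrative and not taken from the manuscript.

```python
# Figures quoted in the review comment; nothing here comes from the manuscript itself.
assembly_gb = 1.1           # assembly size used by the reviewer, in Gb
repeat_fraction = 0.7245    # 72.45% repeat content
reported_repeat_mb = 742    # repeat length as reported ("more than 742 Mb")

# Assumption: 1 Gb = 1000 Mb.
expected_repeat_mb = assembly_gb * 1000 * repeat_fraction

# Assembly size that would be consistent with 742 Mb of repeats at 72.45%.
implied_assembly_mb = reported_repeat_mb / repeat_fraction

print(f"expected repeat length: {expected_repeat_mb:.0f} Mb")
print(f"assembly size implied by 742 Mb of repeats: {implied_assembly_mb:.0f} Mb")
```

Under these assumptions the expected repeat length is about 797 Mb, as the reviewer states, while 742 Mb of repeats at 72.45% would correspond to an assembly of roughly 1024 Mb. The discrepancy may therefore come from rounding the assembly size up to 1.1 Gb, although only the authors can confirm which figures were intended.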
  
  Re-review: Overall, the writing has been improved in several places and is somewhat clearer than in the previous draft. These changes are mostly related to the minor concerns raised. However, many questions related to the broader impact of this research and how the new genome compares to other nematode species remain unanswered. The following comments were largely ignored. 1. The reasoning behind why this research was undertaken is not clear. 2. What is the ecological or agricultural and economic impact of the species? How would the genome provide a better understanding of this species? 3. More specific information is also needed to better understand the genome. How many chromosomes does this species have? Is there any cytology to help answer this question? Any notion of sex chromosome vs. autosome? 4. This genome is much bigger than most of the assembled parasitic nematodes. The author did not make an effort to explain what might contribute to this. 5. Overall, there is a lack of in-depth data analysis and comparison between this genome and many other available nematode genomes. How do these results compare to related species? 6. About the overall presentation and organization of the manuscript, the context is often lacking from results. Another round of general proofreading needs to be done for grammar, punctuation, capitalization, italics, etc. – see below for additional specific examples. The authors, not the reviewers, need to make a concerted effort to read and proofread their own manuscript.
  
  In addition to the big picture points raised above, several other issues that were either brought up last time or are new need to be addressed: 1. Not sure Table 1 is presented the right way. The columns and rows should be reversed, I think. If so, there will be only one column - do you still need a table? 2. “Through the characteristics of the genome sequence, it shows that the genome is a highly continuous genome.” Unclear. The authors mentioned that they have fixed this in their response to the reviewers, but no change was seen in the updated manuscript. 3. “The generation we used was 0.17, and the mutation rate was 9×10-9 [8].” These numbers need units after them. Again, this was addressed in the response but not written out or clarified in the revised text. 4. “In addition, the enrichment of A. chamaeleonis genes in all metabolic pathways was found in twelve metabolic pathways.” Not sure what the authors were trying to say about the all or 12 pathways. Still confusing. 5. Photographs of the worms are still lacking scale bars. 6. Make sure that all genus and species names are italicized (in body text and in Fig. 3). 7. Make sure section heading format is consistent (check capitalization). 8. “The results showed that 91 % of the sequences were compared to Arthropoda (1898/2088) and 7 % were compared to Arthropoda (122/2088).” Both of these say Arthropoda - is that a mistake? Also, "compared to" is not the correct wording; maybe "similar to"? 9. The LGM acronym is defined after the second use of "last glacial period"; it should appear after the first use. Also, LGM stands for last glacial maximum, not period. This should be corrected.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.20.533390v1
www.biorxiv.org www.biorxiv.org

3D-Beacons: Decreasing the gap between protein sequences and structures through a federated network of protein structure data resources

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1: Lim Heo
  
  In this manuscript, the authors describe a new platform, 3D-Beacons, which is an interface for accessing multiple sources of computational protein models (e.g., AlphaFold DB, SWISS-MODEL) and experimentally determined structures. As the number of protein sequences increases much faster than experimental structure databases (e.g., PDB) grow, computational protein structure models are great alternatives for proteins that do not have experimentally determined structures. Nowadays, many accurate protein models have become available thanks to decades of progress in template-based modeling techniques and recent advances in de novo protein structure prediction methods using machine-learning approaches. However, those model sources were scattered across their own databases, so there have been difficulties in accessing these models. Thus, in my opinion, the development of a new database or platform, 3D-Beacons, for accessing various computational models is a great step for the structural biology field. The manuscript describes the platform and some technical details well. I have a few minor comments on this work. 1. I recently noticed that RCSB PDB also made it possible to search computational protein models by extending its web interface. The database included ~1 million models from AlphaFold DB and ~1,100 models from ModelArchive, which are main sources of this work as well and are maintained by some of the authors of this work. Even though the number of models and the diversity of the sources accessible via the RCSB PDB interface are smaller than in this work, I think the purpose of both works is similar. As there are some overlaps between this work and the RCSB PDB interface in terms of data providers (and authors), what is the significance of this work compared to the RCSB PDB interface? 2. Most computational models rely on a few data providers: AlphaFold DB, SWISS-MODEL Repository, and AlphaFill (for ligands). 
In my opinion, it would be better to make the platform richer by recruiting more diverse data providers with different points of view (e.g., conformational ensembles) or different modeling approaches (e.g., machine learning-based approaches with pre-trained protein language models such as OmegaFold). Is there any plan for such progress or promotion of the platform? 3. It would be better to have a guide to model selection if a search returns multiple models for a UniProt ID. Alternatively, providing universal quality assessment scores for models would be an option (via an additional data provider). Currently, pLDDT scores are provided, but they are difficult to compare between modeling methods as they were trained independently for each method. 4. I was able to search on the 3D-Beacons web page a few days ago. However, I could not at the moment of writing these review comments (Sept. 13, 6 p.m. EDT).
  
  Reviewer2: Carlos Rodrigues
  
  This manuscript describes in detail the 3D-Beacons platform/initiative, which aims to facilitate access to 3D data as well as meta-information about experimentally determined and computationally predicted protein structures. This resource is very valuable for the broader scientific community at a time when the amount of protein structure data available is rapidly increasing and many structures may be available for the same protein. A minor correction is required on page 7, where the authors describe 4 different types of protein structures: Experimentally determined, Template-based, Ab-initio and Conformational Ensembles. In many examples available on the website (e.g. https://www.ebi.ac.uk/pdbe/pdbekb/3dbeacons/search/P15056), there is one extra category, which is structures derived from "Deep learning" methods. I am assuming this comprises a subset of Ab-initio structures, which the authors decided to keep as a separate category after submitting this study for publication. The main text should be updated to reflect this change, as should Figure 4.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.01.501973v1
www.biorxiv.org www.biorxiv.org

Making Common Fund data more findable: Catalyzing a Data Ecosystem

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1: J. Harry Caufield
  
  This manuscript by Charbonneau et al. details efforts to address challenges in enhancing the value of metadata among projects in the NIH's Common Fund Data Ecosystem. They specifically detail how a new metadata model was developed and deployed to unify data properties across projects. Assembling such a model is a major accomplishment and a necessary step in promoting data reuse. Applying the model is another commendable achievement. The manuscript text undersells the value of these efforts. How has the value of data in the CFDE improved due to implementation of a unified metadata model and new infrastructure? The authors clearly delineate the challenges in searching CFDE data; these issues frequently appear in efforts toward improving biomedical data FAIRness and are directly relevant to the core challenges identified by Wilkinson et al. (2016) in their FAIR guiding principles. Much more emphasis could be placed on the overall impact of a consistent metadata model, whether within the CFDE alone or in the broader realm of bio-data management. Major issues: 1. As noted on page 11, "All C2M2 controlled vocabulary annotations are optional". Data producers will use terms outside the controlled vocabulary as needed, and are unlikely to consult any CFDE working groups in every instance. Is there some automated system for term normalization in place? How will data producers be encouraged to preferentially use controlled terms? Are they warned during submission, as noted on page 22 regarding data contents? Minor comments: 2. The first example of the mismatch between user expectations and actual results of searching for Common Fund program data is very illustrative. I appreciate how it notes that even instances like matching Dr. Phil Blood's name in a search can complicate Findability. 3. The abstract could include some brief description of the broader relevance and impact of the metadata model, including its potential for use outside the CFDE. 4. 
On page 5, the sentence "Thus, a researcher interested in combining data across CF programs is faced with not only a huge volume, richness, and complexity of data, but also a wide variety, richness, and complexity of data access systems with their own vocabularies, file types, and data structures" feels somewhat redundant and could benefit from some editing. 5. The structure of Figure 1 (or should this be Table 1?) is confusing. The general idea is clear - metadata types, properties, and formats are inconsistent across projects - but the two-column format presents issues with direct comparison. 6. It is interesting that, among all values presented in Fig. 1, just one includes a CURIE (HMP's ENVO:02000020). This may be worth further comment, as it is striking that few of these projects have adopted unique identifiers within their metadata schemata. 7. Slightly more detail regarding the interviews with Common Fund programs would be helpful for understanding how these interactions contributed to the process. Were interviews primarily with PIs? Were several prominent issues repeatedly discussed in the context of multiple projects? 8. Is the C2M2 master JSON schema publicly accessible? 9. Some redundancy is present between the first and second paragraphs under the heading "Entities and associations are key structural features of the C2M2" - e.g., core entities and container entities are both described twice. 10. In Figure 2, some lines connecting tables are very close to the edge of the figure borders and are difficult to see as a result. 11. Is there a mechanism for dealing with obsolete terms as the ontologies contributing to the controlled vocabulary change? In the event that the NCBI Taxonomy renames a genus, for example, how will CFDE metadata change (if at all)?
  
  Reviewer2: Carole Goble
  
  The article is a very useful contribution to the growing number of metadata models and data catalogues in the life science data ecosystem. The recent NIH mandates in data sharing emphasize the need for findability of datasets, and the need to operate within a federation and ecosystem recognises the reality of independent data centers and legacy data collections. The paper states the context of the CFDE well, setting up the need for a centralized portal capable of ingesting, indexing, searching and supporting cross-dataset comparisons of datasets from different, independent data centers without the need for those centers to move, reformat or rehost their data. This is a common pattern that many data infrastructure providers will recognise. The incremental approach that supports minimal uploads and respects local DOI implementation is a pragmatic approach that has made onboarding the data centers feasible, I suspect. The insight that mapping to common ontologies does not actually lead to harmonised datasets, nor does it support search, is a useful lesson that resonates and is useful to reiterate (although it is already well known). Given the approach is tabular, Frictionless Data makes sense. The process of working with the Centers is interesting, as is the choice of three core entities. Some more discussion on why these three and only these three would be appreciated. The ingest pipeline and process is not so clear. - It seems that each Center is required to map its datasets to the current C2M2 model in 48 TSV files, in a data package that is then uploaded to the catalogue and ingested into the portal's database. Is the data package a complete reupload each time or is the data package additive? There are hints in the text that it is a replacement each time. - What is the cost and complexity of this mapping and upload borne by the Centers? Any insights would be valuable. Is any tooling provided to help beyond the documentation? 
- Figure 5 could be improved to include the data that flows between the steps, and the actors. Could Figures 3 and 5 be merged? - If the datasets are reuploaded afresh each cycle, how are between-release analytics managed? By the use of the PIDs? Are there any restrictions on what cannot be changed between releases? - As the datasets can be incrementally improved with each release, are there any trends between releases that indicate changes in metadata enrichment? On page 18 you state that "DCCs get better at using the C2M2". - The data package needs a clearer description: the relationship between the TSV files, the Frictionless Data JSON and BDBag is of interest to many in the community and warrants a more thorough discussion. The portal - Why were these three basic kinds of search chosen? Were there user stories collected from the listening tour? - It would be helpful if there were some indications of the use of the catalogue by users rather than just the ingest and publishing pipeline. Page 5: the argument is made that reusing Common Fund data for cross-cutting analysis is challenging and requires the hiring of dedicated bioinformaticians ("at considerable cost of NIH"). How does making the datasets available through a catalogue relieve the burden on skilled bioinformaticians to analyse data? The data still needs to be processed. Hasn't the burden just shifted to the Centers to prepare the TSV files for the ingest pipeline? Page 5 claims that the sociotechnical framework of the CFDE is a self-sustaining community. How? Working groups have been established, but to what extent are these managed by the community and not the dedicated action of developing the portal? What is the sustainability of the portal? The easy expansion of the C2M2 seems to depend on two things: the incorporation of domain-specific vocabularies and the cycle of ingest-releases at time points. Does this latter point constitute expansion that is easy? 
This would require each data center to adapt to the new table templates. Page 9: Containers are mentioned, but it is not clear what the difference is between a container and a collection. Containers do not seem to appear again in the browsing. Page 18: the visibility of Biosamples changing over time in Figure 4 wasn't so clear to me.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.11.05.467504v4
www.biorxiv.org www.biorxiv.org

Utilizing artificial intelligence system to build the digital structural proteome of reef-building corals

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1: Jianyi Yang
  
  The authors present the predicted structures for the proteome of the reef-building corals. 8382 protein sequences were obtained by experiments and fed into ColabFold for structure modeling, generating 8166 structure models. Overall, this is a valuable study toward the understanding of reef-building corals. Here are a few comments for possible improvement. 1. Proteome-wide structure prediction has become trivial nowadays with AlphaFold2 and other methods. I think the major contribution of the current study is the determination of the proteome sequences rather than the structure prediction. Thus, I would encourage the authors to spend more effort on analyzing the sequences, for example, how the sequences cover the Pfam families, how redundant the sequences are, how much they overlap with the sequences in UniProt, etc. 2. It may be meaningful to compare the predicted structure models to the SCOP or CATH database to see the fold distribution and whether there is any new fold. 3. What happened to the ~200 proteins for which ColabFold failed to produce a model? 4. I suggest adding a browse function to the server for browsing the data.
  
  Reviewer2: Brendan Robert E. Ansell
  
  Zhu and colleagues report the generation of predicted protein structures via AlphaFold for three coral species: A. muricata, M. foliosa and P. verrucosa. Mass-spec analysis of the proteome of the three species is also performed. The authors describe a handful of structures that appear to be orthologues across the species and may have functions as pore-forming toxins, in calcium deposition and in host-symbiont interactions. The generated protein structures will be of use to the scientific community and the web server is quite good. Major comments: Please ensure that the entire structure repository is available for unrestricted download as per http://corals.bmeonline.cn/prot/release.php. Incorrect use of 'co-expression'. I assume the authors mean protein orthologues (i.e., homologues across species). Please replace with 'homologous proteins' throughout, including in http://corals.bmeonline.cn/prot/release.php. The link from 'CoralBioinfo' gives a 404 error: http://corals.bmeonline.cn/index.php. In http://corals.bmeonline.cn/blast/, please include a link back to http://corals.bmeonline.cn/prot/. Although the manuscript lacks bioinformatic analysis of the structural proteome, this is not required for the data note category but would enhance the value of the publication if provided. In terms of validation, there is a technical control for the AlphaFold instance that this project used, which the authors should include. Specifically, please report the RMSD between structures predicted in this work and the published AlphaFold structures for the same proteins: Acropora muricata (20 proteins), Montipora foliosa (8 proteins) and Pocillopora verrucosa (70 proteins), available at e.g. https://alphafold.ebi.ac.uk/search/text/Montipora%20foliosa%20?organismScientificName=Montipora%20foliosa. Please detail in the methods how the mass spec data relates to improving the genome or proteome annotation of each species. How was the mass spec data used? 
I presume it was used to identify 3-way orthologues between the species, and to produce the "8,382 co-expressed proteins" that were selected for structural prediction. The data dump would be stronger if the mass spec proteomics data was also made available. What proportion of the structural proteome has mass-spectral support? Please include a supplementary text file containing the key features of each predicted protein, e.g. % high confidence structure, gene ID, InterPro domain annotations, and top BLAST homologues. The long proteins could be split by domain to provide some structural information. To boost the value of this data, the authors might also consider predicting the coral symbiont proteomes, followed by integrative analysis of host and symbiont proteomes to predict interacting partners. What are the domain and sequence features of the low and very-low confidence predictions? Is the reference genome available for any species? What is its completeness and content? How does the mass spec and structural data improve the genome annotation and vice versa? At present, large parts of the discussion are irrelevant. Comments about COVID-19 and the role of bioinformaticians are outside the scope of a research report. Minor comments: Comment on whether toxicity is reported for these coral species. Use full genus names on first use. Proofreading of grammar is required throughout, as is elimination of non-scientific phrasing. Drop irrelevant arguments regarding COVID-19 and the call to arms for bioinformaticians.
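
The RMSD control requested in the major comments above could be computed along the following lines. This is a minimal sketch, not the reviewer's or the authors' procedure: it assumes that matched atom coordinates (for example C-alpha atoms in the same residue order) have already been extracted from the two models into equal-length arrays, it uses the standard Kabsch superposition with NumPy, and the function name `kabsch_rmsd` is hypothetical.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Assumes row i of coords_a corresponds to row i of coords_b
    (e.g. the same C-alpha atom in both models).
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")

    # Remove translation by centring both coordinate sets on their centroids.
    a_centred = a - a.mean(axis=0)
    b_centred = b - b.mean(axis=0)

    # Kabsch algorithm: SVD of the 3x3 covariance matrix gives the optimal rotation.
    covariance = a_centred.T @ b_centred
    u, _, vt = np.linalg.svd(covariance)

    # Guard against an improper rotation (reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

    # Rotate the first set onto the second and measure the residual deviation.
    a_rotated = a_centred @ rotation.T
    squared_dist = np.sum((a_rotated - b_centred) ** 2, axis=1)
    return float(np.sqrt(squared_dist.mean()))
```

Applied to each protein shared between this work and the AlphaFold database, such a function would yield one RMSD value per protein. How residues are matched between the two models, and whether low-confidence regions are excluded, are choices the authors would need to state.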

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.27.497859v1
www.biorxiv.org www.biorxiv.org

What the Phage: A scalable workflow for the identification and analysis of phage sequences

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer1: Satoshi Hiraoka
  
  In this manuscript, the authors developed a new tool, What the Phage (WtP), for comparison of the output from multiple bioinformatics tools to predict phage sequences from genomic or metagenomic datasets. The purpose of this study is more or less meaningful. As the authors described in the Introduction section, currently it is difficult to predict reliable viral genomes, especially from culture-independent metagenomic datasets, precisely because of the lack of knowledge about viral genomes in current protein/genome databases. Many bioinformatics tools have already been proposed and some of them are widely used in microbiology; however, the outputs from these tools frequently vary and conflict with one another, and there is no good integrative platform to compare the outputs. Here, the proposed tool easily generates well-summarized output derived from multiple tools, and thus the tool might facilitate the analysis of phage prediction in the field of microbiology. Indeed, the authors conducted one (but only one) case study using real phage genomes and reported reasonable performance. I feel the tool has some potential to contribute to the wide field of viral genomics. However, users of this tool should keep in mind that the tool just summarizes the output of multiple phage-prediction tools, meaning that it does not evaluate the reliability of the output, as described in the Discussion section. I thus feel the tool may sometimes lead to misunderstandings or confuse users rather than help them. It should be emphasized that the majority decision among the multiple tools does not always bring the best result. Users may need further detailed analysis for the precise prediction of viral genomes from metagenomes. Also, I feel that, because the development of bioinformatics tools is quite rapid, integrated platforms like WtP will be outdated very soon without continuous effort for maintenance and upgrades to assimilate future novel tools. 
I understand that the 'sustainability' of the tool is out of the journal's scope, but a perspective on this point would be better described in the manuscript or on the GitHub page. I have some suggestions that would increase the clarity and impact of this manuscript if addressed. [Background] Some tools (e.g., VirSorter2) can be used to predict viruses other than common bacteriophages, e.g., NCLDV and virophages (see the original article on VirSorter2). Those kinds of viruses should be described briefly in this section, as well as common dsDNA phages. Assembly-free long reads are described here, but I think this is a bit far from the scope of this manuscript. Indeed, the dataset used in this study (ERR575692) is derived from Illumina HiSeq, and performance on assembly-free long-read datasets was not analyzed in this study. I think the descriptions could be moved to the Discussion section rather than the Introduction. Instead, it would be better to add more engaging descriptions of studies of phage genomes identified from short-read metagenomes to emphasize the importance of phage prediction and the value of the proposed tool, WtP, e.g., the history of viral genomics using metagenomic datasets, recent technical improvements in metagenomics, the phylogenetic diversity of phages, the discovery of novel phage lineages from environmental metagenomes, etc. Only 5 of the 11 tools that are used in WtP were introduced here. It would be better to also cite the remaining 6 tools here, with a brief explanation of their strategies for virus prediction. Also, MARVEL was cited here but not used in WtP. [Design and Implementation] Figure 1 is different from the one on the GitHub page ( https://mult1fractal.github.io/wtpdocumentation/figures/wtp-flowchart-simple.png ), which seems to be better than Figure 1. What does 'DAG' mean? [Prediction and Visualization] 'a metagenome assembly' could be rephrased as 'metagenomic assembled contigs'. Metaphinder and Seeker are listed here with 'no release version'. 
I understand the situation, but I feel this description is not good for reproducing the analysis. To specify the version of a tool even if it lacks an official release version, mentioning the last commit date (for Metaphinder, Aug 10, 2021) or the GitHub commit ID ( bebc447d00ec9ff9f4960f38b627d8651262ff72 ) is likely a good way. [Functional annotation & Taxonomy] In this manuscript, Prodigal was used for gene prediction. However, accurate gene prediction from phage genomes is still difficult (see https://academic.oup.com/bioinformatics/article/35/22/4537/5480131). This fact has affected both phage prediction and functional gene annotation in the field of virology. I think the difficulty of gene prediction from phage genomes and the potential room for improvement should be noted in the discussion section. [Result report] The sentence ' ~ IMG/VR, iVirus, or VERVE-NET' here should have appropriate citations or URLs. I found a paper on iVirus: https://www.nature.com/articles/s43705-021-00083-3 [Other features] WTP -> WtP [Analysis] Figure 3. X-axis title of the bottom-left bar plot and Y-axis title of the top-right bar plot: viral -> phage. What do 'prediction values' mean? Are these scores generated by each prediction tool? Figure 4. X-axis texts: unify the format to either NodeID:assignment (e.g., NODE_5:unknown) or assignment:NodeID (T3:NODE_14). ' The sequences matched with 100% identity to Salmonella enterica (Salmonella enterica strain FDAARGOS_768 chromosome, complete genome), but not to prophage sequences. ' here. Does the sentence mean that the contigs NODE_5 and NODE_8 were mis-predicted as prophage by CheckV? Table 1. completeness -> completeness (%) [Discussion & potential implications] Add a citation in the line ' At least one multitool approach was implemented on a smaller scale by Ann C. Gregory et al. (comprising only VirFinder and VirSorter). ' [References] 16. Lacks DOI. 18. Lacks DOI. 19. Lacks DOI.
  
  Reviewer2: Huaiqiu Zhu
  
  In this manuscript, the authors developed an integrated workflow, WtP, for identification, annotation and taxonomy of phage sequences. Based on Docker and Nextflow, WtP integrates 11 phage sequence identification tools (including 14 approaches), two functional annotation and taxonomy tools (Prodigal and HMMER), and a visualization tool (chromoMap). When using WtP, it is convenient that users do not need to install each tool and can avoid conflicts between installation packages and between operating systems. Also, the WtP tool was applied to an artificial microbiome. The threshold of each phage sequence prediction tool can be manually adjusted and outputted. Annotation and taxonomy results of phage sequences can be further visualized by CheckV and by the chromoMap tool. However, there are some limitations in this manuscript. For the annotation and taxonomy stage, only the Prodigal tool was used for gene prediction, and no other gene prediction tools (especially phage-specific tools) were included. It is necessary for an integrated workflow to include other similar tools. WtP needs at least 4 GB of memory and 75 GB of storage, so the authors should develop a web version or at least a graphical interface version of WtP to encourage wider adoption. Major comments: 1. Besides sequence identification, host prediction (e.g., HoPhage, PHP, and VirHost Matcher-Net) and lifestyle prediction (e.g., DeepPhage, PhagePred) of phage sequences are also important in microbial communities. However, WtP does not include those functions. 2. In addition to the web version or graphical interface version of WtP, the authors could also consider a video demo or usage illustration. To clarify the purpose of this study, I think it would be better to add the phrase 'a web server of ...' or 'a GUI platform of ...' to the title. 3. In the 'Analysis' Section (Page 12), only four contigs of phage sequences can be annotated in the artificial data: P22 (NODE_12), T3 (NODE_14), T7 (NODE_13) and phiX174 (NODE_30). 
The 'predicted_organism_name' of the remaining 102 phage contigs is 'no match found'. Can WtP improve or add more databases to annotate more contigs? 4. In the 'Analysis' Section (Page 14), the authors mention 'No specialized phage assembly strategy or any cleanup step was included during the assembly step'. I think this is unreasonable, and the downstream analysis will inevitably be affected by the impurity sequences. 5. In Figure 2, it is possible to export results in the form of 'csv', 'pdf' or 'excel'. Can WtP export all the predicted phage sequences in the form of 'fasta'? The authors should describe how to change or add the database during the annotation and classification phases. Minor comments: 1. In the 'Functional annotation & Taxonomy' Section (Page 8), 'Figure 3' in the sentence 'All annotations are summarized in an interactive HTML file via chromoMap (see Figure 3)' should be 'Figure 4'. 2. The 'Completeness' column in Table 1 is missing its unit, and the authors could add an outer border to Table 1. 3. Figure 2 and Figure 3 need to be clearer. 4. Page 5. 'approach to gain' should be 'approach to gaining'. 5. Page 13. 'In addition to' should be 'In addition to'.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.07.24.219899v3
www.biorxiv.org www.biorxiv.org

Defining the Characteristics of Type I Interferon Stimulated Genes: Insight from Expression Data and Machine Learning

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer 1: Milton Pividori
  
  In this manuscript, the authors analyzed different characteristics that are potentially related to the expression of human genes under IFN-a stimulation. A classification model is built to predict ISGs (genes that are upregulated following IFN-a stimulation) from human fibroblast cells. The model also performs feature selection, and the authors used different test sets (on different types of IFN) to validate their model. The authors provide a web server that implements this machine learning model. I liked the introduction; the background and motivation were clear. However, the Results section was a bit hard to follow, in particular the implementation of the machine learning models, with different classifiers applied inconsistently across distinct feature sets. At the beginning of this section, the authors perform extensive manual feature analyses across different feature types (related to alternative splicing, duplication, and mutation) to build a refined dataset. These analyses basically correlate each individual feature with the expression of genes in the presence of IFN-a. I have several concerns here, related mainly to the correlation between features, which I describe below. General comments: * Regarding reproducibility, the authors provide a GitHub repository with source code, the trained model and data. From the documentation and notes in the manuscript (lines 1015-1023), it looks like this can only be run on macOS, which makes it very hard for me to test (I'm a Linux user). I recommend that the authors read and follow the article "Reproducibility standards for machine learning in the life sciences" (https://doi.org/10.1038/s41592-021-01256-7). Having, for instance, a Docker image to download and run your analyses would be fantastic. * The authors perform a comprehensive analysis of features that differentiate different gene classes. 
I wonder why didn't they use first a machine learning model to automatically find these important features, and then try to analyze which features were selected (instead of the other way around as done in the study). I think there is perhaps too much manual feature engineering in the previous steps of training an ML model. * Related to the previous point, in my comments below one of my concerns is about feature correlation. The authors compare individual features regarding their ability to separate different gene classes (ISG vs background vs non-ISG). But one can imagine that some features are highly correlated. Some features might not be useful to separate gene classes from a single-feature analysis (as the authors do at the beginning), but they could be useful in combination with other features. Unless I'm missing an important point, I would leave the machine learning model to learn this and then analyze each feature individually after the model identifies them. * Authors are concerned that including too many features in the support vector machine (SVM) model would complicate the prediction task. To remedy this, they manually select the features according to, in my opinion, a more subjective criterion. Why didn't the authors use a feature selection algorithm here? I know that they propose a model including feature selection, but I guess I don't understand well all the previous manual feature analyses. Using a known feature selection method here would provide a more data-driven approach to improve classification, in addition to their manual expert curation (which is also valid). * They run several classification models, but not consistently across the same set of features. For example, only SVM is run across genetic, parametric, all features, etc, but not the other models. Why is that? * The manuscript would really benefit from a figure with the main steps of the analyses performed, models tested, datasets employed, etc. 
It's hard to get the big picture as it is now. Results/Evolutionary characteristics of ISGs: Paragraph between lines 131-148: * I think the window size used (mentioned in the text) should be added to the Figure 2 caption * What's the vertical dashed line? In the text, you say that those at the left of this line are IRGs, but I don't understand the meaning of that vertical line (-0.9 log fold change). This explanation, which I didn't see, should be added to the figure caption also. * From the text, I understand that in the subfigures in Figure 2 you have IRGs, non-ISGs and ISGs. Would it be possible, or meaningful for the reader, to add an extra vertical line to separate them? Results/Differences in the coding region of the canonical transcripts: Paragraph between lines 193-208: * If GC-content is underrepresented in ISGs more than non-ISGs, the ApT and TpA should be expected to be more enriched in ISGs, right? Sounds like a redundant analysis. I would expect these two sequencederived features to be correlated. If this is the case, maybe it would be better to highlight other features instead of a correlated/expected one? * Figure 4: here the authors divided the parametric set of features into four categories and compared their representations among ISGs, non-ISGs and background genes. The figure shows p-values of the tests on the y-axis, and the four categories of features on the x-axis. I think it's important to run a negative control: could you please run these tests again, say, 100 times, with gene IDs/names shuffled, and check whether some of these results also appear in these null simulations? Maybe you can keep the same figure, but remove those also found in the null simulations. Paragraph between lines 209-227: * Is it possible that the comparison of codons frequencies (third category of features) is correlated with previous findings (like GC content or ApT/TpA enrichment)? If so, would it be possible that maybe the analysis is also expected or redundant? 
For example, in ISGs there is an underrepresentation of GCcontent, and you also found that ISGs there is an underrepresentation of "CAG" codons. I might be missing something, but aren't these expected to be correlated? Results / Differences in the protein sequence: Paragraph between lines 302-323: * Figure 6: I would suggest adding the same negative control suggested before. Results / Differences in network profiles * I think it's important to define what are all those eight features in the network analyses (closeness, betweenness, etc), otherwise it's hard to follow what comes next. Results / Features highly associated with the level of IFN stimulations * Figures 9 and 10: it would be good to add the sign of the correlation in the figure, in addition to mentioning it in the caption (as it is now). Results / Difference in feature representation of interferon-repressed genes and genes with low levels of expression * Given the unique patterns or differences between non-ISG class and IRG class, wouldn't it be better to perform different analyses excluding IRG genes? The authors also acknowledge these risks in lines 539- 541. Results / Implementation with machine learning framework * It was hard for me to understand the workflow in this section: you used different machine learning models applied to distinct features sets, for example. Why don't you apply the same set of models to the same set of features? I think this section needs an initial paragraph with a global description of what you are trying to do. * For example, I don't think I understand very well the concept of "disruptive feature". What does it mean? * Table 3: I don't understand the threshold selection here. I guess you refer to classification or decision threshold from a model that outputs a probability of a gene to be ISG or non-ISG. 
First, I think there should be a line separating each performance measure to clearly show those that are "Thresholddependent" and "Threshold independent" * I also understand that, during cross-validation, you selected for each model/feature set combination, the threshold that maximized the MCC (this is explained in Table 3 as a footnote, but it should be more explicitly mentioned in the text). * Table 3: What is the "Optimum" set of features? Why is this "Optimium set" only used with SVM? * How does the "AUC-driven subtractive iteration algorithm (ASI)" compare with other feature selection algorithms. * Table 5: you mention this in the text, but it would be good to have an extra column indicating which datasets were used for training and which are for testing. * Figure 13: it would be good to have the AUROC in the figure, not only the curves. Web-server: * I think, in general, that the web application needs to be more intuitive and have more documentation. For example, the main interface says "Predict your human genes of interest", what does that mean? What does it predict?
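  The threshold selection discussed for Table 3 can be illustrated with a minimal sketch. It assumes a classifier that outputs one score per gene and binary labels (1 = ISG, 0 = non-ISG); the function names `mcc` and `best_threshold` and the toy data are hypothetical and are not taken from the authors' code. The sketch scans candidate decision thresholds and keeps the one with the highest Matthews correlation coefficient (MCC).

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(scores, labels):
    """Return (threshold, MCC) maximizing MCC; a gene is called ISG if score >= threshold."""
    best_t, best_m = None, -2.0  # MCC is bounded below by -1
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        m = mcc(tp, fp, tn, fn)
        if m > best_m:  # ties keep the lowest threshold
            best_t, best_m = t, m
    return best_t, best_m


# Hypothetical scores for four genes; the last two are ISGs.
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [0, 0, 1, 1]
print(best_threshold(toy_scores, toy_labels))
```

  If the threshold is chosen this way, selecting it on the training folds only (as the Table 3 footnote appears to describe) keeps the held-out data out of the selection step.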
  
  Reviewer 2: Muthukumaran Venkatachalapathy
  
  First of all, this manuscript is well written and follows a thorough research investigation. I enjoyed reading about interferons, interferon stimulating genes (ISGs), mechanisms and signalling pathways. In the introduction, the authors have highlighted the different methods (including other bioinformatics databases) available to identify ISGs and their potential pitfalls. This unmet need is addressed using in silico approaches, which were used to distinguish interferon stimulating genes from non-stimulating ones in human fibroblast cells. Here, the authors have applied a combination of expression data and sequential/compositional features and designed a machine learning model for the prediction of ISGs from non-ISGs. Apart from features like duplication, alternative splicing, mutation and presence of multiple ORFs, the authors extracted various sequential features and found them to correlate well with ISG prediction. For example, ISGs are prone to GC depletion, and a significant difference in codon usage among ISGs was found. In that context, the authors claim that ISGs are evolutionarily less conserved, and that codon usage features, genetic composition features, proteomic composition features and sequence patterns (especially SLNPs and SLAAPs) are optimal parameters that can cumulatively help in differentiating ISGs from non-ISGs. When it comes to building a machine learning model, the authors faced challenges due to similarities between ISGs and IRGs. They experimented with different algorithms for model building, including decision tree and random forest, and found decent results with the support vector machine.

  Limitation: Model prediction accuracy was close to 70% for type I and III IFN, and the model performed below par when predicting ISGs activated by the type II IFN system. There is scope to improve the model prediction accuracy and extend its usage to type II IFN systems. If the authors could briefly add a few points on how to improve the model accuracy and also highlight the application/impact of this work in their discussion, that would help scientists from other backgrounds to relate to this manuscript.

  Relevance: I believe there are inherent attributes (genetic, compositional, expression) of ISGs which may facilitate or even elevate their expression after IFN stimulation. On the other hand, I think these properties may also be leveraged by viruses to escape or evolve away from the IFN-mediated antiviral response. This study is relevant during the ongoing pandemic: this bioinformatics tool can help design better drug targets and may indirectly aid in developing novel antiviral compounds. I recommend this work for publication without any changes.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.10.08.463622v1
www.biorxiv.org www.biorxiv.org

Whole-genome sequencing of Chinese native goat offers biological insights into cashmere fiber formation

1
1. GigaScience 24 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer 1: Mahesh Neupane
  
  Nicely written paper on selection signatures for Cashmere goats with detailed analysis and possible deletion. Here are some of the suggestions to improve the paper:
  
  How was the optimal value of K determined in ADMIXTURE? Please review the formatting of the manuscript; for example, the figures on page 5 and page 6 have some formatting errors. Was the sample size sufficient for all the comparisons? What was the power of the study design? How are the results from mouse and human cell lines justified for comparison with goat? Very good job of supplying all the code used in the programs. Perhaps this code and these parameters could be combined as supplemental material or in a GitHub repository.
  
  Reviewer 2: Yixue Li
  
  The authors raise an interesting question, hoping to discover the genetic mechanisms associated with cashmere traits for breed improvement. The authors sequenced 120 native Chinese goats, including 2 cashmere goat breeds and 6 common goat breeds. Through their analysis, the authors found, and believe they confirmed, that a 582 bp deletion 367 kb upstream of LHX2 is involved in regulating cashmere yield and cashmere diameter. The results are very interesting, and if the authors convincingly answer the questions below, acceptance for publication is recommended.

  Are the 120 goats inbred, and are there enough of them to obtain a statistically significant result? What is the statistical basis?

  The article first states that 582del and 504del are both correlated with cashmere yield, but that only 582del is also significantly correlated with fiber diameter, while 504del has no significant correlation with fiber diameter. It then adds that the interaction effect between 582del and 504del was significantly correlated with cashmere fiber diameter, indicating an interaction between the two genes. What is the mechanism of this interaction? How they are significantly related to cashmere fiber diameter needs further elaboration.

  On the one hand, the text mentions that the deletion sequence 582del in the upstream region of the LHX2 gene may act as an insulator, preventing the function of the LHX2 enhancer. Later, it is mentioned that the deletion of the LHX2 insulator 582del increases the expression of LHX2 and promotes the growth of cashmere fibers during the growth period. There seems to be a contradiction here; please explain.

  To confirm the insulator function of the 582del sequence, the authors synthesized a 551 bp DNA fragment and inserted it into the pGL3 plasmid downstream and upstream of the SV40 promoter, then co-transfected human 293T cells and mouse 3T3 cells, and thus confirmed the insulator function of the 582del sequence. Here: (1) the two sequences are not identical in length and identity; (2) given that mouse and human cell lines were used, how do we conclude that the 582del sequence will have the same function in goat? What is the experimental logic and biological logic here? Hopefully a convincing explanation can be given.
  
  Reviewer 3: Yu Jiang
  
  This manuscript resequenced 42 cashmere goats and 78 ordinary goats and performed genome-wide selective sweep analyses; a 582 bp deletion in the thirteenth intron of DENND1A, upstream of LHX2, was then found to increase cashmere yield. This discovery provides resources for the development of the wool industry and the enrichment of animal genetic resources. However, the description in some parts of the manuscript is very rough, and many attachments are missing.

  Major concerns:
  1. In the result "Plausible Causative Mutation near LHX2", you selected a lot of cashmere and fiber diameter data for association analysis and obtained Fig 5e, but your results may have a high rate of false positives. The authors should demonstrate that the deletion is significantly associated with the phenotype after excluding factors such as gender, age and so on.
  2. Fig. 2a shows that MT and JNG contain a large proportion of cashmere goat ancestry. When they are selected and analyzed together with cashmere goat samples, the results may be affected by this mixing. The authors should take the particularity of these samples into account when using them in subsequent analyses.
  3. In Fig 3, are the strongly selected loci linked to surrounding loci? Do these sites have an effect on gene expression?
  4. Fig 3c & d: the horizontal and vertical coordinates of your two graphs are the same, but the trends of the graphs are different. Please state clearly what the two graphs describe in the legend and in the paper.

  Minor concerns:
  5. The abstract states, "Luciferase assay shows that the deletion, which acts as an insulator, restrains the expression of LHX2 by interfering its upstream enhancers", but the results state, "Therefore, the deletion of the LHX2 insulator increases the expression of LHX2 and promotes cashmere fiber growth at the anagen stage, while deletion of the FGF5 enhancer reduces the expression of FGF5, inhibiting the regression". The conclusions are inconsistent; please clarify the logic of the paper and draw the correct conclusion.
  6. "Luciferase assay shows that the deletion, which acts as an insulator, restrains the expression of LHX2 by interfering its upstream enhancers. Our study discovers a novel insulator of the LHX2 involved in regulating cashmere production and diameter." These two sentences are easy to misread: they suggest that two insulators were found at LHX2, one suppressing expression and one regulating cashmere production and diameter. It is recommended to modify them.
  7. There are inconsistencies in the sample names in the paper. For example, you use IRWG at the start of a sentence and IRW later on; please unify the name. The paper also contains many spelling and symbol errors, such as "mddle", "goatswith", symbol repetition and so on.
  (1) In the STRUCTURE analysis, When K = 4, we observed five separate clusters: IRWG and ANG in west Asia, YNBB, GZB, JTB and CDB in southwest China; cashmere goat in north China; MT and JNG in mddle east China; and Korean goats in south Korea. At K = 6, goats in the southwest China further split into two geographic subgroups: the Yunnan-Kweichow Plateau group including YNBB and GZB goats, and the Chengdu Plain group including CDM and JTB goats. Two west Asian goats (IRW and ANG) were also separated .
  (2) More interestingly, we found that the 582del has a high frequency in the IRWG population (80.9 %), while the 504del was absent, which is also consistent with previous research.
  (3) 10 JTB from Jintang County of Sichuan Province; 12 CDB from Chengdu City of Sichuan Province".."
  (4) To further evaluate whether these two deletion variants were related to cashmere traits, we selected 235 CDMC goatswith cashmere yield (Supplementary Fig. 22, Supplementary Table s8) and 581 CDMC goats with fiber diameter records (Supplementary Fig. 23, Supplementary Table s9) for association analysis.
  (5) Fig 2b, "KOG".
  8. "We inspected all variants within exons to identify the potential causal mutation around the DENND1A-LHX2 locus; however, no coding variants were found." Please put the information on the relevant sites in the attachments; a single sentence alone makes the paper unconvincing.
  9. "Analysis of the 582del deletion region using the BLAST program revealed that it is not a highly conserved element but was found in the genomes of primate and ungulate species." "582del deletion" is a repetition.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.11.06.467539v1
www.biorxiv.org www.biorxiv.org

TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles

3
1. GigaScience 05 Mar 2023
  
  in GigaScience
  
  taxonomic
  
  Reviewer name: Francesco Asnicar (revision 1)
  
  This reviewer thanks the authors for their revision. However, the quality of the figures and the main goal the authors would like to reach with the tool named TAMPA is no [...] The main goal of TAMPA is to allow comparison of taxonomic profiling tools, but it is evident from the supplementary figures that the software cannot support such a comparison when the taxonomic tree is large enough, as the circles added to the branches become unreadable. I believe this is a major flaw in a tool that aims to do specifically that, and for such cases a smarter way of comparing taxonomic profilers should be found. For instance, a legend should be added to each figure created by TAMPA to make immediately clear what the colors represent. Also, for taxonomic trees where the visualization fails to allow comparison of the taxonomic profilers, different and complementary data should be provided, for instance a table listing all branches and the numbers that the depicted circles represent. In addition, such a table should make it possible to overcome the limitation of just 3 tools allowed in the comparison.
2. GigaScience 05 Mar 2023
  
  in GigaScience
  
  computational
  
  Reviewer name: Alessio Milanese (revision 1)
  
  Many thanks to the authors for their detailed responses to my comments. The edits have improved the manuscript and I have only a few minor comments.

  COMMENT 1: In Figure 4b I can see that "Tenericutes" and "Planctomycetes" are both in orange, meaning that they both have been measured only by mOTUs. But in the main text I read "mOTUs failed to detect the Tenericutes group, while MetaPhlAn failed to detect Planctomycetes", which is wrong.

  COMMENT 2: I would improve the figure legends. In particular, the description of 4b is the same as in 2a and 3a and 1: "The size of the discs represents the total amount of relative abundance at the corresponding clade in the ground truth, or the tool prediction if that clade is not in the ground truth. If the tool predictions agree, a disc is colored half orange and half teal. The proportion of teal to orange changes with respect to the disagreement in the prediction of that clade's relative abundance between the two tools being compared. Highlighted blue text represents clades where the difference between the relative abundances of the prediction and ground truth exceeds 30%". I would suggest having this description only for figure 1, and then a shorter description for the following figures.

  COMMENT 3: The second color is described sometimes as "green" and sometimes as "teal". For clarity, I would suggest using just one of the two.
3. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Metagenomic
  
  Reviewer name: Francesco Asnicar
  
  The manuscript by Sarwal et al. presents a novel tool, named TAMPA, for standardized visualization of the output of metagenomic taxonomic profiling tools. It also enables a more general assessment of the performance of taxonomic profilers by providing an extensive set of different metrics.

  It would be interesting to see (if possible) the comparison of three (or more) taxonomic profiles at the same time. The evaluations shown are always binary, but in a real-case scenario where a user would like to evaluate 3 or 4 different taxonomic profiling tools on their community, it would be great to be able to do it.

  Other than the evaluation of the agreement between two (or more) taxonomic profiling tools, it is not clear how TAMPA can drive improvement on biologically relevant questions. Although it is clear, as the authors state in the introduction, that different taxonomic profilers (with different parameter settings) can produce very different taxonomic representations, to support this statement it will be important to show at least one case where TAMPA can suggest a different taxonomic interpretation of a microbial community that is also biologically relevant.

  Figures in general appear to be of low quality and stretched; please consider improving them, as they are the main point of TAMPA.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.04.28.489926v1
www.biorxiv.org www.biorxiv.org

Contrast Subgraphs Allow Comparing Homogeneous and Heterogeneous Networks Derived from Omics Data

5
1. GigaScience 05 Mar 2023
  
  in GigaScience
  
  identify
  
  Reviewer name: Raul Guantes (Revision 1)
  
  In the revised version and the response letter, the authors have clarified all the questions and addressed the comments raised in my previous report, and I think the manuscript is now suitable for publica
2. GigaScience 05 Mar 2023
  
  in GigaScience
  
  techniques
  
  Reviewer name: De-Shuang Huang (Revision 1)
  
  I think the paper can be accepted.
3. GigaScience 05 Mar 2023
  
  in GigaScience
  
  entities
  
  Reviewer name: Thomas Schlitt
  
  The manuscript "contrast subgraphs allow comparing homogeneous and heterogeneous networks derived from omics data" introduces and illustrates the application of contrast subgraph analysis to gene expression, protein expression and protein-protein interaction data. The method can be applied to weighted networks. The authors give a good description of the method and the context of other available methods.

  The authors apply the contrast subgraph analysis to three different omics data sets. Overall these analyses are not very detailed and do not yield surprising results, but they provide a nice illustration of the potential usefulness of contrast subgraph analysis in the context of omics data. In my opinion this is really where the merit of the paper lies: in promoting the method and making it accessible to a wider audience of researchers in the field of bioinformatics/molecular biology. The authors have also applied their method to brain-imaging-derived networks, but that work is not part of this publication.

  The contrast subgraph analysis is particularly interesting for data that is collected under different conditions but for the same set of nodes (i.e. genes, proteins, ...), i.e. where the nodes present do not change (much), but their interaction strengths differ between conditions. It remains to be seen where this method can deliver unique value that is not achievable by other means, but the approach is very intuitive. Its rationale can be readily understood, reducing the temptation to use it as a "black box" without critically questioning the results, as might be the case for more complex methods. One of the downsides of the presented approach is that it does not provide any measure of confidence in the results: while there is a parameter >alpha< that allows some tuning, little information is given on how to choose a suitable value for this parameter (which obviously depends on the data).

  Another issue that is treated a little too briefly is how to derive graph representations from experimental omics data in the first place. Usually these methods do not yield yes/no answers; rather, we obtain a matrix of pairwise measurements (e.g. correlation of coexpression), and a threshold is applied to these numbers to decide whether an edge is present or not. Various methods have been proposed to choose thresholds, but in the end, moving from a full matrix to a graph representation means losing some information. It would be interesting to see a deeper analysis of how much this thresholding influences the outcomes of the proposed method; this question is obviously linked to obtaining some confidence information on the results.

  Overall, the method described here is very interesting; it shares downsides with other graph-based methods (thresholding); the biological examples given are brief but illustrative of the use of the method; and the manuscript is very readable. The manuscript stimulates the reader to add this method to their own toolbox and to apply it to interesting data sets to see if it yields results that were not obvious from other approaches.

  Minor comments:
  - Figure captions, esp. 1-3: please provide more information in the figure captions to make the figures "readable" on their own, without a need for the reader to refer back to the text. The captions for Fig 1-3 are almost identical, yet very different data is shown, a clear indication that important information is missing from the captions, such as what the underlying data is. Please explain all terms used in a figure in its caption: here, what is "GeneRatio"? Figs A/B: what is the x-axis showing for the violin plots?
  - Figure 3c and the paragraph on Protein vs mRNA coexpression (p2-5): are the differences really that striking? In 3C, the box plots do not look that different; the super low p-values are probably due to the very large number of data points, but I am not sure this is really that meaningful here (effect size?).
  - Figure 4 is too small; nodes are barely visible and colours cannot be distinguished.
  - Algorithm 1 and its description in the text: I would probably move the description of the algorithm from the text to a "figure caption" for the algorithm box, to make it easier for the reader to find the definitions of the terms.
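  The question of how much thresholding influences the outcome can be probed with a simple sensitivity check. The sketch below is an illustration under stated assumptions, not part of the reviewed method: it takes hypothetical pairwise scores (e.g. co-expression correlations), builds an edge set at two cutoffs, and reports how much the two edge sets overlap (Jaccard index). The names `threshold_graph` and `jaccard` and the toy values are invented for the example.

```python
def threshold_graph(weights, cutoff):
    """Keep an undirected edge when the absolute pairwise score reaches the cutoff."""
    return {pair for pair, w in weights.items() if abs(w) >= cutoff}


def jaccard(a, b):
    """Jaccard index of two edge sets (defined as 1.0 when both are empty)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


# Hypothetical pairwise scores between three genes.
toy_weights = {("a", "b"): 0.9, ("a", "c"): 0.55, ("b", "c"): -0.3}

loose = threshold_graph(toy_weights, 0.5)   # keeps a-b and a-c
strict = threshold_graph(toy_weights, 0.8)  # keeps a-b only
print(len(loose), len(strict), jaccard(loose, strict))
```

  Repeating such a comparison over a range of cutoffs, and on the downstream subgraphs rather than only on the raw edge sets, would be one way to report how stable the results are.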
4. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Biological
  
  Reviewer name: Raul Guantes
  
  In this manuscript the authors apply the method of contrast subgraphs (developed among others by some of the authors), that identifies salient structural differences between two networks with the same nodes, to several biological co-expression and PPI networks. This method adds to the extensive toolkit of network analyses that have been used in the last two decades to extract useful biological information from omics data. In particular, the authors identify subgraphs containing maximum differences in connectivity between two networks, and basically use functional annotations to assign biological meaning to these differences. Of note, contrast subgraphs is not the only method that provides 'node identity awareness' when comparing networks. For instance, identification of network modules or community partitions are common methods to identify groups of nodes that highlight potentially relevant structural differences between two networks, and have been applied to many biological and other types of networks.I find the manuscript well motivated and clearly written in general, but lacking detailed information on part of the Methods. The discussion connecting their findings on structural differences between networks to potential biological functions is also a bit vague and could be worked out in more detail. I feel that the paper is potentially acceptable in GigaScience after a revision to provide more details on the methods and on their findings. Here are my comments:Methods:1.- Coexpression networks for luminal and basal cancer subtypes:1a.- The authors don't give enough information about the data they are using to build these networks. How many samples/points are they using to calculate correlations? Do they correspond to different patients, expression dynamics after some treatment…? Is there any preprocessing in the data (e.g. 
differential expression with respect to healthy tissue) or they just take all quantified transcripts and proteins with minimal filtering (they only specified that filter out genes with FPKM < 1 in more than 50 samples in transcriptomic data)? How many nodes and links have the final coexpression networks?.1b.- To determine links between genes/proteins they calculate Spearman rho and transform it to (0.5(1+rho)^12 to give a 'signed' network. But since Spearman correlation ranges between +1 and -1, this transformed quantity lies between 0 and 1, so I don't see the sign. Moreover, why the exponent 12 in the transformation??. Please clarify because I don't know if they are analyzing just weighted networks, unweighted networks or signed networks in the end because somehow they 'keep track' of the sign of rho. They spend some space in Methods discussing the extension of the contrast subgraph method to sign networks, but I don't know if they finally apply it, since coexpression networks built in this way and PPI networks are not signed.1c.- Do they keep all links or use some cutoff in rho by magnitude/significance? Presumably yes, because otherwise the final network would be a clique and unmanageable, but they don't give any info on that. Again, which is the final size (node/links) of the coexpression networks?1d.- As for coexpression networks based on relative abundance data as those from transcriptomic/proteomic experiments, it is well known that correlations may be misleading due to the possible large number of spurious correlations (see for instance Lovell at al., PLoS Computational Biology 11(3) (2015) e1004075). The use of correlations requires some justification, and at least to acknowledge the potential pitfalls of this measure.1e.- How many nodes/links are in the first contrast subgraphs shown in Figures 1-2? 
Is the degree calculated within the whole network or just within the extracted subgraph?1f.- Page 4, last paragraph before 'Protein vs mRNA coexpression in breast cancer' section: 'the results obtained with the two independent breast cancer cohorts show good agreement, with the top differential subgraphs significantly overlapping for both the basal-like and the luminal-A subtypes (Fisher test p < 2.2 Â· 10-16)'. I guess the overlapping is in terms of functional annotations, how is this overlapping and the corresponding statistical test calculated?.2.- Protein versus mRNA coexpression:2a.- Please provide again information about the number of samples, how the 'subset of breast cancer patients included in the TCGA' is chosen and if transcriptome and proteome are quantified in the same conditions (relevant if one is directly to compare both networks). Provide also details about the number of link/nodes of each subnetwork and corresponding subgraph. Since transcriptomic data are provided usually in FPKM and proteomic in counts (sum of normalized intensities of each ion channel), are data further normalized to facilitate their comparison?3.- PPI networks:3a.- Since they are going to compare PPIs about different 'contexts', a brief explanation about the tissue origin and peculiarities of the three cell lines investigated is in order.3b.- Please provide details about number of proteins/interactions in the contrast subgraphs obtained from the comparisons of the three cell lines. Since these subgraphs are going to be compared to RNA expression data from a different dataset, please specify if these data are obtained from the same cell lines. Why PPI data are compared only to upregulated genes? (and not to up-down regulated). Also, concerning the criterion for 'upregulation' (logFC>1), is this log base 2?. How do they quantify the overlap between proteins in PPI and upregulated genes? They just state that 'did indeed significantly overlap the corresponding up-regulated genes'. 
How much is the overlap and what does 'significantly' mean?
3c.- The discussion of the results shown in Figure 4 is not clear to me. First, the authors state 'We thus analyzed in more depth the first contrast subgraphs obtained from the comparison of the HEK293T PPI network with those obtained from the other two cell lines'. Does this mean that they analyze four subgraphs (2 for HEK vs. HUVEC and 2 for HEK vs. Jurkat)? When they say that the 'top contrasts subgraphs were identical', do they mean that the four subgraphs contained exactly the same nodes? Also, in the main text Figure 4 seems to contain the subnetwork of these subgraphs with only the nodes annotated as 'ribosome biogenesis' and 'signal transduction through p53', and the links would be the PPIs. But in the caption to Figure 4 they state that 'green edges join proteins involved in the two biological processes' (probably a subset of the PPIs). Please clarify. Why do they give only the comparison between HEK and HUVEC, and not between HEK and Jurkat, if the same nodes are present?
Interpretation of results:
1.- Coexpression networks in two cancer subtypes: they find that the subgraph with the stronger connections in the basal subtype is enriched in 'immune response' and the subgraph denser in the luminal subtype is enriched in categories related to microenvironment regulation. If they identify clearly enriched genes, they should discuss in more depth their known roles in connection to these two functions in their biological context. This would enrich and support their findings. It is tempting to speculate that, since the basal type is less aggressive, cancer cells are challenged by the immune system of the organism but, once they have developed mechanisms to evade the immune system (becoming more aggressive, as in the luminal subtype), they are committed to manipulating their microenvironment to proliferate. 
Is there any evidence for this in these subtypes of cells?
2.- Comparison of transcriptomic and proteomic networks: From their analyses in Figure 3 they claim in the Discussion that 'adaptive immune system genes are more connected at the transcriptional level, while innate immune systems are more connected at the proteomic level'. This is a rather vague statement based on the functional enrichment analysis. First, they should identify and discuss in more detail the genes/proteins responsible for this enrichment, to see if their documented function supports their speculations (and since the data they use are from breast cancer, I don't know how general this observation could be or if it is specific to this type of tumor). Moreover, caution should be exercised when interpreting these coexpression networks: the most connected transcripts are not necessarily those that are being simultaneously translated. Also, since the network is apparently not signed, the abundance of connected transcripts may be anticorrelated. Finally, Figure 3 is not clear: which panel corresponds to the transcriptomic subgraph and which one to the proteomic one? This should be specified in the caption or with titles in the panels.
Minor comments:
- The distinction between 'heterogeneous' and 'homogeneous' networks in the Introduction is a bit confusing, as they classify mRNA and protein coexpression networks as 'heterogeneous'. Why is that? Is that because they are built from many different samples/individuals or time course data?
- Although I have nothing against how the authors display differences between the first contrast subgraphs in panels A-B of Figures 1 and 2, it may be more eye-catching to display these differences as the usual boxplots or violin plots, perhaps with a test for significant differences between the means of both degree distributions.
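As a point of reference for the overlap questions raised above (points 1f and 3b), the sketch below shows one common way such a significance value can be obtained: a one-sided Fisher exact test, computed here as a hypergeometric upper tail. It is an illustration only and not necessarily the authors' procedure; the function name `overlap_pvalue`, the toy gene sets, and the size of the background universe are all assumptions.

```python
from math import comb


def overlap_pvalue(set_a, set_b, universe_size):
    """Return (overlap size, one-sided p-value) for the overlap of two sets.

    The p-value is P(X >= k) for X ~ Hypergeometric(N=universe_size,
    K=len(set_a), n=len(set_b)), which equals the one-sided Fisher exact
    test on the corresponding 2x2 contingency table.
    """
    a, b = len(set_a), len(set_b)
    k = len(set_a & set_b)
    total = comb(universe_size, b)
    # Sum the probabilities of observing an overlap of size k or larger.
    tail = sum(
        comb(a, i) * comb(universe_size - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    return k, tail / total


# Hypothetical example: two small gene sets drawn from a 20-gene background.
genes_a = {"A", "B", "C", "D"}
genes_b = {"C", "D", "E"}
k, p = overlap_pvalue(genes_a, genes_b, universe_size=20)
print(f"overlap = {k}, one-sided p = {p:.4f}")
```

The result depends strongly on the chosen background (all annotated genes, all measured genes, or all nodes in the network), which is why reporting both the overlap size and the universe alongside the p-value makes the claim checkable.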
5. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad010), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: De-Shuang Huang
  
  The authors proposed an algorithm based on contrasting subgraphs to characterize biological networks, so as to analyze the specificity and conservation between different samples. It is interesting, and I think there are some problems that need to be clarified.
  1. Subgraphs are generated by dividing the whole graph in a certain way, and the similarity and difference of the samples are described by the comparison between the subgraphs. The authors should discuss, in a non-heuristic way, the advantages of the proposed approach compared with the previous methods. Besides that, I wonder why subgraphs need to be non-overlapping.
  2. For TCGA or other databases, I think the authors should state the details of the samples, such as the number of samples, sequencing technology, batch effects, etc. In addition, the authors should describe the relationship between the subgraphs and GO modules to explain the results and draw some biological conclusions.
  3. The authors performed a similar analysis on protein networks, compared the results with RNA-seq, and drew some conclusions. I'm a little confused about whether the GO enrichment analysis of the proteomics data maps protein IDs to gene IDs. If so, the authors can easily combine transcript co-expression and protein co-expression networks through ID-to-ID mapping, and I look forward to the results of such an analysis.
  4. I would like to know how the proposed method handles heterogeneous graphs: does it treat heterogeneous graphs as homogeneous graphs to generate subgraphs? I could not figure out which dataset is the heterogeneous graph scenario.
  5. In addition to the elaboration of results such as degree and density differences between subgraphs, I would like to see the relationships between these results and the biological problems.
  6. The authors may consider citing the following articles on networks in molecular biology:
  Barabasi A L, Oltvai Z N. Network biology: understanding the cell's functional organization[J]. Nature reviews genetics, 2004, 5(2): 101-113.
  Zhang, Q., He, Y., Wang, S., Chen, Z., Guo, Z., Cui, Z., ... & Huang, D. S. (2022). Base-resolution prediction of transcription factor binding signals by a deep learning framework[J]. PLoS computational biology, 2022, 18(3): e1009941.
  Hu J X, Thomas C E, Brunak S. Network biology concepts in complex disease comorbidities[J]. Nature Reviews Genetics, 2016, 17(10): 615-629.
  Z.-H. Guo, Z.-H. You, Y.-B. Wang, D.-S. Huang, H.-C. Yi, and Z.-H. Chen, "Bioentity2vec: Attribute-and behavior-driven representation for predicting multi-type relationships between bioentities." GigaScience 9.6 (2020): giaa032.
  Z.-H. Guo, Z.-H. You, D.-S. Huang, H.-C. Yi, K. Zheng, Z.-H. Chen, Y.-B. Wang, MeSHHeading2vec: a new method for representing MeSH headings as vectors based on graph embedding algorithm[J]. Briefings in bioinformatics, 2021, 22(2): 2085-2095.
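To make the ID-to-ID mapping idea in point 3 concrete, here is a minimal sketch of how a protein co-expression network could be projected onto gene identifiers and intersected with a transcript co-expression network. Everything in it is hypothetical: the function name `shared_edges`, the edge lists, and the `protein_to_gene` dictionary are placeholders for illustration and are not taken from the manuscript.

```python
def shared_edges(transcript_edges, protein_edges, protein_to_gene):
    """Return undirected edges present in both networks at the gene level.

    transcript_edges: iterable of (gene, gene) pairs
    protein_edges:    iterable of (protein, protein) pairs
    protein_to_gene:  dict mapping protein IDs to gene IDs (assumed one-to-one here)
    """
    gene_level = set()
    for p, q in protein_edges:
        # Skip proteins without a known gene mapping.
        if p in protein_to_gene and q in protein_to_gene:
            g, h = protein_to_gene[p], protein_to_gene[q]
            if g != h:  # ignore self-loops created by the mapping
                gene_level.add(frozenset((g, h)))
    # frozenset makes (A, B) and (B, A) the same undirected edge.
    transcript = {frozenset((u, v)) for u, v in transcript_edges if u != v}
    return transcript & gene_level


# Hypothetical toy input.
transcript_net = [("A", "B"), ("B", "C")]
protein_net = [("P1", "P2"), ("P2", "P9")]
mapping = {"P1": "B", "P2": "A"}
print(shared_edges(transcript_net, protein_net, mapping))
```

Real mappings are often many-to-one or incomplete, so the handling of unmapped or ambiguous identifiers would need to be stated explicitly in any actual analysis.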

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.07.26.501547v1
www.biorxiv.org www.biorxiv.org

Workflow sharing with automated metadata validation and test execution to improve the reusability of published workflows

4
1. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Conclusions
  
  Reviewer names: Alban Gaignard (Report on revision 1)
  
  The revised paper would have been easier to read if the updates had been provided in a different color, but thank you for taking into account the comments and remarks, and for clearly answering the raised issues. I also appreciated the extension of the discussion. However, I still have some concerns regarding the proposed approach. The proposed platform targets both workflow sharing and testing. It is explicitly stated in the abstract: "the validation and test are based on the requirements we defined for a workflow being reusable with confidence". It is clear in the paper that tests are realized through the GitHub CI infrastructure, possibly delegated to a WES workflow execution engine. Although I inspected Figure 3 as well as the wf_params.json and wf_params.yml provided in the demo website, this doesn't seem to be enough to answer questions such as: How are tests specified? How can a user inspect what has been done during the testing process? What is evaluated by the system to assess that a test is successful? I tried to understand what was done during the testing process, but the test logs are not available anymore (Add workflow: human-reseq: fastqSE2bam · ddbj/workflow-registry@19b7516 · GitHub). Regarding the findability of the workflows, in line with FAIR principles, the discussion mentions a possible solution which would consist in hosting and curating metadata in another database. To tackle workflow discoverability between multiple systems accessible on the web, we could expect the Yevis registry to expose semantic annotations, leveraging Schema.org (or any other controlled vocabulary) for instance. This would also make sense since EDAM ontology classes are referred to in the Yevis metadata file (https://ddbj.github.io/workflow-registry-browser/#/workflows/65bc3bd4-81d1-4f2a8886-1fbe19011d81/versions/1.0.0).
2. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad006), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Kyle Hernandez
  
  Suetake et al. designed and developed a system to publish, validate, and test public workflows utilizing existing standards and integration with modern CI/CD tools. Their design wasn't myopic: they relied heavily on their own experiences, work from GA4GH, and interactions with the large workflow development communities. They were inspired by the important work from Goble et al. that applies the FAIR standards to workflows. As someone who has a long history of workflow engine development, workflow development, and workflow reusability/sharing experience, I greatly appreciate this work. There are still unsolved problems, like guidelines on how to approach writing tests for workflows, for example, but their system is one level above this and focuses on ways to automate the validation, testing, reviewing/governance, and publishing into a repository to greatly reduce unexpected errors for users. I looked through the source code of their Rust-based client, which was extremely readable and developed with industry-level standards. I followed the README to set up my own repositories, configure the keys, and deploy the services successfully on the first walk-through. That speaks to the level of skill, testing, and effort in developing this system and is great news for users interested in using it. At some level it can seem like a "proof of concept", but it is one that is also usable in production with some caveats. The concept is important, and implementing this will hopefully inspire more folks to care about this side of workflow "provenance" and reproducibility. There are so many tools out there for CI/CD that are often poorly utilized by academia, and I appreciate the authors showing how powerful they can be in this space. The current manuscript is fine and will be of great interest to a wide-ranging set of readers. I only have some non-binding suggestions/thoughts that could improve the paper for readers:
  1. Based on your survey of existing systems, could you possibly make a figure or table that showcases the features supported/not supported by these different systems, including yours?
  2. Thoughts on security/cost safeguards? Perhaps beyond the scope, but it does seem like a governing group needs to define some limits to the testing resources and be able to enforce them. If I am a bad actor and programmatically open up 1000 PRs of expensive jobs, I'm not sure what would happen. Actions and artifact storage aren't necessarily free after some limit.
  3. What is the flow for simply updating to a new version of an existing workflow? (Perhaps this could be in your docs, not necessarily this manuscript.)
  4. CWL is an example of a workflow language that developers can extend to create custom "hints" or "requirements". For example, Seven Bridges does this in CAVATICA, where a user can define AWS spot instance configs etc. WDL has properties to configure GCP images. It seems like in these cases, tests should only be defined to work when running "locally" (not with some scheduler/specific cloud environment). But the authors do mention that tests will first run locally on the user's environment, so that does kind of get around this.
  5. For the "findable" part of FAIR, how possible is it to have "tags" of some sort associated with a workflow record so things can be more findable? I imagine that when there is a large repository of many workflows, being able to easily narrow down to the specific domain of interest could be helpful.
3. GigaScience 05 Mar 2023
  
  in GigaScience
  
  Results
  
  Reviewer names: Alban Gaignard
  
  The revised paper would have been easier to read if the updates had been provided in a different color, but thank you for taking into account the comments and remarks, and for clearly answering the raised issues. I also appreciated the extension of the discussion. However, I still have some concerns regarding the proposed approach. The proposed platform targets both workflow sharing and testing. It is explicitly stated in the abstract: "the validation and test are based on the requirements we defined for a workflow being reusable with confidence". It is clear in the paper that tests are realized through the GitHub CI infrastructure, possibly delegated to a WES workflow execution engine. Although I inspected Figure 3 as well as the wf_params.json and wf_params.yml provided in the demo website, this doesn't seem to be enough to answer questions such as: How are tests specified? How can a user inspect what has been done during the testing process? What is evaluated by the system to assess that a test is successful? I tried to understand what was done during the testing process, but the test logs are not available anymore (Add workflow: human-reseq: fastqSE2bam · ddbj/workflow-registry@19b7516 · GitHub). Regarding the findability of the workflows, in line with FAIR principles, the discussion mentions a possible solution which would consist in hosting and curating metadata in another database. To tackle workflow discoverability between multiple systems accessible on the web, we could expect the Yevis registry to expose semantic annotations, leveraging Schema.org (or any other controlled vocabulary) for instance. This would also make sense since EDAM ontology classes are referred to in the Yevis metadata file (https://ddbj.github.io/workflow-registry-browser/#/workflows/65bc3bd4-81d1-4f2a8886-1fbe19011d81/versions/1.0.0).
4. GigaScience 05 Mar 2023
  
  in GigaScience
  
  analysis
  
  Reviewer name: Samuel Lampa
  
  The Yevis manuscript makes a good case for the need to be able to easily set up self-hosted workflow registries, and the work is a laudable effort. From the manuscript, the implementation decisions seem to have been made in a very thoughtful way, using standardized APIs and formats where applicable (such as WES). The manuscript itself is very well written, with a good structure, close to flawless language (see minor comment below) and clear descriptions and figures.
  
  Main concern
  
  I have one major gripe though, blocking acceptance: the choice to only support GitHub for hosting. There is a growing problem in the research world that more and more research is becoming dependent on the single commercial actor GitHub, for seemingly no other reason than convenience. Although GitHub to date can be said to have been a somewhat trustworthy player, there is no guarantee for the future, and ultimately this leaves a lot of research in an unhealthy dependence on this single platform. One small example of a recent change is the proposed removal of the promise not to track its users (see https://github.com/github/site-policy/pull/582). Such a central infrastructure component for research as a workflow registry has an enormous responsibility here, as it may greatly influence the choices of researchers in the future, by encouraging what is "easier" or more convenient to do with the tools and infrastructure available. With this in mind, I find it unacceptable for a workflow registry supporting open science and open source work to only support one commercial provider. The authors mention that technically they are able to support any vendor, and also on-premise setups, which sounds excellent. I ask the authors to kindly implement this functionality. Especially the ability to run on-premises registries is key to encouraging research to stay free and independent from commercial concerns.
  
  Minor concerns
  
  I think the manuscript is missing a citation to this key workflow review, as a recent overview of the bioinformatics workflows field, for example together with the current citation [6] in the manuscript: Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature methods, 18(10), 1161-1168. https://www.nature.com/articles/s41592-021-01254-9
  
  Although it might not have been the intention of the authors, the following sentence sounds unnecessarily subjective and appraising, without data to back this up (rather, this would be something for the users to evaluate):
  
  "The Yevis system is a great solution for research communities that aim to share their workflows and wish to establish their own registry as described." I would rather expect wording similar to: "The Yevis system provides a [well-needed] solution for ..." ... which I think might have been closer to what the authors intended as well. Wishing the authors best of luck with this promising work!

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.07.08.499265v2
www.biorxiv.org www.biorxiv.org

Chromosome-level genome and the identification of sex chromosomes in Uloborus diversus

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  The orb-web
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad002), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Jonathan Coddington
  
  This paper presents the first uloborid spider genome--and it is a chromosome level assembly. Genomes of this family are important because the orb web is supposedly independently and convergently evolved in this group. Although my expertise is not in the technology and informatics of genome sequencing, it appears to be well done.
  
  Figure 1: "A. geniculate" -- spelling; N. clavipes = T. clavipes
  Table S1: "Number of Componenet Sequences" -- typo
  Text: "single exon We found a" -- typo
  "can be ascribed by" -- can be inferred by?
  "an Araneid orb-weaver" -- araneid usually not capitalized
  "♂X1X2/♀X1X1X2X2.[48]" should be "♂X1X2/♀X1X1X2X2 [48]."
  You might want to be careful about citing Purcell & Pruitt, see https://purcelllab.ucr.edu/blog6.html and other questions about Pruitt's work.
  
  Re methods, it would be of interest to know what HMW DNA fragment sizes were (expressed as kb, or mb), although Tape Stations are not very accurate. For people who collect spiders with the intent to yield HMW DNA, such data are important. Data are scarce, so any facts are significant.
  
  Any homologs of the Pyriform spidroin (PySp) in Acanthoscurria? Piriform silk attachment points are a synapomorphy of araneomorph or "true" spiders. Liphistiomorph and mygalomorph spiders do not (cannot?) make point attachments, and the inability to make point attachments either to substrate or silk-silk point attachments probably constrains/ed the evolution of web architectures in non-araneomorph spiders. Therefore finding homologs to PySp spidroins in non-araneomorph spiders is of great interest to explain araneomorph web architecture diversity.
  
  Likewise, tubuliform spidroin (TuSp) is probably a synapomorphy of entelegyne spiders, with derived female genitalia--a "flow-through" sperm management system. Eggsacs occur widely in non-entelegyne spiders, so it is a mystery why entelegynes have specialized spigots, glands, and spidroins for the same purpose. Indeed, the particular function of tubuliform silk is not clear. Any thoughts on this? E.g.
  
  It is good to see attention paid to the mitochondrial genome, as many whole genome studies ignore it. In spiders, early work claimed that tRNA's appeared to be peculiar. Masta and Boore. 2004. The Complete Mitochondrial Genome Sequence of the Spider Habronattus oregonensis Reveals Rearranged and Extremely Truncated tRNAs. Molecular Biology and Evolution, Volume 21, Issue 5, May 2004, Pages 893-902. Any comments on U. diversus tRNAs from that point of view?
  
  Finally, any comments on evidence for or against the convergent evolution of the orb web? Homology between the pseudoflagelliform and flagelliform spidroins would be pertinent. The intro does raise expectations that some of the macro / larger evolutionary questions will be addressed in the paper, but many, see above, are only cursory or not too much. Perhaps include a sentence in intro acknowledging this, but saying that this paper intends to present the genome and address sex chromosomes, but other topics? For example the sections on some of the spidroins do not extensively discuss comparisons with other spider genomes.
  
  Reviewer 2: Hui Xiang
  
  In this study, the authors generated a large amount of genome sequencing and RNA-seq data and provided a genome assembly, built with a rather complicated merging approach, of a spider with a novel phylogenetic position. The genome undoubtedly adds novel and important resources for a deeper understanding of spider evolution. However, there are still severe issues that need to be addressed.
  1. There are huge amounts of sequencing data from different samples. However, I don't think that merging different assemblies is good for a final, high-quality genome. Given the high heterozygosity, Illumina and ONT data from different individuals are quite difficult to use for assembling a clean genome. As shown in Table 2, the assembly from the HiFi approach is not obviously inferior to the merged one, but is clearly much better at avoiding redundancy. I strongly suggest that the authors adopt the genome assembly of HiFi data from one individual, instead of merging two sets of assemblies. The Illumina and Nanopore assemblies may be helpful in fully deciphering silk proteins.
  2. The proportion of repeats is somewhat affected by the quality of the assembly. The highly heterozygous genome assembly is merged in a complicated way from diverse batches of data, so the real quality might not be as good as the authors describe. The quality of the repeats is especially hard to evaluate. Hence the statements on genome size (Line 193-200) are not convincing.
  3. About the assembly of RNA-seq data: the authors obtained huge amounts of data. However, this is not so helpful for obtaining novel transcripts if the data are saturated. More importantly, assembly of short reads is not very useful for obtaining long transcripts.
  4. As to whole genome duplication: the authors did not provide solid evidence supporting that WGD occurred in the U. diversus genome. They only demonstrated two Hox clusters therein. The synteny analysis was quite confusing and does not help to confirm WGD. They need to provide more solid genome-wide evidence, or otherwise totally downplay the statements.
  5. The identification of the sex chromosomes is still vague. The statements are not well organized, and the statements and results are vague and not convincing. "While 8 of the 10 pseudochromsomes had a median read depth of 40 ± 2, pseudochromosomes 3 and 10 were outliers, with read depths of 36 and 33, respectively." The difference in sequencing depth is rather convincing. As far as I know, the authors sequenced female and male samples. So why didn't they clearly compare the depth of the two sex chromosomes between them to provide more evidence?
  Other:
  1. The information on chromosome-level spider genomes is incomplete. As far as I know, there is a chromosome-level black widow genome. The authors need to add this one.
  2. The authors need to release the sequences of the spidroins they identified and described.
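The male-versus-female depth comparison requested in point 5 could look roughly like the sketch below. It assumes an X1X2 male / X1X1X2X2 female system (as cited in the manuscript), so that X-linked chromosomes are expected to show about half the normalized depth in males relative to females. The function name `flag_sex_linked`, the tolerance, and all depth values are hypothetical and chosen only to illustrate the calculation; they are not the authors' data.

```python
from statistics import median


def flag_sex_linked(male_depth, female_depth, tolerance=0.15):
    """Flag chromosomes whose male/female normalized depth ratio is near 0.5.

    male_depth, female_depth: dicts mapping chromosome name -> median read depth.
    Each sample is first normalized by its own median across chromosomes so
    that differences in total sequencing effort cancel out.
    """
    m_med = median(male_depth.values())
    f_med = median(female_depth.values())
    flagged = {}
    for chrom in male_depth:
        ratio = (male_depth[chrom] / m_med) / (female_depth[chrom] / f_med)
        if abs(ratio - 0.5) <= tolerance:
            flagged[chrom] = round(ratio, 2)
    return flagged


# Hypothetical depths: chromosomes 3 and 10 are hemizygous in the male sample.
chroms = [f"chr{i}" for i in range(1, 11)]
male = {c: 40.0 for c in chroms}
male["chr3"] = 20.0
male["chr10"] = 20.0
female = {c: 40.0 for c in chroms}
print(flag_sex_linked(male, female))
```

In practice the comparison would be made on single-sex libraries, since depth from pooled or mixed-sex samples falls between the autosomal and hemizygous expectations and is harder to interpret.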
  
  Reviewer 3: Zhisheng Zhang, Ph.D
  
  The manuscript GIGA-D-22-00169 presents a chromosome-level genome of the cribellate orb-weaving spider Uloborus diversus. The assembly reinforces evidence of an ancient arachnid genome duplication and identifies complete open reading frames for every class of spidroin gene. And the authors identified the two X chromosomes for U. diversus and identify candidate sex-determining genes.
  
  The methods are well suited to the aims of the study, clearly described, and well written.
  
  Minor comments:
  
  In Figure 1B, I noticed that the estimated divergence times of the Araneae are shown. I think a reference should be added, or the estimation method should be described in detail.
  
  There is something wrong with the table format, for example in Tables 1, 2, 5 and 6.
  
  Line 70: "chromosome- scale" changes to "chromosome-scale".
  
  Line 147 to lines 148: Line breaks error.
  
  Line 458: "[48]" in the wrong location.
  
  Line 511-512: In the genome of the spider Uloborus diversus, on which chromosomes are the genes "sex lethal (sxl)" and "doublesex (dsx)" located?
  
  Line 515-516: "The 534 shared sex-linked genes in these three species, 14 are predicted to be DNA/RNA-binding": do these sex-linked genes differ at the RNA level between males and females?
  
  Line 685: "Dovetail Chicago and Dovetail Hi-C Sequencing" should be bold.
  
  Line 764: "We then used the Trinity assembler43 v.2.12.0", the number 43 may be redundant.
  
  Some software tools lack an RRID, such as "BRAKER2" (line 223), "NOVOplasty" (line 245), "tRNAscan-SE" (line 790), "RepeatModeler" (line 773), "RepeatMasker" (line 774), "EMBOSS" (line 797), and so on.
  
  Lines 780 "using the BRAKER 2 pipeline" changes to "using the BRAKER2 pipeline".
  
  Line 950: "Literature Cited" changes to "References".
  
  Lines 952-953: incorrect citation. The World Spider Catalog is an online resource; the version and the date on which you accessed it should also be added, and the author's name should be changed to World Spider Catalog.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.14.495972v2
www.biorxiv.org www.biorxiv.org

A molecular phenotypic map of Malignant Pleural Mesothelioma

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  Background Malignant Pleural Mesothelioma (MPM)
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac128), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Saurabh V Laddha
  
  The authors did a fantastic job integrating MPM multi-omics datasets and created an integrative and interactive map for users to explore these datasets. MPM is a rare and understudied cancer type, so such resources are very useful to move the field forward at a molecular level. The comprehensive data are well presented, and the manuscript is well written and explains the complex genomics dataset for MPM. All the figures are well explained and very clear.
  
  Minor points:
  - The authors mentioned that an evaluation of tumor purity was done using pathological review. Did the authors use molecular data, such as genomic data, to estimate tumor purity? If yes, how was the consensus? This is a very important factor for interpreting the genomic results, as the data were sequenced at 30X.
  - Along the same lines, RNA-seq can also be used to estimate tumor purity, and it would be really helpful for users to have a clear picture of tumor purity.
  - It is not very clear from the Methods section whether the same MPM samples were used for sequencing at the DNA, RNA and DNA methylation levels. A brief explanation or table would make this much easier for users to understand.
  - The recent WHO classification divides MPM into three histopathological types. Did the authors do any unsupervised analysis of these comprehensive data to understand MPM heterogeneity or replicate the WHO classification? Or did the authors find the WHO subtypes of MPM using the molecular dataset? A brief analysis/comment on the use of histological vs. molecular classification would certainly move the MPM research field forward, as researchers have found vast differences between histological and molecular classification, and the field is moving towards more molecular-based classification in the clinic.
  
  Reviewer 2: Jeremy Warner
  
  In this paper, the authors describe a new public resource for the molecular characterization of malignant pleural mesothelioma (MPM), which they describe as the most comprehensive to date. They perform WGS, transcriptome, and methylation arrays for 120 patients with MPM sourced through the MESOMICS project and integrate this dataset with an additional several hundred patients from previously published datasets.
  
  Although I cannot independently verify their claim that this is the largest and most comprehensive dataset for MPM, it is quite impressive and expansive. The pipeline utilized is well described and the results at all stages are transparently shared for prospective users of this dataset.
  
  The description of the methods to identify and remove germline variants is interesting, although the length somewhat detracts from the main goal of the paper in describing an MPM resource. Perhaps, this part could be condensed with the technical details presented in supplement. This comment pertains to both the Point Mutations and Structural Variants sections.
  
  Additional moderate concerns:
  
  There are insufficient details provided on the clinical and epidemiological parameters. Indirectly, it would appear that sex, age class, and smoking status are the clinical parameters - but what are the age classes? Is smoking status binary ever/never, or more involved? There ought to be a data dictionary provided as a supplemental table which describes each clinical/epidemiological variable, along with the possible values that the variable can take on. It should additionally be explained why other important clinical parameters are not available - most importantly, the presence of accompanying pulmonary comorbidity such as chronic obstructive pulmonary disease (COPD) and the existence of conditions that might preclude the use of standard systemic therapies, such as renal disease precluding the use of platinum agents.
  
  Context: I would like to see more here about the role of asbestos in the etiology, including what might be known about the pathophysiology of asbestos fibers at the molecular level. Also, there is nothing here about the evolution of treatment for MPM; the latest "state-of-the-art" regimens (platinum doublet + bevacizumab [MAPS; NCT00651456] and dual checkpoint inhibition [Checkmate 743; NCT02899299]) have reported median survival in the 18-month range, which is distinctly better than the median survivals quoted by the authors. Finally, I would like to see one or more direct references to the clinical trials where molecular heterogeneity has "fueled the implementation of drug trials for more tailored MPM treatments".
  
  Data Description: All specimens in the MESOMICS study are said to be collected from surgically resected MPM; this also appears to be the case for the integrated multi-omic studies from Bueno et al. and Hmeljak et al. and this should be explicitly indicated. Somewhere, it should also be explicitly discussed that this is an important limitation in the future utility of this data - surgical specimens are convenience samples and while they do provide important information, they lack treatment exposure. Given that many if not most patients with MPM will survive to 2nd or 3rd line systemic therapy, and that 1st line is fairly standardized, a knowledge of induced mutations is going to be essential to the ultimate goal of precision medicine.
  
  Minor concerns:
  
  The labels in the figures (e.g., Figure 2 - "Unmapped..too.short") are human-readable but could still be presented in a more friendly fashion. All acronyms should be defined.
  
  Reviewer 3: Mary Ann Tuli
  
  I have been asked to review the process of accessing the controlled data cited in this study to ensure that the process is clear and smooth. The study is available from the European Genome-phenome Archive (EGA) under accession number EGAS00001004812 (https://ega-archive.org/studies/EGAS00001004812). The paper is clear about how to obtain the DAA.
  
  The study has three datasets.
  
  I can confirm that the author was very prompt in his response to me requesting the DAC, in providing the DAA and in responding to the queries I had when completing the DAA. The completed DAA was sent to the EGA by the author on 29-Jul, and EGA responded within 3 working days, stating access had been granted. This is an excellent response time, so I conclude that the process of obtaining the DAA and the EGA making the data available to the user is very good.
  
  Today (1-Sep) I have attempted to gain access to the data via EGA. I was easily able to log in to my EGA account and see that the datasets are available to me to download. Users need to download data using the EGA download client - pyEGA3. EGA provides a video on how to install the client, but I hit a problem and required technical support.
  
  I emailed the EGA help desk but have not had a response yet. I was quite surprised to receive a response from the author and have learnt that EGA include the owner of the study in RT tickets so they see any communication. I commend the author for his prompt response to my ticket (though it didn't solve my problem).
  
  I cannot hold on to this review for any longer, and I am not yet in a position to comment on the nature of the data held within this study.
  
  I do have concerns that the process of accessing controlled data held in the EGA is not straightforward. Users need to watch a 12 minute video to learn how to install the download client (and may need to install programs on their computer). There is a FAQ which is very technical. This is not an issue for the author to resolve though.
  
  I understand the author has some minor revisions to make, so hopefully I should have a response from the EGA help desk before a final decision needs to be made (?).

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.07.06.499003v1
www.biorxiv.org

scShapes: A statistical framework for identifying distribution shapes in single-cell RNA-sequencing data

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac126), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Shiping Liu
  
  How to model the statistical distribution of gene expression is a basic question in single-cell sequencing data mining. Dharmaratne and colleagues looked in detail at the distribution of every gene. Using generalized linear models (GLMs), the authors present a new program, scShapes, which matches each gene to one of four distribution shapes: Poisson, Negative Binomial (NB), Zero-inflated Poisson (ZIP), and Zero-inflated Negative Binomial (ZINB). As the authors show in this manuscript, not all genes fit a single distribution, whether NB or Poisson, and some genes are better fit by the zero-inflated models because of the high drop-out rate in modern single-cell sequencing, e.g., 3' tag sequencing. Employing GLMs in single-cell data mining has become popular recently, but it has received both praise and criticism. Modeling a specific distribution for each individual gene is therefore a good step forward. The downside is the computing cost, especially as the number of cells sequenced reaches millions in current research, and datasets are expected to grow even larger in the future. This creates a great obstacle to applying the method presented by the authors. How can the calculation be sped up using the mixed model or scShapes? The authors also ran scShapes on several datasets, including the metformin, human T cell, and PBMC datasets. They found some potential genes that changed distribution shape but were not easily identified by other methods. This demonstrates that scShapes can identify subtle changes in gene expression.
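
The idea of matching a gene to a distribution shape can be illustrated with a minimal sketch. It covers only two of the four shapes (Poisson and ZIP), fits intercept-only models without covariates, and assumes BIC as the selection criterion; these simplifications and all function names are assumptions made for illustration, not the scShapes implementation.

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of counts under Poisson(lam)."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)

def fit_poisson(counts):
    """Intercept-only Poisson fit: the MLE of the rate is the sample mean."""
    lam = sum(counts) / len(counts)
    return lam, poisson_loglik(counts, lam)

def fit_zip(counts, n_iter=500):
    """Intercept-only zero-inflated Poisson fit by EM.

    pi is the probability of a structural (excess) zero, lam the Poisson rate.
    """
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi, lam = 0.5 * n_zero / n, total / max(n - n_zero, 1)
    for _ in range(n_iter):
        # E-step: posterior probability that an observed zero is structural.
        z = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson rate.
        pi = n_zero * z / n
        lam = total / (n - n_zero * z)
    p_zero = pi + (1.0 - pi) * math.exp(-lam)
    ll = n_zero * math.log(p_zero)
    ll += sum(math.log(1.0 - pi) + y * math.log(lam) - lam - math.lgamma(y + 1)
              for y in counts if y > 0)
    return (pi, lam), ll

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * loglik

def best_shape(counts):
    """Return the lower-BIC label among 'Poisson' and 'ZIP' for one gene."""
    if not any(counts):
        raise ValueError("need at least one non-zero count")
    n = len(counts)
    _, ll_pois = fit_poisson(counts)
    _, ll_zip = fit_zip(counts)
    scores = {"Poisson": bic(ll_pois, 1, n), "ZIP": bic(ll_zip, 2, n)}
    return min(scores, key=scores.get)

# Hypothetical gene with many excess zeros among otherwise moderate counts.
zero_heavy = [0] * 60 + [4, 5, 6, 5, 4, 6, 5, 5, 7, 3] * 4
print(best_shape(zero_heavy))
```

Because each gene is fitted independently, a loop like this parallelizes trivially across genes, which is one possible route to the speed-up asked about above; the per-gene cost still grows with the number of cells.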
  
  Major points: (1) We did not see any details about the metformin dataset: the sequencing depth and quality, the number of genes/UMIs per cell, and so on. This makes it hard to evaluate the quality and reliability of the results generated by scShapes. If this dataset belongs to another manuscript and cannot be presented at the same time, I suggest the authors run scShapes on an alternative dataset, as many published single-cell datasets could be used in this study.
  
  (2) Even though the authors take cell type into account in the GLM, I wonder whether, for a specific gene, the distribution shape changes between cell types. If so, things become more complex: the distribution shape would need to be modeled for each individual gene in every cell type separately.
  
  (3) When identifying differential gene expression with scShapes, the authors did not consider the influence of differing cell numbers, or differing proportions of cells, across cell types. A possible way to evaluate or eliminate this bias is to down-sample from a big dataset, instead of just simulating totals of 2k ~ 5k cells from the PBMC data, so as to evaluate the influence of both the total number of cells and the cell-type proportions.
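
The down-sampling suggested in point (3) could be sketched as follows. This is a minimal illustration that assumes cells are available as lists of barcodes grouped by cell type; the function name, arguments, and barcodes are hypothetical.

```python
import random

def downsample(cells_by_type, total, proportions, seed=0):
    """Draw `total` cells without replacement, split across cell types
    according to `proportions` (a mapping of cell type to fraction)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = {}
    for cell_type, fraction in proportions.items():
        n = round(total * fraction)
        pool = cells_by_type[cell_type]
        if n > len(pool):
            raise ValueError(f"not enough {cell_type} cells: need {n}, have {len(pool)}")
        sampled[cell_type] = rng.sample(pool, n)
    return sampled

# Hypothetical barcodes for two cell types.
cells = {
    "T cell": [f"T{i}" for i in range(1000)],
    "B cell": [f"B{i}" for i in range(500)],
}
# Vary the total and the proportions independently to probe each effect.
balanced = downsample(cells, total=200, proportions={"T cell": 0.5, "B cell": 0.5})
skewed = downsample(cells, total=200, proportions={"T cell": 0.9, "B cell": 0.1})
print({k: len(v) for k, v in skewed.items()})
```

Holding the total fixed while changing the proportions, and then the reverse, separates the two sources of bias named in the comment.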
  
  (4) The authors should present comparative results on the computational cost of the different methods, i.e., the accuracy, time and memory consumption for different numbers of cells. I suggest the authors use a much larger dataset, because current single-cell research may include millions of cells, and the ability to process big data is very important if the method is to become widely used.
  
  Minor points: (1) No figure legends for Fig.2 c and d.
  
  (2) It is unclear whether 30% of all genes undergo a shape change, or just 30% of the genes remaining after the pipeline. Please clarify the details.
  
  Reviewer 2: Yuchen Yang
  
  In this manuscript, the authors present a novel statistical framework, scShapes, which uses a GLM approach to identify differential distributions of genes across scRNA-seq data from different conditions. scShapes quantifies gene-specific cell-to-cell variability by testing for differences in the expression distribution. scShapes was shown to be able to identify biologically relevant switches in gene distribution shapes between different conditions. However, there are still several concerns that need to be addressed.
  
  In this study, the authors compared scShapes to scDD and edgeR. However, besides these two, there are many other methods for calling DEGs from scRNA-seq data. Wang et al. (2019) systematically evaluated the performance of eight methods specifically designed for scRNA-seq data (SCDE, MAST, scDD, D3E, Monocle2, SINCERA, DEsingle, and SigEMD) and two methods for bulk RNA-seq (edgeR and DESeq2). Thus, it would also be worthwhile to compare scShapes to other methods, such as SigEMD, DEsingle and DESeq2, which were reported to perform better than scDD or edgeR.
  
  When scShapes was compared to scDD, the authors mainly focused on the distribution shifting. However, for users, it would be better to present a Venn diagram showing the number of genes detected by both scShapes and scDD, and the genes specifically identified by scShapes and by scDD, respectively. In addition, the authors showed the functional enrichment results for DEGs identified by scShapes. It would also be worthwhile to perform enrichment analysis for the genes detected by both scShapes and scDD, or specifically identified by scShapes or scDD.
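
The three regions of the requested Venn diagram reduce to set operations on the two gene lists. A minimal sketch, assuming each tool's output has already been reduced to a list of gene symbols; apart from RXRA, which is mentioned in this report, the symbols below are placeholders.

```python
def venn_counts(genes_a, genes_b):
    """Split two gene lists into the three regions of a two-set Venn diagram."""
    a, b = set(genes_a), set(genes_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}

# Placeholder results: list A stands for scShapes, list B for scDD.
scshapes_hits = ["RXRA", "GENE1", "GENE2", "GENE3"]
scdd_hits = ["GENE2", "GENE3", "GENE4"]

regions = venn_counts(scshapes_hits, scdd_hits)
for name, genes in regions.items():
    print(name, len(genes), sorted(genes))
```

Each of the three gene sets can then be passed separately to an enrichment analysis, as the comment suggests.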
  
  Since scShapes detects differential gene distributions between different conditions, it would be better to show users how to interpret the significant results biologically. For example, the authors mentioned that RXRA is differentially distributed between Old and Young and between Old and Treated, so what do these results mean? Can this differential distribution be associated with differential expression?
  
  In the Discussion, the authors mentioned that scRATE is another tool that can model droplet-based scRNA-seq data. It would be clearer to discuss why the authors developed their own algorithm rather than using scRATE to model the distribution.
  
  In Introduction, authors talked about the zero counts in scRNA-seq data, and presented evidence in Results part. Since 2020, there are several publications also focusing on this issue, such as Svensson, 2020 and Cao 2021. These discussions should be included in this manuscript.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.02.13.480299v1
www.biorxiv.org

xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  Motivation
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac125), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Ruibang Luo
  
  In this paper, the authors proposed xAtlas, an open-source NGS variant caller. xAtlas is a fast and lightweight caller with performance comparable to other benchmarked callers. The benchmark comparison on multiple popular short-read platforms (Illumina HiSeq X and NovaSeq) demonstrated xAtlas's capacity to identify small variants rapidly with desirable performance. Although xAtlas is limited in its ability to call multi-allelic variants, the high sensitivity (~99.75% recall for ~60x benchmarking datasets) and desirable runtime (<2 hours) enable xAtlas to rapidly filter candidates and to be considered as an important quality control step for further utilization.
  
  The authors presented a detailed explanation of xAtlas's workflow and design decisions and have done complete benchmarking experiments, but there are still some points the authors need to discuss further, listed as follows:
  
  The authors reported the performance in multiple coverages of the HG001 sample and the benchmarking result of HG002-4 samples by measuring the concordance with the GIAB truth set (v3.3.2). I noticed that GIAB had updated the GIAB truth sets from v3.3.2 to v4.2.1 for the Ashkenazi trio. The updated version included more difficult regions like segmental duplications and the Major Histocompatibility Complex (MHC) to identify previously unknown clinically relevant variants. Therefore, it would be helpful if the author could give a performance evaluation using the updated truth sets to give a more comprehensive performance of the proposed caller.
  
  In the Methods section, the authors stated the three main stages of the xAtlas variant calling process: read preprocessing, candidate identification, and candidate evaluation. The authors fed hand-crafted features (base quality, coverage, reference and alternative allele support, etc.) into a logistic regression model to classify true variants and reference calls in the candidate evaluation stage. But in Figure 1, the main workflow of xAtlas, only model scoring was shown, and the evaluation details were not visible. It would be useful if the authors could enrich Figure 1 with more details to ensure consistency with the Methods and facilitate reader understanding.
  
  In Figure 2, the authors reported the xAtlas performance comparison with other variant callers on the HG001 dataset. I noticed that the x-axis was F1-score while the y-axis was true positives per second. The trends of the two metrics seem unrelated, which might confuse readers. We suggest the authors make separate comparisons for the two metrics (for instance, plot Precision-Recall curves for F1-score measurement and a runtime comparison of the various variant callers for speed benchmarking).
  
  Zheng, Zhenxian on behalf of the primary reviewer
  
  Reviewer 2: Jorge Duitama
  
  The manuscript describes a variant caller called xAtlas, which uses a logistic regression model to call SNPs after building an alignment and pileup of the reads. The manuscript is clear. The software is built with the aim of being faster than other solutions. However, I have some concerns relative to the method and the manuscript.
  
  Unfortunately, the biggest issue with this work is that the gain in speed is obtained with an important sacrifice in accuracy, especially for calling indels. I ran xAtlas with two different benchmark datasets and the accuracy, especially for indels and other complex regions, was about 20% lower compared to other solutions. Although the difference was smaller, xAtlas is also less accurate than other software tools for SNV calling. It is well known that even a simple SNV caller can achieve high sensitivity and specificity (see results from https://doi.org/10.1101/gr.107524.110). However, several SNV errors can be generated by incorrect alignment of reads around indels and other complex regions. For that reason most of the work on variant detection is focused on mechanisms to perform indel realignment or de-novo miniassembly to increase the accuracy of both SNV and indel detection. The Strelka paper is a great example of this (https://doi.org/10.1038/s41592-018-0051-x). The manuscript does not mention whether any procedure has been implemented to realign reads or to otherwise increase the accuracy of indel calling. This is critical if xAtlas is meant to be used in clinical settings.
  
  The manuscript looks outdated in terms of evaluation datasets, metrics and available tools. Since high values of standard precision and sensitivity are easy to achieve with simple SNV callers, metrics such as the false positives per million basepairs (FPPM) proposed by the developers of the synthetic diploid benchmark dataset should be used to achieve a clearer assessment of the accuracy of the different methods (https://doi.org/10.1038/s41592-018-0054-7). Regarding benchmark experiments, Syndip should also be used for benchmarking. To actually support that xAtlas is reliable across heterogeneous datasets (as stated in the title), further datasets should be tested, as has been done for software tools such as NGSEP (https://doi.org/10.1093/bioinformatics/btz275). In terms of tools, both DeepVariant and NGSEP should be included in the comparisons.
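
The metrics discussed here can be computed from the same benchmark counts. The sketch below is an illustration only: it takes FPPM as false positives divided by the size of the evaluated region in millions of basepairs, and the function name and example counts are hypothetical rather than results for any caller.

```python
def variant_metrics(tp, fp, fn, evaluated_bp):
    """Precision, sensitivity (recall), F1 and false positives per million bp.

    tp/fp/fn are call counts against a truth set; evaluated_bp is the size of
    the region in which calls were assessed. Assumes tp + fp > 0 and tp + fn > 0.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    fppm = fp / (evaluated_bp / 1_000_000)
    return {"precision": precision, "sensitivity": sensitivity, "f1": f1, "fppm": fppm}

# Hypothetical counts, reported separately for SNVs and indels.
snv = variant_metrics(tp=990, fp=10, fn=10, evaluated_bp=2_500_000_000)
indel = variant_metrics(tp=80, fp=20, fn=20, evaluated_bp=2_500_000_000)
print("SNV  ", snv)
print("indel", indel)
```

Reporting the two variant classes side by side makes an indel accuracy gap visible even when SNV precision and sensitivity both look high.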
  
  Regarding the metrics proposed by the authors, I do not think it is a good practice to merge results on accuracy and efficiency, taking into account that the accuracy in this case is lower than other solutions, and for clinical settings that is an important issue. The supplementary table should also report sensitivity and precision for indels, not only for SNVs.
  
  The SNV calling method, and particularly the genotyping procedure, should be described in much better detail. The manuscript describes the general pileup process, then mentions some general filters for read alignments, and then mentions that it applies logistic regression. However, it is not clear which data are used for this regression or, in general, how allele counts and quality scores are taken into account. A much deeper description of the logistic regression model should be included in the manuscript.
  
  There are better methods than PCA to show clustering of the 1000g samples. A structure analysis is more suitable for population genomics data and it is more clear to show the different subpopulations.
  
  Finally, about the software, genotype calls produced by xAtlas should have a value for the genotype quality (GQ) format field to assess the genotyping accuracy. For single sample analysis the QUAL value can be used (although this is not entirely correct). However, for population VCFs, the GQ field is very important to have a measure of genotyping quality per datapoint. Regarding population VCF files, it is not clear, either from the in-line help or from the GitHub site, how they should be constructed.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/295071v1
www.biorxiv.org

Near-chromosomal de novo assembly of Bengal tiger genome reveals genetic hallmarks of apex-predation

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  The tiger
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac112), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Jong Hwa Bhak
  
  This manuscript is about assemblies of Bengal tigers. It is a great improvement over the past two tiger genome assemblies. The assembly quality is unprecedented (exceeding perhaps any feline genome in terms of contiguity).
  
  This represented a ~50x improvement in genome contiguity (see materials and methods). PanTigT.MC.v2
  
  What was the most important factor in this big jump of improvement in length?
  
  the overall contiguity was better than the domestic cat reference genome
  
  The quality comparison section is informative.
  
  We identified the "repetitive elements" in the genome by combining both
  
  ==> repeat elements is better.
  
  How close are the two genomes (MC & SI)?
  
  This reviewer finds it a great contribution to existing feline genome assemblies. The authors have done all the usual QC and constructed really high quality assemblies.
  
  Reviewer 2: Gang Li
  
  The submitted manuscript 'Near-chromosomal de novo assembly of Bengal tiger genome reveals genetic hallmarks of apex-predation' assembles high-quality, near-chromosome-level reference genomes of the Bengal tiger, which will be of great significance for the conservation and rejuvenation of tigers, and even other endangered felids. I have some comments on this manuscript: 1. Considering that the assembled genome used Hi-C technology to resolve the chromosome structure, a figure of the Hi-C results needs to be presented. In addition, the assembly of sex chromosomes always attracts attention, especially the Y chromosome of the tiger. More detailed information needs to be specified, such as the conserved Y chromosome genes compared to other mammals, or whether any tiger-specific Y-linked genes have been observed. 2. In this work, the authors used four zoo-bred individuals with known pedigree to test the inbreeding index of ROH and intended to evaluate the assembly quality. But I cannot find any information about these four individuals, and I guess they should be Bengal tigers. If that is the case, the question is that the quantity of ROH will be decided not only by the reference quality, but also by the divergence between the target resequencing data and the reference genome used. That is to say, if the resequencing data and the reference genome are both from the same tiger subspecies, Bengal tiger, the quantity of ROH is expected to be greater than in a comparison between different subspecies, so this may not be an appropriate method for evaluating assembly quality. 3. I have some advice about the evolutionary divergence calibrations. Using other species with a closer phylogenetic relationship, for instance other species of Panthera, might be better, given their similar substitution rates and generation times. 4. The format of the references section needs to be rechecked.
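
For context on comment 2, an ROH-based inbreeding index is commonly computed as the fraction of the genome covered by runs of homozygosity (F_ROH). The sketch below assumes that definition, which may differ from the one used in the manuscript; the segment coordinates, genome length, and length threshold are hypothetical.

```python
def f_roh(roh_segments, genome_length, min_length=0):
    """Fraction of the genome covered by runs of homozygosity (ROH).

    roh_segments: iterable of (start, end) coordinates in basepairs,
    assumed non-overlapping; runs shorter than min_length are ignored.
    """
    covered = sum(end - start for start, end in roh_segments
                  if end - start >= min_length)
    return covered / genome_length

# Hypothetical ROH calls for one individual on a 100 Mbp genome.
segments = [(0, 2_000_000), (5_000_000, 6_000_000), (9_000_000, 9_050_000)]
print(f_roh(segments, genome_length=100_000_000, min_length=100_000))
```

Because the ROH calls themselves depend on how reads map to the reference, the same individual can yield different values against references of different subspecies, which is the confound raised in the comment.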

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.05.14.491975v2
www.biorxiv.org

A Chromosome-level Assembly of the Japanese Eel Genome, Insights into Gene Duplication and Chromosomal Reorganization

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  Japanese eels
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac120), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Christiaan Henkel
  
  This paper describes a new chromosome-level assembly of the Japanese eel, which could finally supersede the various more fragmented assemblies. The assembly process is perhaps overly complex (many data sources and assembly steps, suppl. figure 3), but the result in general appears to be of high quality, as demonstrated by BUSCO (twice) and alignment to a closely related genome (Anguilla anguilla, suppl. figure 4). Figures 1 and 2, however, contain some inconsistencies:
  
  Figure 1: track B (nanopore coverage) shows a clear bimodal signal, with large blocks of high (double) coverage. These appear possibly correlated with areas low in gene content (track E). Are these possibly collapsed duplicate regions? That would have a strong effect on the analyses of genome duplication. Do other somewhat comparable data sources, for example PacBio CLR, show this feature?
  
  Figure 2, right panel: the new A. japonica assembly appears to have many unclustered genes (brown), similar to the fragmented draft assembly of A. rostrata and unlike the other included chromosome-level assemblies. This appears to be related to the annotation process? Or are there other problems that preclude orthology assignment for these genes? And how does A. rostrata get its gain of 11756 genes in this analysis? (By the way, line 323 has genus Anguilla as +919/-531, the figure +919/-631).
  
  Some other questions and comments I would like the authors to address:
  
  The discussion of previous and current eel sequencing efforts in the Introduction is not complete. For example, I miss the assemblies by Kai et al (2014) and Nakamura et al (2017) of the Japanese eel genome. In addition, the Introduction and Discussion (lines 415-417) present the current assembly as the first chromosome-scale Anguilla genome, which is not the case. At least two high-quality assemblies of Anguilla anguilla (European eel) are available, and should be acknowledged: one is by the Vertebrate Genome Project, and this assembly is even used in the manuscript for comparative purposes (line 199). The other has been described in a preprint (Parey et al 2022). Some of the mentioned papers include similar analyses (mostly on evolution after genome duplication and ancestral genome reconstruction, see figure 5).
  
  Kai et al (2014) A ddRAD-based genetic map and its integration with the genome assembly of Japanese eel (Anguilla japonica) provides insights into genome evolution after the teleost-specific genome duplication. BMC Genomics 15, 233. https://doi.org/10.1186/1471-2164-15-233
  
  Nakamura et al (2017) Rhodopsin gene copies in Japanese eel originated in a teleost-specific genome duplication. Zoological Lett 3, 18. https://doi.org/10.1186/s40851-017-0079-2
  
  Parey et al (2022) Genome structures resolve the early diversification of teleost fishes. BioRxiv https://doi.org/10.1101/2022.04.07.487469
  
  The different statistics listed for each alternative assembly in the Introduction make comparisons difficult.
  
  The statement in line 79, that eels as the most basal teleost group are 'close' to non-teleosts, is incorrect. They are just as close to non-teleosts as any other teleost. (The rest of the sentence, up to line 82, could use rephrasing).
  
  The statement in line 307 that 'Japanese eels are phylogenetically closer to American than European eels' contradicts the phylogeny presented (fig. 2), or is this based on some additional analysis (a density plot not shown), or even on figure 2 right panel (see comment earlier)? Even if they are incrementally 'closer' by some metric, I would not interpret this a phylogenetic distance, given the inferred divergence dates. In any case, the American eel assembly is still highly fragmented, and not the best basis for inferences which otherwise rely on chromosome-scale assemblies.
  
  Similarly, the statements on divergence between teleost groups in lines 495-500 need rephrasing. Anguilla species did not diverge from Megalops etc.
  
  Figure 2 & lines 205-213/310-313: These divergence times are calibrated using a few intervals taken from TimeTree.org (red dots). I wonder how reliable this is, as I get quite different intervals when checking now: for Anguilla-Megalops it is 162.2-197.3 (the paper has 179.3-219.3). Also TimeTree appears to have arowana (Scleropages) as the most basal branch among the teleosts, the paper has a combined Osteoglossomorpha(arowana)/Elopomorpha(eels) branch. Has the phylogenetic tree topology been inferred or imposed? Why have the specific calibration points been chosen? The early branching among teleosts (see line 310-312) is somewhat controversial, see the preprint by Parey et al.
  
  Line 346-348: This uses the eel genome size (~1 Gbp) and the further (4R) duplicated salmon genome (3 Gbp) to argue against such a further genome duplication in eels. Although I agree that the eel 4R probably did not occur, comparing genome sizes presents no evidence in this case. Genome size changes by other processes as well, and more dramatically (e.g. transposon proliferation). In addition, salmon and eel are not closely related at all. Compare this to the genomes of the (much more closely related) common carp and zebrafish, both ~1.5 Gbp: the carp genome, but not zebrafish, has experienced an additional duplication, but the zebrafish genome contains a higher transposon density.
  
  The second argument against 4R (lines 352-356, figure 4b) also does not really work. With 8 Hox clusters, the eel genome appears duplicated with respect to the gar (4 clusters), and not quadruplicated. However, with 8 clusters and 70+ genes, eels actually have more than all established 3R teleost genomes (max. 7 clusters, 42-50 genes). So the question is then whether these 8 clusters form nice 3R WGD ohnolog pairs, or if some clusters have been lost (as in nearly all other teleosts) and re-duplicated. The former hypothesis is consistent with the high level of retained WGD genes (line 369), the latter with the inferred high level of local duplication (line 363). The observation of duplicate eel Hox clusters goes back to the initial European eel genome assembly (Henkel et al 2012), but there the draft status precluded confident assignment to 3R for some clusters.
  
  The eel olfactory receptors have previously been identified using an assembled transcriptome (Churcher et al. 2015, not cited). How do the analyses of line 214-229/324-333/420-434/figure 3 compare?
  
  Churcher et al (2015) Deep sequencing of the olfactory epithelium reveals specific chemosensory receptors are expressed at sexual maturity in the European eel Anguilla anguilla. Molecular Ecology 24, 822-834. https://doi.org/10.1111/mec.13065
  
  Lines 460-467 state eels have retained duplicates of immune genes, which have been under positive selection. So how does this translate to a (very recent) negative effect on eel fitness (line 460-462)?
  
  The discussion of line 482-502 on chromosome numbers invokes ecological explanations (freshwater vs. marine habitats, 482-489), but subsequently does not translate this to the low Anguilla chromosome numbers. As these ecological factors are highly applicable to Anguillidae, this connection should be explored here - including their evolutionary history (e.g. Inoue et al, 2010, Deep-ocean origin of the freshwater eels. Biology Letters 6, https://doi.org/10.1098/rsbl.2009.0989)
  
  In this discussion: how do the numbers of line 482/3 (modal 2n 54/48 chromosomes in fish) correspond to those of line 492 (peak chromosome number n = 24/25 in extant teleosts)?
  
  The supplementary figures/tables lack legends (just mentions in the main text).
  
  Line 109: which ONT flowcell, kit, and basecaller versions have been used? In the M&M, please list software versions.
  
  Reviewer 2: Zhong Li
  
  This manuscript by WANG et al. titled "A Chromosome-level Assembly of the Japanese Eel Genome, Insights into Gene Duplication and Chromosomal Reorganization" provides a high-quality genome assembly of the Japanese eel, an economically important fish. The authors have used four kinds of sequencing technologies and assembly strategies, and provided a well-annotated genome. This genome provides useful information on the genome organization and evolution of this species, and for other fields.
  
  Overall, the manuscript is sufficiently descriptive and easy to follow. I have three major concerns:
  
  The genome annotation relies on the transcriptome. No detailed information was given in the Methods section. The analyses do not include command lines or software versions and thus are not easily repeatable. A document that includes this information should be provided as a supplementary file. The genome assembly seems not to have been released in the NCBI database (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA852364). Besides, the gene models (nucleotide, protein, and GFF files) should also be made available and included in the Data Availability section when the manuscript is accepted.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.28.497880v1
www.biorxiv.org

Accurate and Fast Clade Assignment via Deep Learning and Frequency Chaos Game Representation

1
1. GigaScience 04 Mar 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac119), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Dominik Heider
  
  The paper is well written, and the objectives are clear. The study is a very nice application of CGR in bioinformatics and shows the excellent performance of CGR-encoded data in combination with deep learning. I have a few things that should be addressed in a minor revision:
  
  1) Some very important studies have not been addressed in the related work part, e.g., in Touati et al. (pubmed:32645523) and Sengupta et al. (pubmed:32953249), the authors compared SARS-CoV2 with other coronaviruses based on CGR, or we (pubmed:34613360) used CGR in combination with deep learning for resistance predictions in E. coli.
  
  2) To me, it is unclear how accuracy was used in the model. Is it one class (i.e., clade) versus all others? If yes, accuracy might be misleading because of the high class imbalance. In such high class imbalances, MCC has been shown to be more suitable.
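
The concern in point 2 can be made concrete with a small one-vs-rest example. The confusion-matrix counts below are hypothetical and chosen only to show how accuracy and the Matthews correlation coefficient (MCC) can diverge for a rare clade.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical rare clade: 10 true members among 1000 sequences,
# of which the classifier recovers only one.
tp, fp, fn, tn = 1, 1, 9, 989
print(f"accuracy = {accuracy(tp, fp, fn, tn):.3f}")
print(f"MCC      = {mcc(tp, fp, fn, tn):.3f}")
```

With these counts accuracy is 0.99 although nine of the ten clade members are missed, whereas MCC stays low (roughly 0.22).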
  
  3) "The undersampled dataset was randomly split into train...". Why did you undersample? If it was to balance the data, that would make accuracy a sensible metric, but it discards a lot of valuable data. What about oversampling?
  
  4) Comparison with other tools: I wonder whether the good performance of your model is the result of deep learning or the CGR encoding. Please also provide the results for another ML model (besides SVM, e.g., random forests) to compare to, e.g., Covidex.
  
  Reviewer 2: Riccardo Rizzo
  
  The authors propose a classification experiment based on Frequency Chaos Game Representation and deep learning. They used the outstanding performance of a ResNet network as an image classification tool and the FCGR method, which represents a genome sequence as an image.
  
  The work seems good, although some major points should be clarified.
  
  First, whether the performance index values came from a 5-fold validation procedure (5 because they said the split was 80-10-10) or a one-shot experiment is unclear.
  
  Second, the part that involves the frequent k-mers and the SVM should be better explained. The authors should clarify what the meaning of this comparison is.
  
  Another point to clarify is the quality of the sequences used; the authors worked on complete sequences, but, as far as I know, in the real world virus sequences are noisy data, and authors should discuss this point.
  
  Minor points:
  
  Authors said that a sequence is a string $s \in \{A, C, G, T, N\}^*$, so they should explain the procedure used in Definition 2, where only 4 symbols seem to be used. If they discard the N, or consider 4 k-mers (consider that N means "any symbol"), they should say so clearly.
  
  Figures 1 and 2 report two different quantities but say the same thing; maybe one of them can be omitted.
  
  Authors should add some details about the training time of the network.
  
  A final suggestion: probably it will be interesting to use the same deep network with transfer learning (the whole network or just the first sections) to evaluate the gain with ad-hoc training and the different training time.
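  
  Regarding the minor point on the symbol N, the following is a minimal sketch of one possible convention: k-mers containing any symbol outside {A, C, G, T} are simply discarded before the FCGR matrix is filled. This is an assumption for illustration, not necessarily what the authors do; the corner assignment in `CORNER` is likewise one of several conventions in use and is hypothetical here.

```python
from collections import Counter

def count_kmers(seq, k, alphabet="ACGT"):
    """Count k-mers, skipping every window that contains a symbol
    outside `alphabet` (for example the ambiguity code N)."""
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
    return counts

# Assumed (hypothetical) assignment of nucleotides to the corners of the square.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of k-mer frequencies."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for kmer, n in count_kmers(seq, k).items():
        row = col = 0
        for base in kmer:
            r, c = CORNER[base]
            row = 2 * row + r  # each base refines the cell by one bit per axis
            col = 2 * col + c
        matrix[row][col] += n
    return matrix

# Toy sequence: the windows "TN" and "NA" are dropped.
print(dict(count_kmers("ACGTNACG", 2)))
print(fcgr("ACGTNACG", 2))
```

  Whatever convention is chosen (discarding, or expanding N into the four possible k-mers), stating it explicitly makes Definition 2 reproducible.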

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.13.495912v1
www.biorxiv.org www.biorxiv.org

DeePVP: Identification and classification of phage virion protein using deep learning

2
1. GigaScience 01 Mar 2023
  
  in GigaScience
  
  The
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac076), and the reviews have been published under the same license.
  
  Reviewer 1 Satoshi Hiraoka
  
  In this manuscript, the authors developed a new tool, DeePVP, for predicting phage virion proteins (PVPs) using a deep learning approach. The purpose of this study is meaningful. As the authors describe in the Introduction, it is currently difficult to annotate the functions of viral genes precisely because of their huge sequence diversity and the existence of many unknown functions, and there is still much room to improve the in silico annotation of phage genes, including PVPs. Although I'm not an expert in machine learning, the newly proposed deep learning-based method seems appropriate. The proposed tool clearly outperformed previously proposed tools, and thus it might be valuable for further in-depth analysis of many viral genomes. Indeed, the authors conducted two case studies using real phage genomes and reported novel findings that may provide insight into the genomics of the phages. Overall, the manuscript is well written, and I feel the tool has good potential to contribute to the wide field of viral genomics. Unfortunately, I have concerns, including the openness of the source code. I also have some suggestions that would increase the clarity and impact of this manuscript if addressed.
  
  Major: I did not find the DeePVP source code on the GitHub page. Is the tool not open source? I strongly recommend that the authors disclose all scripts of the tool for further validation and secondary use by other scientists, or at least clearly state why the source code needs to be kept private. I was also quite confused by the GitHub page because the uploaded files are not well structured. Scripts and data used for performance evaluation were included in the 'data.zip' file, which should be renamed appropriately. The 'Source code' button on the Releases page strangely links to the 'Supporting_data.zip' file, which contains only the installation manual and no source code. The authors should organize the GitHub page appropriately: for example, upload all source code to the 'main' branch rather than including it in a zip file, and make sure the 'source code' file in Releases contains actual source code files rather than a manual PDF.
  
  According to the Materials and Methods section, (1) using a deep learning approach and (2) using the large dataset retrieved from PhANNs as the training dataset are two of the important improvements over other studies in the PVP identification task. Some readers may suspect that the better performance of DeePVP is mostly due to the larger training dataset rather than the classification method. Is it possible that the previously proposed tools (especially those other than PhANNs), if re-trained on the large PhANNs dataset, could reach better performance than DeePVP?
  
  The name 'Reliability index' (L249) is inaccurate. The score does not support the 'reliability' of the prediction (i.e., whether the predicted genes are truly PVPs or not) but only reflects the fact that the gene is predicted as a PVP by many tools, without considering whether the prediction is correct. The sentence 'A higher n indicates that this protein is predicted as PVP by more tools at the same time, and therefore, the prediction may be more reliable.' in L252 is not logical.
  
  I do not fully agree with the discussion in L294-302 that the tool will facilitate viral host prediction. It is very natural that phages that are phylogenetically close and possess similar genomic structures, including PVP-enriched regions, will infect the same microbial lineage as a host. However, this has not been evaluated systematically across a wide range of phage lineages. In general, almost all phage-host relationships in nature are unknown, except for a small number of specific viruses such as E. coli phages. Further detailed studies are needed on whether and to what degree the conservation of PVP-enriched regions could be a good feature for predicting phage-host relationships. I think phage-host prediction is beyond the scope of this tool, and thus the analysis could be deleted from this manuscript or just briefly mentioned in the Discussion section as a future perspective.
  
  Minor: The URL of the GitHub page would be better given at the end of the Abstract or in the main text, in addition to the 'Availability of supporting source code and requirements' section. This would make it easier for readers to access the homepage and use the tool. Fig 2 and 3: I think it would be better to change the x-axis labels to 0 kb, 20 kb, 40 kb, ..., 180 kb. This would make it easier to understand that the horizontal bar represents the viral genome.
  
  Re-review:
  
  I read the revised manuscript and acknowledge that the authors made efforts to take the reviewers' comments into account. My previous points have been addressed and I feel the manuscript has improved. I think the phrase 'incomplete proteins' in L391-396 should be rephrased as 'partial genes', because here we should consider protein-encoding genes (or protein sequences), not proteins themselves, and the word 'incomplete' is a bit ambiguous.
2. GigaScience 01 Mar 2023
  
  in GigaScience
  
  ABSTRACT
  
  Reviewer 2. Deyvid Amgarten
  
  The manuscript presents DeePVP, a new tool for PVP annotation of a phage genome. The tool implements two separate modules: the main module aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module can further classify predicted PVPs into the ten major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module in the PVP classification task is approximately 3.72% higher than that of PhANNs, a known tool in the area. Overall, the manuscript is well written and clear, and I could not identify any serious methodological inconsistency. I was not sure whether to consider the performance metrics shown as significant improvements, since PhANNs already does a similar job in that regard and is better for some types of PVPs, for example. But I would rather leave this judgment to readers and other researchers in the area. Specifically, I enjoyed the discussion about how one-hot encoded features may be more suitable for predictions than k-mer-based ones and, consequently, how convolutional networks may have an advantage over simple multilayer perceptron networks. This manuscript makes an important contribution to the phage genomics and machine learning fields. I am certain that DeePVP will be helpful to many researchers.
  
  I have a major question about the composition of the dataset used to train the main module: among the PVP proteins, do the authors know whether only the ten types of PVP are present? There is a brief mention of the keywords used to assemble the PhANNs dataset in the discussion (line 340), but that is not clear to me. This will help me understand the following. Line 124: The CNN in the extended module has an output softmax layer, which outputs likelihood scores for 10 types of virion proteins. I wonder whether only proteins from these 10 types were included in the datasets used to train the CNNs. That is, is it possible that a different type of virion protein is predicted by the main module as a PVP? And if so, how would the extended module classify this protein, since it is a PVP but none of the ten types?
  
  Minors:
  
  Line 121: By default, a protein with a PVP score higher than 0.5 is regarded as a PVP. How was this cutoff chosen? Was this part of the k-fold cross-validation process?
  
  Line 157 and elsewhere in the manuscript: I would suggest that the authors not use sentences like "F1-score is 9.05% much higher than that of PhANNs", because 9% may not be a large enough difference to justify the adverb "much". The same applies to "much better" and variations.
  
  About the comparisons between DeePVP and PhANNs: did the authors make sure that instances of the test set were not used to train the PhANNs model being used?
  
  Line 221: What do the authors mean by "more authentic prediction"?
  
  Looking at the GitHub repository, I found it rather unusual that the authors chose to upload only a PDF with instructions on how to install and use the tool. It is very detailed, which I appreciate. The virtual machine and Docker containers are also nice resources to help less experienced users. However, I noticed that the GitHub repository has no clear mention of the source code of the tool. I found it through a mention in the Availability of supporting data, where the authors created a release with the datasets and the scripts. Again, this is very unusual, but I suppose the authors chose this approach because of GitHub's limits on large files.
  
  Table 2: I would like to ask the authors what might be the reason for such low performance metrics for some types of PVP (for example, minor capsid).
  
  Figure 5 states: "Host genus composition of the subject sequences". But there is a "Myoviridae" category, which is a family of phages, not anything related to bacterial hosts. Please verify why this is in the figure.
  
  Re-review:
  
  Thank you for the authors' responses. Most of my concerns were addressed. I have to say, though, that the GitHub page is not yet up to the standards for a bioinformatics tool. I appreciate the source code upload, but I noticed that there was not a single line of #comments in the code I checked. The README file is also not very clear. I do not consider this an impediment to publication (since there is detailed information in GigaScience DB), but it may hinder usage of the authors' tool. Most users will only look at the GitHub repository. I suggest some improvements, in case the authors judge that my comment makes sense. Below I list three examples just to give the authors an idea:
  
  https://github.com/fenderglass/Flye https://github.com/LaboratorioBioinformatica/MARVEL https://github.com/vrmarcelino/CCMetagen
  
  One last concern was about authors' response to the Myoviridae mistake in figure 5. Authors stated that the genus of a phage host is in its name (as for example Escherichia phage XX). But this is a dangerous assumption, since many phage names are outside of this rule. For example, there are many phages with Enterobacteria phage XXX (for instance NC_054905.1 ), meaning that they infect some Enterobacteria. Again, enterobacteria is not a genus. Phage nomenclature may be a mess sometimes, be careful.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.10.23.465539v1
www.biorxiv.org www.biorxiv.org

Benchmarking ultra-high molecular weight DNA preservation methods for long-read and long-range sequencing

2
1. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Studies
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac068), and the reviews have been published under the same license.
  
  Reviewer 1 Tomas Sigvard Klingström
  
  As a researcher who may occasionally use long-read sequencing techniques for projects, I find it immensely helpful to get an insight into the experience accumulated through work related to the Vertebrate Genomes Project (VGP). My personal research interest in the subject is more in understanding why and how DNA fragments during DNA extraction. Due to my work in that area, I have one key question regarding the interpretation of the data presented in figure 2, followed by a number of suggestions for minor edits. The answer on how to interpret figure 2 may require some minor edits, but regardless of this the article is a welcome addition to what we know about good practices for DNA extraction generating ultra-high molecular weight DNA. It should also be noted that the DOI link to Data Dryad seems broken, and I have therefore not looked at the supplementary material.
  
  In figure 2, the size distribution of DNA fragments from the different experiments is visualized. Most of the fragment distributions look as I would have expected based on the work we did in the article cited as nr 25 in the reference list. However, the muscle tissue from rats and the blood samples from the mouse and the frog indicate that there may be a misinterpretation in the article regarding the actual size distribution of fragments, which needs to be looked into. Starting with the mouse plots, and especially the muscle one: there must either have been a physical shearing event that drastically reduced the size of the DNA (using the terminology from ref 25, this would mean that physical shearing generated a characteristic fragment length of approximately 300-400 kb), or the lack of a sharp slope on the rightmost side of the ridgeline plot is due to the way the image was processed. All other animals have a peak on the rightmost side of the ridgeline plot, and the agarose plug should, based on the referenced methods paper [7], generate megabase-sized fragments, which far exceed the size of the scale used in figure 2. I would presume these larger fragments would get stuck in or near the well, which makes it easy to accidentally cut them out during the image analysis step; this may explain their absence in the mouse samples. This leads me to the conclusion that the article is well designed to capture the impact of chemical shearing caused by different preservation methods but would benefit from evaluating whether figure 2 properly covers the actual size distribution of fragments, or only covers the portion of DNA fragments small enough to actually form bands on the PFGE gel, with a substantial part of the DNA stuck in or near the well.
  
  The frog plot is a good example of how this may influence our interpretation of the ridgeline plots. If the extraction method generates high-quality DNA concentrated in the 300-400 kb range, then there must be something very special about the frog DNA from blood, as there is a continuous increase in brightness all the way to the edge of the image. This implies that the sample contains a high amount of much larger DNA fragments than the other samples. I find this rather unlikely, and if I saw this in my own data I would assume that we had a lot of very large DNA fragments that are out of scale for the gel electrophoresis, but that in the case of the frog blood samples many of these fragments have been chemically sheared, creating the "smeared" pattern we see in figure 2.
  
  Minor edits and comments:
  
  Dryad DOI doesn't work for me.
  
  Figure 1 - The meaning of x3 and x2 for the turtle should be described in the caption.
  
  Figure 2 - Having the scale indicator (48.5, 145.5, etc.) at the top as well as the bottom of each column would make it quicker to estimate the distribution of samples.
  
  The article completely omits Nanopore sequencing; is there a specific reason why the lessons here are not applicable to ONT?
  
  There is a very interesting paragraph starting with "The ambient temperature of the intended collecting locality should be a major consideration in planning field collections for high-quality samples. Here we test a limited number of samples at 37°C to". Even if the results were very poor, information about the failed conditions would be appreciated. What tissues/animals did you use, did you do any preservation at all for the samples, and did you measure the fragment length distribution anyway? Simply put, even if the DNA was useless for long-read sequencing, it is an interesting data point for the dynamics of DNA degradation and a valuable lesson for planning sampling in warm climates.
  
  Re-review:
  
  All questions and comments made in my first review are now resolved. I understand the thought process behind the first cropping of figure 2 but appreciate the second version, as it makes it easier for researchers with a limited understanding of the experiment to interpret the data.
2. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer 2. Elena Hilario
  
  I am glad to have been selected as reviewer for the manuscript "Benchmarking ultra-high molecular weight DNA preservation methods for long-read and long-range sequencing" by Dahn and colleagues. The manuscript reports a detailed guide on the effect of preservation methods on the quality of the DNA extracted from a wide range of animal tissues. Although the work is focused only on vertebrates, it is a great foundation for conducting similar studies on plants, invertebrates and fungi, for example. Although the effectiveness of the tissue/preservative combination was only tested with the preparation of long-range libraries, it would have been useful to select one or two cases for long-read sequencing (PacBio or Oxford Nanopore) to explore the impact of the different QC parameters measured in this study.
  
  Minor comments and corrections are included in the uploaded file.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.07.13.451380v2
www.biorxiv.org www.biorxiv.org

Stardust: improving spatial transcriptomics data analysis through space aware modularity optimization based clustering

2
1. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Background
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac075), and the reviews have been published under the same license.
  
  Reviewer 1. Nikos Karaiskos
  
  Reviewer Comments to Author: In this article the authors developed Stardust, a computational method that can be used for spatially-informed clustering by combining transcriptional profiles and spatial information. As spatial sequencing technologies gain popularity, it is important to develop tools that can efficiently process and analyse such datasets. Stardust is a new method that goes in this direction. It is particularly appealing to make use of the spatial information and relationships to cluster gene expression in these datasets. Overall the quality of data used is high and the manuscript is clearly written. The algorithm behind Stardust is simple and consists of an interpolation between spatial and transcriptional distance matrices. A single parameter called space weight controls the contribution of the spatial distance matrix. The authors benchmark Stardust against other recently developed tools in five different spatial transcriptomics datasets by using two measures. Stardust therefore holds the potential of being a useful method that can be applied in different datasets.
  
  Before recommending the manuscript for publication, however, the authors should thoroughly address the following points:
  
  1. What is the rationale behind modelling the contributions as a linear sum of the spatial and transcriptional distance matrices? In particular, why did the authors not consider non-linear relationships as well? As cells neighboring in space often share similar transcriptional profiles (see for instance Nitzan et al., 2019 for this line of reasoning and several examples therein), I would expect product terms to be even more informative.
  
  2. The authors demonstrate Stardust's performance only on datasets obtained with the 10X Visium platform. How does Stardust perform on higher-resolution methods, such as Slide-Seq, Seq-scope, etc.? As ST methods will improve in resolution in the future, it is critical to be able to analyze such datasets as well. An important question here concerns scalability: how well does Stardust scale with the number of cells/spots?
  
  3. In Fig. 1b, conclusions are drawn based on the CSS for different space weights, but only for a clustering parameter of 0.8. What happens for other clustering values? And can the authors comment on why the different space weight values do not perform consistently across the datasets (i.e., 0.5 is better for HBC2 but 0.75 for MK)?
  
  4. The authors compared Stardust with four other tools. The conclusion is that Stardust outperforms all other methods and performs equivalently to BayesSpace. All of these methods, however, rely on choosing specific values for a number of parameters. Did the authors optimize these values when they benchmarked these methods against Stardust?
  
  5. I was able to successfully install Stardust and run it. The resulting clusters in the Seurat object, however, were all NAs. The authors should make an effort to better document how Stardust runs, including the input structure that the tool expects and potential issues that might arise.
  
  Re-review: The authors have successfully addressed all raised points. The introduction of Stardust*, in particular, is a valuable enhancement of the method. Therefore, I recommend the manuscript for publication.
2. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Spatial
  
  Reviewer 2. Quan Nguyen
  
  Reviewer Comments to Author: This work presents a new clustering method, Stardust, that has the potential to improve the stability of clustering results against parameter changes. Stardust can assess the contribution of spatial information to the clustering result relative to gene expression information. Stardust appears to perform better than other methods on the two metrics used in this paper, stability and coefficient of variation. The essence of the method is the use of a spatial transcriptomics (ST) distance matrix defined as a simple linear combination of physical distance (S) and transcriptional distance (T) matrices. A weight factor is applied to the S matrix to control and evaluate the contribution of the spatial information. The effort put into evaluating multiple parameters and comparing with several recent methods across a number of public spatial datasets is a highlight of the work. The authors also made the code available.
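  
  As a rough illustration of the linear combination described above, the sketch below builds a combined distance matrix as T + w * S from toy data. The review does not give Stardust's exact formula, distance measures, or normalisation, so the Euclidean distances, the scaling of each matrix to [0, 1], and all function names here are assumptions made for illustration only.

```python
import math

def pairwise_distances(vectors):
    """Euclidean distance between every pair of rows."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

def scale_to_unit(matrix):
    """Divide by the largest entry so both matrices share a [0, 1] range."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[value / largest for value in row] for row in matrix]

def combined_distance(expression, coordinates, space_weight):
    """Assumed form: transcriptional distance plus weighted spatial distance."""
    T = scale_to_unit(pairwise_distances(expression))
    S = scale_to_unit(pairwise_distances(coordinates))
    n = len(T)
    return [[T[i][j] + space_weight * S[i][j] for j in range(n)] for i in range(n)]

# Three hypothetical spots: spots 0 and 1 have similar expression but lie far
# apart; spots 0 and 2 lie close together but differ in expression.
expression = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
coordinates = [(0, 0), (5, 0), (1, 0)]

for w in (0.0, 0.5, 1.0):
    d = combined_distance(expression, coordinates, w)
    print(w, [round(value, 3) for value in d[0]])
```

  With a single global weight, every pair of spots and every gene is treated identically, which is the limitation raised in the major comments that follow.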
  
  Major comments:
  
  - The concept of combining spatial location and gene expression is not new and has been applied in most spatial clustering methods. It is not clear what the new additions to currently available methods are, except for a feature to weigh the contribution of spatial components to clustering results.
  
  - The approach to assessing the contribution of spatial information, by varying the weight factor from 0 to 1, is rather simple, because the contribution can be nonlinear and can vary between spots/cells (e.g., spatial distance becomes more important for spots/cells that are nearer to each other; some genes are more spatially variable than others; applying one weight factor to all genes and all spots would miss these sources of variation).
  
  - The 5 weight factors 0, 0.25, 0.50, 0.75, and 1 were used. However, this range of parameters provides too few data points to assess the impact of the spatial factor. As seen in the figures, the 5 data points do not strongly suggest a point where the spatial contribution is at its maximum/minimum, due to the large fluctuation of values on the y-axis.
  
  - Although two performance metrics are used (stability and variation), there needs to be an additional metric for how well the clustering results represent the biological ground truth of cell type composition or tissue architecture (for example, by comparing to pathological annotation). Consequently, it is unclear whether the Stardust results are closer to the biological ground truth or not.
  
  - Stardust was tested on multiple 10x Visium datasets, but other types of spatial transcriptomics data, such as seqFISH, Slideseq, MERFISH, etc., are also common. An extended assessment of potential applications to other technologies would be useful.
  
  Minor comments:
  
  - The paragraphs and figure legends in the Result section are repetitive.
  
  - The result section is descriptive and there is no Discussion section.
  
  Re-review:
  
  The authors have improved the initial manuscript markedly. There are a couple of important points regarding comparisons between Stardust and Stardust* that need to be addressed: 1) In which cases does Stardust* improve over Stardust? It seems the results would depend on the biological system (i.e., tissue type). The authors suggest both versions produce comparable results, but given the major change in the formula (replacing a constant weight with variable weights, namely gene expression values normalised to a [0,1] min-max scale), there are likely differences between Stardust and Stardust*. For example, certain genes will have a higher weight than others, making the effects of the weights variable among genes. For this example, the authors may assess highly abundant genes vs low abundant genes. 2) In cases where spatial distances are important, Stardust* could be less accurate than the Stardust version with a high space weight. How does Stardust* handle cases where spatial distance is as important as gene expression?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.04.27.489655v2
www.biorxiv.org www.biorxiv.org

SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data

2
1. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Survival
  
  Reviewer 2. Animesh Acharjee
  
  SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data.
  
  The authors compared many survival analysis methods and created a benchmarking framework called SurvBenchmark. This is an extensive study of survival analysis methods and will be useful for the translational community. I have a few suggestions to improve the quality of the manuscript.
  
  Figure 1: LASSO, EN and Ridge are regularization methods. So, I would suggest including a new category, "regularization" or "penalization methods", and taking those out of the non-parametric models. This also needs to be reflected in the methodology section and the discussion.
  
  Data sets: please provide a table with six clinical and ten omics data sets with number of samples, features and reference link.
  
  Discussion section: How should the method be chosen? What criteria should be used? I understand that one size does not fit all, but some clear guidance would be very useful. Sample size-related aspects also need more discussion. In omics research the number of samples is often very limited, and deep learning-based survival analysis is not feasible, as the authors mention in lines 328-331. So the question arises: when should we use deep learning-based methods, and when should we not?
  
  Reviewer 3. Xiangqian Guo Accept
2. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Abstract
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac071), and the reviews have been published under the same license.
  
  Reviewer 1. Moritz Herrmann
  
  First review: Summary:
  
  The authors conducted a benchmark study of survival prediction methods. The design of the study is reasonable in principle. The authors base their study on a comprehensive set of methods and performance evaluation criteria. In addition to standard statistical methods such as the CoxPH model and its variants, several machine learning methods including deep learning methods were used. In particular, the intention to conduct a benchmark study based on a large, diverse set of datasets is welcome. There is indeed a need for general, large-scale survival prediction benchmark studies. However, I have serious concerns about the quality of the study, and there are several points that need clarification and/or improvement.
  
  Major issues:
  
  The method comparison does not seem fair
  
  As far as I can tell from the description of the methods, the method comparison is not fair and/or not informative. In particular, given the information provided in Supp-Table-3 and the code provided in the GitHub repository, hyperparameter tuning has not been conducted for some methods. For example, Supp-Table-3 indicates that the parameters 'stepnumber' and 'penaltynumber' of the CoxBoost method are set to 10 and 100, respectively. Similarly, only two versions of RSF with fixed ntree (100 and 1000) and mtry (10, 20) values are used. Also, the deep learning methods appear not to be extensively tuned. On the other hand, judging from the code, methods such as the Cox model variants (implemented via glmnet) and MTLR have been tuned at least a little. Please explain clearly and in detail how the hyperparameters have been specified and how hyperparameter tuning has been conducted for the different methods. If, in fact, not all methods have been tuned, this is a serious issue and the experiments need to be rerun under a sound and fair tuning regime.
  
  Description of the study design
  
  Related to the first point, the description of the study design needs to be improved in general, as it does not allow the conducted experiments to be assessed in detail. A few examples that require clarification:
  
  as already mentioned, the method configurations and implementations are not described sufficiently. It is unclear how exactly the hyperparameter settings have been obtained, how tuning has been applied, and why only for some methods
  
  concerning the methods Cox(GA), MTLR(GA), COXBOOST(GA), MTLR(DE), COXBOOST(DE): have the feature selection approaches been applied to the complete datasets or only to the training sets?
  
  Supp-Table-3 lists two implementations of the Lasso, Ridge and Elastic Net Cox methods (via penalized and glmnet); yet, Figure 2 in the main manuscript only lists one version. Which implementations have been used and are reported in Figure 2?
  
  l. 221: it is stated that "the raw Brier score" has been calculated. At which time point(s) and why at this/these time point(s)?
  
  Supp-Table-2: it is stated that "some methods are not fully successful for all datasets", but only DNNSurv is further examined. Is it just DNNSurv or are there other methods that have failed in some iterations? Moreover, what has been done about the failing iterations? Have the missing values been imputed? Are the failing iterations ignored?
  
  I recommend that section 3 be comprehensively revised and expanded, in particular including the method implementations, how hyperparameters are obtained and tuning has been conducted, the aggregation of performance results, and the handling of failing iterations. Moreover, I suggest providing summary tables of the methods and datasets in the main manuscript rather than in the supplement.
  
  Reliability of the presented results
  
  In other studies [BRSB20, SCS+20, HPH+20], differences in (mean) model prediction performance have been reported to be small (while variation over datasets can be large). This can also be seen in Figure 3 of the main manuscript. Please include more analyses on the variability of prediction performances and also include a comparison to a baseline method such as the Kaplan-Meier estimate. Most importantly, if some methods have been tuned while others have not, the reported results are not reliable. For example, the untuned methods are likely to be ill-specified for the given datasets and thus may yield sub-optimal prediction performances. Moreover, if internal hyperparameter tuning is conducted for some methods, for example via cv.glmnet for the Cox model variants, and not for others, the computation times are also not comparable.
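  
  For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of the Kaplan-Meier estimator with right-censored observations. The follow-up times and event indicators are invented for illustration and have no connection to the datasets in the manuscript; the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  -- observed follow-up times
    events -- 1 if the event was observed at that time, 0 if right-censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed_events = 0
        leaving = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            observed_events += data[i][1]
            leaving += 1
            i += 1
        if observed_events > 0:
            survival *= 1 - observed_events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= leaving
    return curve

# Invented example: five subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

  Because this estimator ignores all covariates, any model that fails to beat it on a given dataset is not extracting useful information from the features.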
  
  Clarity of language, structure and scope
  
  I believe that the quality of the written English is not up to the standard of a scientific publication and consider language editing necessary (yet, it has to be taken into account that I am not a native speaker). Unlike related studies [BWSR21, SCS+20, e.g.], the paper lacks clarity and/or coherence. Although clarity and coherence can be improved with language editing, there are also imprecise descriptions in section 2 that may additionally require editing from a technical perspective. For example:
  
  l. 76 - 78: The way censoring is described is not coherent, e.g.: "the class label '0' (referring to a 'no-event') does not mean an event class labelled as '0'". Furthermore, it is not true that "the event-outcome is 'unknown'". The event is known, but the exact event time is not observed for censored observations.
  
  The authors aim to provide a comprehensive benchmarking study of survival analysis methods. However, they do not, for example, provide significance tests for performance differences nor critical differences plots (it should be noted that the number of datasets included may not provide enough power to do so). This is in stark contrast to the work of Sonabend [Son21].
  
  I suggest revising section 2 using more precise terminology and clearly describing the scope of the study, e.g., what type of censoring is being studied, whether time-dependent variables and effects are of interest, etc. I think this is very important, especially since the authors aim to provide "practical guidelines for translational scientists and clinicians" (l. 32) who may not be familiar with the specifics of survival analysis.
  
  Minor issues
  
  l. 43: Include references for specific examples
  
  l. 60: The cited reference probably is not correct
  
  l. 266: "MTLR-based approaches perform significantly better". Was a statistical test performed to determine significant differences in performance? If yes, indicate which test was performed. If not, do not use the term "significant" as this may be misunderstood as statistical significance.
  
  Briefly explain what the difference is between data sets GE1 to GE6.
  
  It has been shown that omics data alone may not be very useful [VDBSB19]. Please explain why only omics variables are used for the respective datasets.
  
  Figure 1: Consider changing the caption to 'An overview of survival methods used in this study' as there are survival methods that are not covered. Moreover, consider referencing Wang et al [WLR19] as Figure 1a resembles Figure 3 presented therein.
  
  Figure 2: Please add more meaningful legends (e.g., title of legend; change numbers to Yes, No, etc.).
  
  Figure 2 a & b: What do the dendrograms relate to?
  
  Figure 2 d: The c-index is not a proper scoring rule [BKG19] (and only measures discrimination), better use the integrated Brier score (at best, at different evaluation time points) as it is a proper scoring rule and measures discrimination as well as calibration.
  
  Figure 3: At which time point is the Brier score evaluated and why at that time point? Consider using the integrated Brier score instead.
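For reference, the two quantities contrasted in the Figure 2 d and Figure 3 comments can be sketched as follows, in the commonly used inverse-probability-of-censoring-weighted form. The notation is an assumption made here and is not taken from the manuscript: $(t_i, \delta_i)$ are the observed time and event indicator of subject $i$, $\hat{S}(t \mid x_i)$ is the predicted survival probability, $\hat{G}$ is the Kaplan-Meier estimate of the censoring distribution, and $t_{\max}$ is a chosen upper evaluation time.

```latex
\mathrm{BS}(t) = \frac{1}{n}\sum_{i=1}^{n}\left[
  \frac{\hat{S}(t \mid x_i)^{2}\,\mathbb{1}(t_i \le t,\ \delta_i = 1)}{\hat{G}(t_i)}
  + \frac{\bigl(1-\hat{S}(t \mid x_i)\bigr)^{2}\,\mathbb{1}(t_i > t)}{\hat{G}(t)}
\right],
\qquad
\mathrm{IBS} = \frac{1}{t_{\max}}\int_{0}^{t_{\max}} \mathrm{BS}(t)\,\mathrm{d}t .
```

The Brier score depends on the chosen time point $t$, which is why the evaluation time has to be reported; the integrated version averages over the whole interval $[0, t_{\max}]$.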
  
  This is rather subjective, but I find the use of the term "framework", especially that the study contributes by "the development of a benchmarking framework" (l. 60), irritating. For example, a general machine learning framework for survival analysis was developed by Bender et al. [BRSB20], while general computational benchmarking frameworks in R are provided, e.g., by mlr3 [LBR+19] or tidymodels [KW20]. The present study conducts a benchmark experiment with specific design choices, but in my opinion it does not develop a new benchmarking framework. Thus, I would suggest not using the term "framework" but better "benchmark design" or "study design".
  
  In addition, the authors speak of a "customizable weighting framework" (l. 241), but never revisit this weighting scheme in relation to the results and/or provide practical guidance for it. Please explain w.r.t. the results how this scheme can and should be applied in practice.
  
  The references need to be revised. A few examples:
  
  - l. 355 & 358: This seems to be the same reference.
  - l. 384: Title missing
  - l. 394: Year missing
  - l. 409: Year missing
  - l. 438: BioRxiv identifier missing
  - l. 441: ArXiv identifier missing
  - l. 445: Journal & Year missing
  
  Typos:
  
  - l. 66: . This
  - l. 89: missing comma after the formula
  - l. 93: missing whitespace
  - l. 107: therefore, (no comma)
  - l. 121: where for each, (no comma)
  - l. 170: examineS
  - l. 174: therefore, (no comma)
  - l. 195: as part of A multi-omics study; whitespace on wrong position; the sentence does not appear correct
  - l. 323: comes WITH a
  
  Data and code availability
  
  Data and code availability is acceptable. Yet, the ANZDATA and UNOS_kidney data are not freely available and require approval and/or request. Moreover, for better reproducibility and accessibility, the experiments could be implemented with a general purpose benchmarking framework like mlr3 or tidymodels.
  
  References
  
  [BKG19] Paul Blanche, Michael W Kattan, and Thomas A Gerds. The c-index is not proper for the evaluation of t-year predicted risks. Biostatistics, 20(2):347-357, 2019.
  
  [BRSB20] Andreas Bender, David Rügamer, Fabian Scheipl, and Bernd Bischl. A general machine learning framework for survival analysis. arXiv preprint arXiv:2006.15442, 2020.
  
  [BWSR21] Andrea Bommert, Thomas Welchowski, Matthias Schmid, and Jörg Rahnenführer. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Briefings in Bioinformatics, 2021. bbab354.
  
  [HPH+20] Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic, and Anne-Laure Boulesteix. Large-scale benchmark study of survival prediction methods using multi-omics data. Briefings in Bioinformatics, 22(3), 2020. bbaa167.
  
  [KW20] M Kuhn and H Wickham. Tidymodels: Easily install and load the 'tidymodels' packages. R package version 0.1.0, 2020.
  
  [LBR+19] Michel Lang, Martin Binder, Jakob Richter, et al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, 4(44):1903, 2019.
  
  [SCS+20] Annette Spooner, Emily Chen, Arcot Sowmya, Perminder Sachdev, Nicole A Kochan, Julian Trollor, and Henry Brodaty. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific Reports, 10(1):1-10, 2020.
  
  [Son21] Raphael Edward Benjamin Sonabend. A theoretical and methodological framework for machine learning in survival analysis: Enabling transparent and accessible predictive modelling on right-censored time-to-event data. PhD thesis, UCL (University College London), 2021.
  
  [VDBSB19] Alexander Volkmann, Riccardo De Bin, Willi Sauerbrei, and Anne-Laure Boulesteix. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Medical Research Methodology, 19(1):1-15, 2019.
  
  [WLR19] Ping Wang, Yan Li, and Chandan K Reddy. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR), 51(6):1-36, 2019.
  
  Re-review:
  
  Many thanks for the very careful revision of the manuscript. Most of my concerns have been thoroughly addressed. I have only a few remarks left.
  
  Regarding 1. Fair comparison and parameter selection
  
  The altered study design appears much better suited to this end. Thank you very much for the effort, in particular the additional results regarding the two tuning approaches. Although I think a single simple tuning regime would be feasible here, using the default settings is reasonable and very well justified. I agree that this is much closer to what is likely to take place in practice. However, it should be more clearly emphasized that better performance may be achievable if tuning is performed.
  
  Regarding 2. Description
  
  Thanks, all concerns properly addressed. No more comments.
  
  Regarding 3. Reliability
  
  I am aware that Figure 2c provides information to this end. I think additional boxplots which aggregate the methods' performance (e.g. for unoc and bs) over all runs and datasets would provide valuable additional information. For example, from Figure 2c one can tell that MTLR variants obtain overall higher ranks based on mean prediction performance than the deep learning methods. However, it says nothing about how large the differences in mean performance are.
  
  Kaplan-Meier-Estimate (KM)
  
  I'm not quite sure I understood the authors' answer correctly. The KM does not use variable information to produce an estimate of the survival function, and I think that is why it would be interesting to include it. This would shed light on how valuable the variables are in the different data sets.
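As a reminder of why the Kaplan-Meier estimate works as a covariate-free baseline, its standard form is sketched below. The notation is assumed here rather than taken from the manuscript: $t_j$ are the distinct observed event times, $d_j$ is the number of events at $t_j$, and $n_j$ is the number of subjects still at risk just before $t_j$.

```latex
\hat{S}_{\mathrm{KM}}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right)
```

No covariates $x_i$ appear in the estimator, so every subject receives the same predicted survival curve. Any gain a model shows over this baseline can therefore be attributed to the variables it uses.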
  
  Regarding 4. Scope and clarity
  
  Thanks, all concerns properly addressed. No more comments.
  
  Minor points:
  
  Since the authors decided to change 'framework' to 'design', note that in Figure 1b it still says 'framework'
  
  l.51 & l.54/55 appear to be redundant
  
  Figure 2 a and b:
  
  Please elaborate on how similarity (reflected in the dendrograms) is defined.
  
  Why is the IBS more similar to Begg's and GH C-Index than to the Brier Score?
  
  Why is the IBS not feasible for so many methods, in particular Lasso_Cox, Ridge_Cox, and CoxBoost?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.07.11.451967v2
www.biorxiv.org www.biorxiv.org

Meta-Prism 2.0: Enabling algorithm for ultra-fast, accurate and memory-efficient search among millions of microbial community samples

1
1. GigaScience 01 Mar 2023
  
  in GigaScience
  
  Abstract
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac073), and the reviews have been published under the same license.
  
  Reviewer 1. Siyuan Ma
  
  Reviewer Comments to Author: In Kang, Chong, and Ning, the authors present Meta-Prism 2, a microbial community analysis framework, which calculates sample-sample dissimilarities and queries microbial profiles similar to those of user-provided targets. Meta-Prism 2 adopts efficient algorithms to achieve the time and memory efficiency required for modern microbiome "big data" application scenarios. The authors evaluated Meta-Prism 2's performance, both in terms of separating different biomes' microbial profiles and time/memory usage, on a variety of real-world studies. I find the application target of Meta-Prism appealing: achieving efficient dissimilarity profiling is increasingly relevant for modern microbiome applications. However, I'm afraid the manuscript appears to be in a poor state, with insufficient details for crucial methods and results components. Some display items are either missing or mis-referenced. As such, I cannot recommend its acceptance unless major improvements are made. My comments are detailed below.
  
  Major
  
  1. The authors claim that, relative to the previous iteration, the biggest improvements are: (1) removal of redundant nodes in 1-against-N sample comparisons; (2) functionality for similarity matrix calculation; (3) exhaustive search among all available samples.
  
  a. (1) seems the most crucial for the method's improved efficiency. However, the details on why these nodes can be eliminated, and how dissimilarity calculation is achieved post-elimination, are not sufficient. The caption for Figure 1C and the relevant Methods text (lines 173-188) should be expanded to at least explain i) why it is valid to calculate (dis)similarity post-elimination based on aggregation, and ii) how aggregation is achieved for the target samples.
  
  b. I may not have understood the authors on (2), but this improvement seems trivial. Is it simply that Meta-Prism 2 has a new function to calculate all pair-wise dissimilarities on a collection of microbial profiles?
  
  c. For (3), it should be made clearer that Meta-Prism 1 does not do this. I needed to read the authors' previous paper to understand the comment about better flexibility in customized datasets. I assume that this improvement is enabled because Meta-Prism 2 is vastly faster than Meta-Prism 1? If so, it might be helpful to point this out explicitly.
  
  2. I am lost on the accuracy evaluation results for predicting different biomes (Figure 2).
  
  a. How are biomes predicted for each microbial sample?
  
  b. What is the varying classification threshold that generates different sensitivities and specificities?
  
  c. Does "cross-validation" refer to e.g. selection of tuning parameters during model training, or to the evaluation of model performance?
  
  d. What are the "Fecal", "Human", and "Combined" biomes for the Feast cohort? Such details were not provided in Shenhav et al.
  
  Moderate
  
  1. I understand that this was previously published, but could the authors comment on the intuitions behind their dissimilarity measure, and how it compares to similar measures such as the weighted UniFrac?
  
  a. Do Meta-Storm and Meta-Prism share the same similarity definition? If so, why would they differ in terms of prediction accuracies?
  
  2. There seems to be some mis-referencing on the panels of Figure 1.
  
  a. Panel B was not explained at all in the figure caption.
  
  b. Line 185 references Figure 1E, which does not exist.
  
  Minor
  
  1. The Meta-Prism 1 publication was referenced with duplicates (#16 and 24).
  
  2. There are minor language issues throughout the manuscript, but they do not affect understanding of the materials. Examples:
  
  a. Line 94: analysis -> analyze
  
  b. Line 193: We also obtained a dataset that consists of ...
  
  Re-review:
  
  I find most of my questions addressed. My only remaining issue is still that the three biomes from FEAST (Fecal, Human, and Mixed) are still not clearly defined. The only definition I could find is line 206-208 "We also obtained a dataset that consists of 10,270 samples belonging to three biomes: Fecal, Human, and Mixed, which have been used in the FEAST study, defined as the FEAST dataset". Are "Fecal" simply stool samples, and "Human" samples biopsies from the human gut? What is "Mixed"? As a main utility of Meta-Prism is source tracking, it is important for the reader to understand what these biomes are, to understand the resolution of the source tracking results. If this can be resolved, I'll be happy to recommend the manuscript's acceptance.
  
  Reviewer 2. Yoann Dufresne
  
  In this article the authors present Meta-Prism 2, a software to compute distances between metagenomic samples and also to query a specific sample against a pool of samples. They call "sample" a precomputed file with the abundance of multiple taxa. In the article they first succinctly present multiple aspects of the underlying algorithms. Then they provide an extensive analysis of the precision, RAM and time consumption of the software. Finally, they show 3 applications of Meta-Prism 2.
  
  I will start by saying that the execution time of the tool looks very good compared to all other tools. But I have multiple concerns about these numbers.
  
  - First, I like to reproduce the results of a paper before approving it, but I had a few problems doing so.
    * The tool does not compile as it is on git. I had to modify a line of code to compile it. This is nothing very bad, but authors of tools should make sure that their main code branch always compiles. See the end of the review for the bug and fix.
    * The analyses are done using samples from MGnify. I found related OTU tsv files linked in the supplementary material, but no explanation of how to transform such files into the pdata files that the software processes.
    * The only way to directly reproduce the results is to trust the pdata files present on the GitHub of the authors. I would like to run my own experiments and compare the time needed to transform OTU files into pdata with the actual run time of MP2.
  - The authors evaluated the accuracy of their method (which is nice) but did not give access to the scripts that were used for that. I would like to see the code and try to reproduce the figure by myself on my own data.
  - The 2nd and 3rd applications are explained in plain text, but there is no related script, nor any table or graphic, to reproduce or explain the results. The only way for me to evaluate this part is to trust the word of the authors. I would like the authors to show me clear and indisputable evidence.
  
  For the methods part it is similar. We have hints on what the authors did, but not a full explanation:
  
  - For the similarity function, I would like to know where it comes from. The cited papers [14] and [24] do not help in understanding the formula. If the function is from another paper, I ask the authors to add a clear reference (paper + section in the paper); if not, I would like the authors to explain in detail why they chose this particular function, how they constructed it and how it behaves.
  - The authors refer multiple times to a "sparse format" applied to disk & cache but never define what they mean by that. I would like to see in this section exactly which data structure is used.
  - In the Fast 1-N sample comparison, the authors write about "current methods" but without citing them. I would like the authors to refer to precise methods/software, succinctly describe them and then compare their methods on top of that. Also in this part, the authors point at figure 1E, which is not present in the manuscript.
  - Figure 1 is not fully understandable without further details in the text. For example, what is Figure 1C4?
  
  I want to point out that the paper is not correctly balanced in terms of content. 1.5 pages for the execution time analysis is too much compared to the 2 pages of methods and less than 1 page of real data applications.
  
  Finally, the authors are presenting a software but are not following the development standards. They should provide unit and functional tests of their software. I also strongly recommend them to create a continuous integration page with the git. With such a tool the compilation problem would not exist.
  
  To conclude, I think that the authors engineered the software very well but did not present it the right way. I suggest that the authors rewrite the paper with strong improvements to the "methods" and "Real data application" sections. Also, to provide a software that is useful in the long term, they have to add guarantees to the code, such as tests and CI.
  
  For all these reasons, I recommend to reject this paper.
  
  --- Bug & Fix ---
  
  make
  mkdir -p build
  g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/loader.o src/loader.cpp
  g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/newickParser.o src/newickParser.cpp
  g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/simCalc.o src/simCalc.cpp
  g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/structure.o src/structure.cpp
  g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/main.o src/main.cpp
  src/main.cpp: In function 'int main(int, const char**)':
  src/main.cpp:128:31: error: 'class std::ios_base' has no member named 'clear'
    128 | buf.ios_base::clear();
        | ^~~~~
  make: *** [makefile:7: build/main.o] Error 1
  
  To fix the bug: src/main.cpp:128 => buf.ios.clear();
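The compiler error arises because `clear()` is declared in `std::basic_ios`, not in its base class `std::ios_base`, so the qualified call `buf.ios_base::clear()` cannot resolve. The sketch below shows one way to reset a stream for reuse. It assumes that `buf` is a `std::stringstream`, which the log does not confirm, and the helper names `reset_stream` and `count_tokens` are hypothetical and not part of the Meta-Prism 2 sources.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the Meta-Prism 2 sources): loads new text into
// a string stream and resets its state flags so it can be read again.
inline void reset_stream(std::stringstream& buf, const std::string& text) {
    buf.str(text);  // replace the buffered contents
    buf.clear();    // reset eofbit/failbit; clear() is a member of std::basic_ios
}

// Reads all whitespace-separated tokens, leaving the stream in the EOF state.
inline int count_tokens(std::stringstream& buf) {
    int n = 0;
    std::string tok;
    while (buf >> tok) {
        ++n;
    }
    return n;
}
```

Calling the unqualified `buf.clear()` avoids naming the base class at all, which sidesteps the lookup problem reported by the compiler.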

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.11.17.387811v1
www.biorxiv.org www.biorxiv.org

PhysiCOOL: A generalized framework for model Calibration and Optimization Of modeLing projects

1
1. GigaScience 01 Mar 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.77), and the reviews have been published under the same license.
  
  Reviewer 1. Cicely Macnamara
  
  The manuscript entitled "PhysiCOOL: A generalized framework for model Calibration and Optimization Of modeLing projects" is succinctly written; its purpose is clear and the software created is simple yet effective. I think improvements could be made to the documentation, allowing a non-expert user to make use of this valuable tool. I also have a few minor comments below. Otherwise I am happy to recommend the publication of this paper.
  
  Minor comments: (1) Could the authors clarify in the paper (where it says PhysiCOOL has partial support for PhysiCell v1.10.3 and higher) whether it is the authors' intention to keep this tool up to date with newer releases of PhysiCell? (2) For the multilevel parameter sweep, the authors suggest that the number of levels and grid parameters can be defined by the user. Do the authors have any suggestions on picking the appropriate number of levels, for example, or could future development include some form of dynamic choice for the number of levels, e.g. stopping when a certain degree of accuracy is reached?
  
  Reviewer 2. Daniel Roy Bergman
  
  This is a very nice addition to the PhysiCell ecosystem. Methods for parameterizing agent-based models are critical, and the ability to do so without expensive computing resources, i.e. HPC, will aid many researchers.
  
  Comments: 1) "Furthermore, experimental data could..., they can be used..." this feels like a run-on sentence. It is unclear who/what "they" is. 2) "bespoke HPC workflows..." Is this referencing DAPT and the PhysiCell-EMEWS workflow? If so, how does PhysiCOOL differ from these? 3) Is PhysiCOOL defining this multilevel sweep approach to parameter estimation? Or is this already established? If the former, please emphasize. If the latter, are there citations? 4) Please emphasize that the "Simple model of logistic growth" is not done with PhysiCell. 5) I needed Python version < 3.11.0 to install physicool
  
  Major revisions: 1) Please check on the issue I had with the motility example and it not generating output files.
  
  Minor revisions: 1) "As for many several computational modelling frameworks..." consider rewording. I would suggest "As with many computational modeling frameworks" 2) "...namely an Extensible..." 3) "...can be employed to randomly sample points within..." 4) Please change notation in Table 2 so that the "* point" columns report the values as coordinates ( , ) rather than like intervals [ , ].
  
  Re-review: The authors addressed all my concerns and I have no further reservations in recommending this manuscript for publication.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.11.17.516671v2
www.biorxiv.org www.biorxiv.org

Improvements to the Gulf Pipefish Syngnathus scovelli Genome

1
1. GigaScience 01 Mar 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.76), and the reviews have been published under the same license.
  
  Reviewer 1. Sven Winter
  
  I am really sorry, and I do not want to sound mean, but this manuscript needs major improvements in structure, writing, and data validation. It violates so many standard practices of scientific writing. I have never seen anybody cite a full title of a previous manuscript. There is absolutely no need for that. The annotation is labeled as an improved annotation, but its results are only listed in the abstract, and it is not mentioned how it is generated anywhere other than the data availability section. That the genome is tagged under RefSeq by NCBI is absolutely unnecessary information in the abstract, this is just a label, and it tells not much about the quality. I would urge the authors to restructure the manuscript. Start with a short description of the species and why the species and its genome is important as an introduction, then focus on a detailed data description with methods and basic results such as assembly statistics (importantly not just scaffold N50 but also on the contig-level!), Busco, Merqury completeness and error rate, genome size estimate, annotation (repeat and gene), etc. There is really no need for 30 pages of useless supplementary tables (please also make sure that next time you sort the files during the submission so that the pdf does not start with 30 pages of tables). The data cannot support any information about gene loss, as there is so much of the assemblies not properly anchored into chromosomes. I would also try to improve the Hi-C contact map figure. There is really no need for the blue and green boxes and the assembly label at the x-axis. I may have overlooked it due to the writing style, but I would like to see mentioned how much of the assembly is in the chromosome-scale scaffolds and how much is unplaced. I like the improved assembly, it just needs a much better presentation in form of a well-structured manuscript, and unfortunately, in its current form, it clearly is not well-structured. 
There are plenty of other data notes available as templates. I personally would always opt for a more traditional manuscript structure (Introduction, Methods, combined Results and Discussion), but that is my personal preference. I hope my comments are helpful, and I am looking forward to seeing a revised version in the future.
  
  Re-review:
  
  Thank you for the improvement of the manuscript. It is now easier to follow and includes more information than before. It was a bit difficult to see the changes as they were not highlighted and the lines are not numbered. Despite that, I have only a few minor comments that should be addressed easily so that the manuscript will be ready for publication soon. Line numbers in the comments refer to lines of the specific paragraph/section.
  
  DNA and RNA extraction: L7: "such as"? If you listed all tissues, please remove "such as"; if you sequenced RNA for more tissues, please add them.
  
  Sequencing and Assembly: L5: 159 bp is an uncommon read length. Was this just a typo, or how did that come to be? L10: remove "the" before juicer; otherwise, it sounds like an actual fruit juicer instead of a bioinformatics tool ;-). Same for 3D-DNA in the line below. Please make it more clear in the text if you sequenced the RNA for each tissue separately or in one library. L11-12: I am not convinced that not allowing for correction was the right approach. Did you test how the results would look with corrections enabled?
  
  Assembly Statistics and Quast Results: Quast calculates assembly statistics, so I am not sure why the header needs to include both. L5: Please avoid using "better" and instead rephrase so that it is clear that the NG50 is 1.75x larger than that of the previous assembly. "Better" is not clear.
  
  Busco and Merqury results: I would not claim that Busco says the genome is 95% complete, as Busco only tries to find genes that are supposedly orthologous in Actinopterygii. So I would rather say Busco suggests a high completeness as it finds 95% of the orthologs. Also, all genes in the Busco dataset are supposed to be single-copy orthologs; therefore, I would not say that 93% are conserved single-copy orthologs, as the remaining duplicated or fragmented genes could just be assembly errors. Please also state the Merqury QV value, and I would suggest stating the error rate in %. I still find the discussion about missing Busco genes strange, as since Busco 4 or 5 the datasets have all become much larger and the Busco completeness values have gone down in most assemblies, even in well-studied taxa such as mammals. With recent datasets, it is very unlikely to get much more than 95-97%. In my opinion, it is rather a sign of too large and incorrect Busco datasets than evidence for missing orthologs. I would at least add that point to the discussion.
  
  Table 1: Please follow standard practice in scientific writing and add separators to the numbers in all tables (main text and supplementary), e.g., 28444102 → 28,444,102. Otherwise, they are difficult to read.
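Where table values are generated by a script, separators of this kind can be added automatically. The following is a minimal sketch; the helper name `with_separators` is hypothetical, and it handles non-negative integers only.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: formats a non-negative integer with comma thousands
// separators, e.g. 28444102 becomes "28,444,102".
inline std::string with_separators(std::uint64_t value) {
    std::string digits = std::to_string(value);
    // Insert a comma before every group of three digits, counting from the right.
    for (int pos = static_cast<int>(digits.size()) - 3; pos > 0; pos -= 3) {
        digits.insert(static_cast<std::size_t>(pos), ",");
    }
    return digits;
}
```

Values with three digits or fewer are returned unchanged.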
  
  Annotation Results: L3: 20,101 coding genes, 18,616 genes … Please check throughout the whole manuscript for consistent style.
  
  Data Availability: L2: Annotation report release 100. What does "100" stand for? Also, "at here" sounds not correct; please remove "at". L4: Table S2 does not show the scaffold identifiers. L5: please state the complete BioProject accession not just the numerical part.
  
  Supplementary data: Please change numbers in all tables to standard format e.g., 21,671,036
  
  Reviewer 2. Yue Song
  
  (1) Please state clearly how much CCS Hi-Fi data was produced by sequencing and how much Hi-C data was finally used for chromosome assembly after filtering, not just the number of reads.
  
  (2) Please state clearly the estimated genome size using Hi-Fi data.
  
  (3) What is the process for “correct primary assembly misassembles”? Please describe it in detail.
  
  (4) In Table 1, I noticed that the difference between the new and previous genome of S. scovelli is more than 100 Mb (about 25% of the size of the new assembly). Moreover, the genome sizes of most Syngnathus species range from 280 to 340 Mb, so I think some explanation of these extra sequences is necessary.
  
  (5) More detailed parameters and a more detailed description of the genome assembly and gene annotation process are needed.
  
  (6) Did the previous version have any assembly errors that were corrected in this new one? If so, please state this.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.23.525209v1
Feb 2023
www.biorxiv.org www.biorxiv.org

BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data

3
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Dadi Gao
  
  Summary: The authors developed a de novo assembly method, BrumiR, for small RNA sequencing data based on the de Bruijn graph algorithm. This tool displayed a relatively high sensitivity in finding miRNAs and helped the authors discover a novel miRNA in A. thaliana roots.
  
  Major comments:
  
  Have the authors compared the performance with different seed lengths? Even if the minimal miR length is 18nt in MiRBase 21, seed=18 might not necessarily lead to the best AUC or F score (this might also be related to Comment 4).
  
  The authors need to benchmark BrumiR with more existing tools (e.g. those ML-based methods), and to include more genome-free methods (e.g. MiRNAgFree).
  
  It is also interesting to know whether de novo method for mRNA assembly would be useful on the miRNA side. It would be great if the authors were able to compare the performance of BrumiR2reference (without filtering for RFAM) with Trinity in genome-guided mode, by tweaking its seed length to be the same as BrumiR.
  
  The tool's sensitivity is promising across animal and plant datasets. However, the average precision is quite low: an average precision of 0.3 means a false discovery rate of 0.7. This is not an acceptable value for a tool designed to discover novel miRNAs. Is there any parameter the authors could tweak towards a better performance? For example, is a seed length of 18nt too short to start with? Are there any other sequence features the authors should take into account to boost the performance? Or maybe some post-assembly filtering approaches might be sufficient and helpful.
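The arithmetic behind this comment follows directly from the definitions, with TP and FP denoting the numbers of true and false positive predictions:

```latex
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{FDR} = \frac{FP}{TP + FP} = 1 - \mathrm{precision},
\qquad
\text{so}\quad \mathrm{precision} = 0.3 \;\Rightarrow\; \mathrm{FDR} = 0.7 .
```

In other words, at this precision roughly 70 of every 100 predicted miRNAs would be expected to be false positives.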
  
  Wet-lab validation (e.g. Luciferase assay) for the identified novel miRs will leverage the real-life usefulness of BrumiR. This is extremely important, as the tool showed a high false discovery rate.
  
  Minor comments:
  
  MiRNA maturation involves RNA editing. Can the authors comment on how this would be handled and captured by BrumiR? It seems that the authors allow mismatches when clustering the potential miRNAs via the edlib library. It would be interesting to know whether, or to what extent, edlib would help in including RNA-edited candidates in the final result.
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Marc Friedlander
  
  The authors here present BrumiR, a de Bruijn-based method to discover miRNAs independently of a reference genome. Today most miRNA discovery and annotation is done by mapping sequenced RNAs to readily available reference genomes and analyzing the mapping profiles. However, there are some use cases where the genome-free approach is needed (particularly for species that have no reference genome or where the genomes have missing parts); therefore BrumiR could potentially be useful for the community. However, the comparison to existing tools needs to be done in a more careful way.
  
  Major comments:
  
  RFAM filtering is not really part of the prediction step, this is rather a filtering step. Therefore, to make a fair comparison with mirnovo (the other genome-free tool), BrumiR should additionally be run without RFAM filtering, and mirnovo should additionally be run using the exact same RFAM filtering.
  
  it appears that 16-mers from miRBase miRNAs were specifically excluded from the RFAM catalog used for the filtering, which is reasonable. However, the miRNAs from the exact benchmarked species should not be included in the used miRBase 16-mer catalog, to avoid circular reasoning.
  
  miRDeep2 software should ideally not be run with default options - this is particularly important since the miRDeep2 performance in this manuscript appears lower than what is reported in other studies (e.g. Friedlander et al. 2012). First, reference mature miRNAs from a related and well-annotated species should be included to support the prediction. Second, a score cut-off should be used that gives a decent signal-to-noise ratio according to the miRDeep2 output overview table (for instance 5:1). Third, all read pre-processing and genome mapping should be performed with the mapper.pl script which is part of the miRDeep2 package.
  
  it appears that only miRNA-derived sequences were included in the simulated data. In fact, real small RNA-seq data typically contains fragments from other known types of RNA and also sequences from unannotated parts from the genome. Therefore, the authors should use simulated data that also includes samples from RFAM and randomly sampled sequences from the reference genome (for instance 10% of each). Overall, the use of simulated sequence data could be put a bit in the background in this study, since real small RNA-seq data is in fact readily available these days and typically has a structure that is not easy to simulate. Further, there is little reason not to use real data, since the miRNAs in miRBase tend to be reasonably well curated for most species and therefore can function well as a gold standard for benchmarking.
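
  The suggested composition of the simulated data (for instance 10% RFAM fragments and 10% random genomic sequences, the rest miRNA-derived) could be sketched as follows. The function name, the input pools and the sampling-with-replacement scheme are assumptions for illustration only; a realistic simulator would also model abundance and length distributions.

```python
import random


def simulate_mixture(mirna_reads, rfam_reads, genome_reads,
                     n_total, background_fraction=0.10, seed=0):
    """Draw n_total labelled reads: background_fraction from RFAM,
    background_fraction from random genomic sequence, the rest from miRNAs."""
    rng = random.Random(seed)
    n_rfam = round(n_total * background_fraction)
    n_genome = round(n_total * background_fraction)
    n_mirna = n_total - n_rfam - n_genome
    sample = (
        [("miRNA", rng.choice(mirna_reads)) for _ in range(n_mirna)]
        + [("RFAM", rng.choice(rfam_reads)) for _ in range(n_rfam)]
        + [("genome", rng.choice(genome_reads)) for _ in range(n_genome)]
    )
    rng.shuffle(sample)  # interleave the three sources
    return sample


# Toy pools of made-up sequences.
reads = simulate_mixture(["TGAGGTAGTAGGTTGTATAGTT"], ["GCATTGGTGGTTCAGTGG"],
                         ["ACGTACGTACGTACGTACGT"], n_total=100)
```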
  
  precision of BrumiR is in some cases lower than 0.2, for instance for one mouse dataset. From this dataset ~3000 mouse miRNAs are reported - the majority of which are not in miRBase and can reasonably be presumed to be false positives. The authors should comment on why this particular dataset appears to produce so many false positives for BrumiR - could this have to do with the prevalence of piRNAs that the software cannot easily discern from miRNAs? Also, the authors should reflect on what kinds of use cases could tolerate these thousands of false positives. Would this be for generating candidates for downstream high-throughput validation?
  
  the authors should either benchmark BrumiR against the genome-free methods miReader and MirPlex, or explain why this comparison is not relevant.
  
  Minor comments:
  
  the brief introduction to miRNA biology should be carefully edited by an expert in the field. Currently, very old reviews are being cited (e.g. Bartel 2004), and some of the other references appear to be a bit spurious (e.g. why focus on plant host-pathogen interactions out of the hundreds of established functions of miRNAs?). The excellent review of Dave Bartel from 2018 contains references to numerous milestone studies that the introduction could build on.
  
  the authors write on page 2 that genome-based methods struggle with a high rate of false positive prediction, citing [9]. However, this is a mis-reference, since the reference [9] states that methods that rely on only the genome and do not leverage on small RNA-seq data have high false positive rates.
3. GigaScience 17 Feb 2023
  
  in GigaScience
  
  AbstractMicroRNAs (miRNAs)
  
  This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Ernesto Picardi
  
  The manuscript by Moraga et al. describes BrumiR, a software tool devoted to the de novo identification of miRNAs from deep sequencing experiments of the low-molecular-weight RNA fraction. In contrast with existing tools, BrumiR is based on de Bruijn graphs, generated directly from raw fastq reads. The performance on simulated and real sequencing data, in terms of precision, recall and F-score, is very good. In addition, the tool is ultra-fast, enabling the analysis of huge amounts of data. I have tried to use BrumiR but I always got a GLIB error. I have tested the script on different Linux and Mac computers but I was not able to fix the GLIB error. It seems that a very recent version of the GLIB library is required. So, unfortunately, I didn't have the possibility to test the program and look at the outputs.
  
  Major concerns:
  
  I was not able to run the program and, thus, provide a correct revision. In my opinion, the github page should take into account this by providing the minimal software and hardware architecture to run BrumiR. Authors could also include a copy of the output files (by the way, there is a typo in the description of the second output file).
  
  Since the tool is able to identify novel miRNAs and also look at known ones, the authors could provide an output file including the read count per miRNA. In addition, since the tool is expected to be ultra-fast (not checked … see above), the differential gene expression analysis could also be implemented.
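
  A read-count-per-miRNA table of the kind requested here could look like the following sketch. It assumes exact sequence matches between reads and candidate miRNAs, which is a simplification (isomiRs and mismatches are ignored), and all names and sequences are hypothetical.

```python
from collections import Counter


def count_reads_per_mirna(reads, mirnas):
    """Count reads that exactly match each candidate miRNA sequence.

    mirnas maps a candidate name to its sequence; candidates with no
    matching read are reported with a count of zero.
    """
    seq_to_name = {seq: name for name, seq in mirnas.items()}
    counts = Counter({name: 0 for name in mirnas})
    for read in reads:
        name = seq_to_name.get(read)
        if name is not None:
            counts[name] += 1
    return dict(counts)


candidates = {"cand-1": "TGAGGTAGTAGGTTGTATAGTT", "cand-2": "TAGCTTATCAGACTGATGTTGA"}
reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 3 + ["ACGTACGT"]
print(count_reads_per_mirna(reads, candidates))
```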
  
  I suggest also to implement a graphical output. A sort of summary in a decorated html page.
  
  Using BrumiR, the authors analyze miRNAs in Arabidopsis during development, discovering three novel miRNAs. Although the bioinformatic evidence indicates that they could be real miRNAs, experimental validation is required. Indeed, these miRNAs have been detected by BrumiR only. I think that this validation could be done easily because the authors generated the sRNA-seq data themselves. In my opinion, this experiment could really improve the manuscript and confirm the high performance of BrumiR.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.08.07.240689v1
www.biorxiv.org www.biorxiv.org

Disease classification for whole blood DNA methylation: meta-analysis, missing values imputation, and XAI

2
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac097 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Giulia De Riso
  
  In this study, a workflow is presented to generate classification models from DNA methylation data. Methods to deal with harmonization and missing data imputation are presented and the benefit of adopting them for classification tasks is tested on case-control datasets of schizophrenia and Parkinson disease. The authors support this workflow with source code. Although mostly based on already known methodologies, the present study may help orient studies aimed at building and applying DNA methylation based models. However, some major concerns can be raised:
  
  Majors: At different points of the manuscript, the authors refer to their approach as a pipeline. Indeed, this approach should be composed of sequential modules, in which the output of a module becomes the input of the next one. Although the modules are clearly distinguishable, their organization in the pipeline is less straightforward (also considering that modules can be adopted both to build a model and to use it on new data). The authors could consider drawing a scheme of the pipeline, or adopting a different term to refer to the presented approach. From the model performance perspective, the ML models perform poorly for schizophrenia. The authors point to inner characteristics of the disease as a possible reason for this. However, this point should be better commented on in the Discussion section.
  
  Besides this, the impact of the smaller number of samples included in the training set and the higher proportion of imputed features compared to Parkinson disease on the classification accuracy should be discussed. In addition, since the authors provided the code, is there a way to select samples to include in training/test sets based on random choice (classical 70-30% splitting) instead of source dataset? "For machine learning models, we used only those CpG sites that have the same distribution of methylation levels in different datasets in the control group (methylation levels in the case group typically have greater variability because of disease heterogeneity).": is this filtering performed only on the datasets included in the training set, or also on the test set? It seems the former, but the authors should clearly state this point. Accuracy with weighted averaging should be defined with a formula in the Methods section. Regarding the ML models, the authors chose different types of decision-tree ensembles, along with a deep learning one. They should contextualize this choice (why different models from the same family?).
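
  The classical random 70-30% split mentioned above can be sketched with the standard library alone. The function name, the sample identifiers and the fixed seed are illustrative assumptions; the authors' code splits by source dataset, and a practical version would probably also stratify by case/control status.

```python
import random


def random_split(sample_ids, train_fraction=0.7, seed=42):
    """Shuffle sample identifiers and cut them into train and test lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical sample identifiers pooled across all source datasets.
samples = [f"sample_{i}" for i in range(10)]
train, test = random_split(samples)
print(len(train), len(test))
```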
  
  In addition, ML models built on DNA methylation are often based on elastic net or Support-Vector Machines, which are not accounted for in this work. The authors should comment on this aspect in limitations, and state whether the code they provided for their approach could be customized to adopt different models from the ones they presented.
  
  Regarding the Imputation Method column in Table 2, the meaning is not clear. Are the different imputation methods described in the Imputation of missing values section paired with the ML models presented in Table 2? If yes, some of the methods (like KNN) are missing. In the harmonization section, models for case-control classification are trained on different numbers and sets of CpGs. To assess the effect of harmonization alone, the number of CpGs should instead be fixed. This is especially critical for schizophrenia, where the number of features for the non-harmonized data is 35,145 whereas the one for harmonized data is 110,137. Dimensionality reduction section: are the models from imputed and non-imputed data trained only on harmonized data? And how are the sets of 50,911 CpG sites for Parkinson and 110,137 CpG sites for schizophrenia selected?
  
  Imputation of missing values section: it is not clear on which CpGs and on which samples imputation is performed. Also, it is not clear whether the imputation has been tested on the best-performing model.
  
  Minors: Page 1, line 2: "DNA methylation is associated with epigenetic modification". DNA methylation is an epigenetic mark itself. Do the authors mean histone marks?
  
  Page 1, from line 7: "DNA methylation consists of binding a methyl group to cytosine in the cytosine-guanine dinucleotides (CpG sites). Hypermethylation of CpG sites near the gene promoter is known to repress transcription, while hypermethylation in the gene body appears to have an opposite, also less pronounced effect.": references should be added
  
  Page 2, from line 2 : "Current epigenome-wide association studies (EWAS) test DNAm associations with human phenotypes, health conditions and diseases.": references should be added
  
  Page 3: "In most cases, an increase in dimensionality does not provide significant benefits, since lower dimensionality data may contain more relevant information". This point could be presented in a reverse way (higher dimensionality data may contain redundant information), introducing the collinearity issue. In addition, this issue could be introduced before the missing values and imputation section.
  
  Page 3: references for "Modern machine-learning-based artificial intelligence systems are powerful and promising tools" could be more specific for the field of epigenetics and DNA methylation.
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac097 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Liang Yu
  
  Comments to Author: The paper by Kalyakulina et al. described the disease classification for whole blood DNA methylation. The author proposed a comprehensive approach of combined DNA methylation datasets to classify controls and patients. The solution includes data harmonization, construction of machine learning classification models, dimensionality reduction of models, imputation of missing values, and explanation of model predictions by explainable artificial intelligence algorithms. For Parkinson's disease and schizophrenia, the author also demonstrates that a method for classifying healthy individuals and patients with various disorders based on whole blood DNA methylation data is an efficient and comprehensive approach.
  
  Overall, the manuscript is well organized. I have some suggestions for the authors to improve their work:
  
  The manuscript has constructed different models for the prediction study of CpG sites for different types of data. It is suggested to add a flowchart of the whole model construction process to the manuscript so that readers can understand the study more clearly.
  
  In Figure 4, the author only shows the top 10 important features and marks the highest accuracy and number of features with black lines in the figure. It is recommended to show the relevant data (optimal accuracy and number of features) in the figure. Please also label the three subplots included in the figure separately, e.g., A, B, and C.
  
  A remark concerning model performance evaluation: the authors should provide standard deviations of the obtained values.
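
  Reporting a standard deviation alongside each performance value needs only repeated evaluations (for example cross-validation folds or random restarts). The accuracies below are hypothetical placeholders, not values from the manuscript.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the mean and sample standard deviation of repeated scores."""
    return mean(scores), stdev(scores)


# Hypothetical accuracies from three cross-validation folds.
fold_accuracies = [0.80, 0.85, 0.90]
avg, sd = summarize(fold_accuracies)
print(f"accuracy = {avg:.3f} +/- {sd:.3f}")
```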
  
  In this manuscript, the author used graphs to present the results; I suggest adding a table summarizing the performance results of the model, which would be more intuitive.
  
  I could not find how the authors optimized the hyper-parameters; grid search is the usual choice.
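
  For readers unfamiliar with it, grid search simply evaluates every combination of candidate hyper-parameter values and keeps the best one. The sketch below is generic: the parameter names, the grid and the toy scoring function are assumptions, and in practice `score_fn` would train a model and return a validation score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a validation score, peaking at max_depth=6, learning_rate=0.1."""
    return -(params["max_depth"] - 6) ** 2 - 100 * (params["learning_rate"] - 0.1) ** 2


grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
best, score = grid_search(grid, toy_score)
print(best)
```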
  
  The authors do not adequately address how their method outperforms existing methods in the discussion section.
  
  The "Dimensionality reduction" section: I think this section is more appropriately called "feature selection", specifically a sequential forward search method: first sort the features according to their importance values, then add or remove features from a candidate subset while evaluating the criterion.
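
  The procedure described in this comment can be sketched as a greedy loop over importance-ranked features. This simplified version only adds features (it never removes one), and the CpG identifiers and the toy criterion are hypothetical; in practice `evaluate` would return something like cross-validated accuracy.

```python
def forward_selection(ranked_features, evaluate):
    """Walk through features in importance order, keeping each one only if
    it improves the criterion returned by evaluate(feature_subset)."""
    selected = []
    best_score = float("-inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        score = evaluate(candidate)
        if score > best_score:
            selected, best_score = candidate, score
    return selected, best_score


INFORMATIVE = {"cg01", "cg03"}  # hypothetical truly useful CpG sites


def toy_criterion(features):
    """Reward informative features, penalize subset size."""
    return len(INFORMATIVE & set(features)) - 0.1 * len(features)


chosen, score = forward_selection(["cg01", "cg02", "cg03"], toy_criterion)
print(chosen)
```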

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.05.10.491404v2
www.biorxiv.org www.biorxiv.org

https://biorxiv.org/cgi/content/10.1101/2021.08.20.457128

2
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  AbstractRecent studies
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac094 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Milad Miladi
  
  In this work, Tombacz et al. provide a Nanopore RNA sequencing dataset of SARS-CoV-2 infected cells in several timepoints and sequencing setups. Both direct RNA-seq and cDNA-seq techniques have been utilized, and multiplex barcoded sequencing has been done for combining the samples. The dataset can be helpful to the community, such as for future transcriptomic studies of SARS-CoV-2, especially for studying the infection and expression dynamics. The text is well written and easy to follow. I find this work valuable; however, I can see several limitations in the analysis and representation of the results.
  
  Notably, the figures and tables representing statistical and biological insights of the data points are underworked, lack clarity, and provide limited information about the experiment. Further visualizations, analysis, and data processing could help to reveal the value and insights from this sequencing experiment.
  
  Comments: The presentations of read coverage and lengths in Figs 1 & 2 are elementary, unpolished, and non-informative. Better annotation and labeling in Fig. 1 would be needed. Stacking so many violin plots in Fig 2 does not provide any valuable information and would only misguide. What are the messages of these figures? What do the authors expect the readers to catch from them? As noted, stacking many similar figures does not add further information. The authors may want to consider alternative representations and aggregation of the information, besides or replacing the current plots. For example, in Fig. 2, scatter/line plots for the median & 25/75% percentile ranges, with an aggregation of the three replicates in one x-axis position, could help identify potential trends over the time points.
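
  The aggregation suggested for Fig. 2 could be prepared as in the sketch below, which pools the replicates of each time point and reports the median and quartiles that a line plot would then display. The data layout (time point mapped to a list of per-replicate read-length lists) and the numbers are assumptions for illustration.

```python
from statistics import median, quantiles


def summarize_by_timepoint(read_lengths):
    """Pool replicates per time point and compute q1, median and q3.

    read_lengths maps a time point to a list of per-replicate length lists.
    """
    summary = {}
    for timepoint, replicates in sorted(read_lengths.items()):
        pooled = [length for replicate in replicates for length in replicate]
        q1, _, q3 = quantiles(pooled, n=4)  # three cut points: q1, q2, q3
        summary[timepoint] = {"q1": q1, "median": median(pooled), "q3": q3}
    return summary


# Hypothetical read lengths for two time points, three replicates each.
toy = {4: [[50, 150], [250], [350, 450]], 1: [[100, 200], [300, 400], [500]]}
stats = summarize_by_timepoint(toy)
```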
  
  It is better to start the paper by presenting the current Fig.3 as the first one. This figure is the core of contributions and methodologies, and current Figs 1&2 are logical followups of this step.
  
  There is a very limited description in the Figure Legends. The reader should be able to understand essential elements of the figures merely based on the Figure and its legend.
  
  This study does not provide much notable biological insight without demultiplexing the reads of each experimental condition into genomic and subgenomic subsets. Distinguishing the genomic and subgenomic reads and analyzing their relative ratio is essential in this temporal study. Due to the transcription process of coronaviruses, the genomic and subgenomic reads have very different characteristics, such as length distribution and cellular presence. Genomic and subgenomic reads can be reliably identified by their coverage and splicing profiles, for sufficiently long reads. It is essential that the authors further process the data by categorizing the genomic/subgenomic reads and then provide statistics such as read length for each category. It would also be interesting to observe the ratio of genomic vs. subgenomic reads. This is an indicative metric of the infection state of the sample. An active infection has a higher subgenomic share, while, e.g., a very early infection stage is expected to have a larger portion of genomic reads.
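
  One simple way to approximate the requested categorization is to look for a long gap in each read's alignment to the viral genome, since the leader-to-body junction of a subgenomic RNA appears as a large skipped region. The sketch below assumes alignments whose CIGAR strings encode that junction as an `N` or `D` operation, and the 1000-nt `min_gap` threshold is an arbitrary illustrative value; a careful analysis would also check where the junction starts and ends.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_subgenomic(cigar, min_gap=1000):
    """Call a read subgenomic if its alignment skips >= min_gap reference bases."""
    return any(op in "ND" and int(length) >= min_gap
               for length, op in CIGAR_OP.findall(cigar))


def subgenomic_fraction(cigars, min_gap=1000):
    """Fraction of viral reads classified as subgenomic."""
    cigars = list(cigars)
    if not cigars:
        raise ValueError("no alignments given")
    return sum(is_subgenomic(c, min_gap) for c in cigars) / len(cigars)


# Hypothetical CIGAR strings: two with a long junction gap, two without.
toy_cigars = ["65M21000N1500M", "3000M", "70M25000N900M", "5M2D2000M"]
print(subgenomic_fraction(toy_cigars))
```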
  
  Page-3: "[..] the nested set of subgenomic RNAs (sgRNAs) mapping to the 3'-third of the viral genome". Is 3'-third a typo? Otherwise, the text is not understandable.
  
  Page-4: " because after a couple of hours, the virus can initiate a new infection cycle within the noninfected cells." More context and elaboration by citing some references can help to support the authors' claim. A gradual infection of non-infected cells can be assumed. However, "a couple of hours" and "initiate a new infection cycle" need further support in a scientific manuscript. The infection process is fairly gradual, but the wording here implies a sudden transition to infecting other cells only at a particular time point.
  
  Page-4: "[..]undergo alterations non-infected cells during the propagation therefore, we cannot decide whether the transcriptional changes in infected are due to the effect of the virus or to the time factor of culturing." This can be strong support for why this experiment has been done and for the value of this dataset. I would suggest mentioning this in the abstract to highlight the motivation.
  
  Page-4: "based studies have revealed a hidden transcriptional complexity in viruses [13,14]" Besides Kim et al., the first DRS experiments of coronaviruses have not been cited (doi.org/10.1101/gr.247064.118, doi.org/10.1101/2020.07.18.204362, doi.org/10.1101/2020.03.05.976167)
  
  Table-1: dcDNA is quite an uncommon term. In general, here and elsewhere in the text, insisting on a direct cDNA is a bit misleading. A "direct" cDNA sequencing is still an indirect sequencing of RNA molecules!
  
  Figs S2 and S3: Please also report the ratio of virus to host reads.
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac094 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows: Reviewer name: George Taiaroa
  
  The authors provide a potentially useful dataset relating to transcripts from cultured SARS-CoV-2 material in a commonly used cell line (Vero). Relevant sequence data are publicly available and descriptions on the preparation of these data are for the most part detailed and adequate, although this is lacking at times.
  
  Although the authors state that this dataset overcomes the limitations of available transcriptomic datasets, I do not believe this to be an accurate statement; based on comparable published work in this cell line, transcriptional activity is expected to peak at approximately one day post infection (Chang et al. 2021, Transcriptional and epi-transcriptional dynamics of SARS-CoV-2 during cellular infection), with the 96 hour period of infection described likely representing overlapping cellular infections of different stages.
  
  Secondly, many in the field have moved to use more appropriate cell lines in place of the Vero African Monkey kidney cell line, to better reflect changes in transcription during the course of infection in human and/or lung epithelial cells (See Finkel et al. 2020, The coding capacity of SARS-CoV-2). Lastly, the study would ideally be performed with a publicly available SARS-CoV-2 strain, as has been the case for earlier studies of this nature to allow for reproducibility and extension of the work presented by others.
  
  That said, the data are publicly available and could be of use.
  
  Primary comments:
  
  I think that a statement detailing the ethics approval for this work would be essential, given that the materials used were collected posthumously from a patient. Similarly, were these studies performed under appropriate containment, given classifications of SARS-CoV-2 at the time of the study?
  
  I do not know what the authors mean in reference to a 'mixed time point sample' for the one direct RNA sample in this study; could this please be clarified?
  
  Secondary comments:
  
  I believe the authors may over-simplify discontinuous extension of minus strands in saying that 'The gRNA and the sgRNAs have common 3'-termini since the RdRP synthesizes the positive sense RNAs from this end of the genome'. Each of the 5' and 3' sequences of gRNAs/sgRNAs is shared through this process of replication.
  
  'Infections are typically carried out using fresh, rapidly growing cells, and fresh cultures are also used as mock-infected cells. However, gene expression profiles may undergo alterations non-infected cells during the propagation therefore, we cannot decide whether the transcriptional changes in infected are due to the effect of the virus or to the time factor of culturing. This phenomenon is practically never tested in the experiments.' I do not follow what these sentences are referring to.
  
  'Altogether, we generated almost 64 million long-reads, from which more than 1.8 million reads mapped to the SARS-CoV-2 and almost 48 million to the host reference genome, respectively (Table 1). The obtained read count resulted in a very high coverage across the viral genome (Figure 1). Detailed data on the read counts, quality of reads including read lengths (Figure 2), insertions, deletions, as well as mismatches are summarized Supplementary Tables.' Could this perhaps be more appropriately placed in the data analysis section, rather than background?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.08.20.457128v1
www.biorxiv.org www.biorxiv.org

Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other non-standard architectures in genome assemblies

3
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  AbstractRecent technological
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Kamil S. Jaron
  
  Assembling a genome using short reads quite often results in a mixed bag of scaffolds representing uncollapsed haplotypes, collapsed haplotypes (i.e. the desired haploid genome representation) and collapsed duplicates. While there are individual software tools for collapsing uncollapsed haplotypes (e.g. HaploMerger2, or Redundans), there is no established workflow or standard for quality control of finished assemblies. Naranjo-Ortiz et al. describe a pipeline attempting to make one.
  
  The Karyon pipeline is a workflow for assembling haploid reference genomes, while evaluating the ploidy levels on all scaffolds using GATK for variant calling and nQuire for statistical estimation of ploidy from allelic coverage support. I appreciated that the pipeline promotes some good habits - such as comparing k-mer spectra with the genome assembly (by KAT) or treatment of contamination (using Blobtools). Nearly all components of the pipeline are established tools, but the authors also propose karyon plots - diagnostic plots for quality control of assemblies.
  
  The most interesting and novel one I have seen is a plot of SNP density vs coverage. Such a plot might be helpful in identifying various changes to ploidy levels specific to a subset of chromosomes, as the authors demonstrated on the example of several fungal genomes (Mucorales). I attempted to run the pipeline and ran into several technical issues. The authors helped me overcome the major ones (documented here: https://github.com/Gabaldonlab/karyon/issues/1) and I managed to generate a karyon plot for the genome of a male hexapod with an X0 sex determination system. I did that because we know the karyotype well and I suspected the X chromosome would nicely pop up in the karyon plot.
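
  The data behind a SNP density vs coverage plot can be sketched as one (mean coverage, SNP count) pair per window. This is not Karyon's implementation: the function name, the 0-based positions, the per-base depth list and the toy values are assumptions, and the 1000-base default simply mirrors the window size discussed in this review.

```python
def window_stats(snp_positions, depth_by_position, scaffold_length, window=1000):
    """Return one (mean coverage, SNP count) tuple per window of a scaffold.

    snp_positions are 0-based; depth_by_position holds one depth per base.
    """
    n_windows = (scaffold_length + window - 1) // window
    snp_counts = [0] * n_windows
    depth_sums = [0] * n_windows
    for pos in snp_positions:
        snp_counts[pos // window] += 1
    for pos, depth in enumerate(depth_by_position):
        depth_sums[pos // window] += depth
    stats = []
    for w in range(n_windows):
        size = min(window, scaffold_length - w * window)  # last window may be short
        stats.append((depth_sums[w] / size, snp_counts[w]))
    return stats


# Toy scaffold: the first window at 30x with two SNPs, the second at 15x with one.
depth = [30] * 1000 + [15] * 1000
points = window_stats([10, 500, 1500], depth, scaffold_length=2000)
print(points)
```

  In a male of an X0 species, windows from the X chromosome would be expected at roughly half the autosomal coverage with very few heterozygous SNPs, which is the separation the reviewer expected to see.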
  
  To my surprise, although I know the scaffold coverages are very much bi-modal, I got only a single peak of coverages in the karyon plot and oddly somewhere in between the expected haploid and diploid coverages. I think it is possible I have messed up something, but I would like the authors to demonstrate the tool on a known genome with known karyotype. I would propose using a male of a species with an XY or X0 sex determination system. Although it's not aneuploidy sensu stricto, it is most likely the most common within-genome ploidy variation among metazoans. I would also suggest that the authors improve the modularity of the pipeline. On my request the authors added a lightweight installation for users interested in the diagnostic plots after the assembly step, but the inputs are expected in a specific, but undocumented format, which makes a modular use rather hard. At least the documentation of the formats should improve, but in general I think it could be made more friendly to folks interested only in some smaller bits (I am happy to provide the authors with the data I used).
  
  Although I quite enjoyed reading the manuscript and the manual afterwards, I do think there is a lot of space for improvement. One major point is that there is no formal description of the only truly innovative bit of this pipeline - the karyon plots. There is a nice philosophical overview, but the karyon plots are not explained in particular, which makes reading of the showcase study much harder. Perhaps a scheme showing the plot and annotating what is expected where would help. Furthermore, the authors did a likelihood analysis of ploidy using nQuire, but they did not talk about it at all in the Results section. I wonder, what's the fraction of the assembly the analysis found most likely to be aneuploid for the subset of strains that are suspected to be aneuploids? Is a 1000-base sliding window big enough to carry enough signal to produce reliable assignments? In my experience, windows of this size are hard to assign ploidy to, but I usually do such analyses using coverage, not SNP support.
  
  However, I would like to praise the authors for the fungal showcases; I do think they are nice genomics work, investigating and considering both biological and technical aspects appropriately. Finally, a somewhat smaller comment is that the introduction could be a bit more to the point. Some of the sections felt a bit out of place, perhaps even unnecessary (see minor comments below). More specific and minor comments are listed below. Kamil S. Jaron
  
  Minor manuscript comments: I gave this manuscript a lot of thought, so I would like to share with you what I have figured out. However, I recognise that the writing comments listed below are largely a matter of personal preference. I hope they will be useful for you, but it is nothing I would like to insist on as a reviewer.
  
  l56: An unnecessary book citation. It's not a primary source for that statement and if the reference was meant as "further reading", it is perhaps better to cite a recent review available online rather than a book.
  
  l65 - 66: Is the "lower error rate" still a true statement? I don't think it is; error rates of HiFi reads are similar or even lower compared to short reads (though I do agree there is still plenty of use for short reads).
  
  l68 - 72: I don't think you really need this confusing statement " which are mainly influenced by the number of different k-mers"; the problems of short read assembly are well explained below. However, I actually did not understand why the whole paragraph l76 - 88 was important. I would expect an introduction to cover approaches people have used until now to overcome problems of ploidy and heterozygosity in assemblies.
  
  l176 - 177: "Ploidy can be easily estimated with cytogenetic techniques" - I don't think this statement is universally true. There are many groups where cytogenetics is extremely hard (like notoriously difficult nematodes) or species that cannot be cultivated in the lab. For those it's much easier to do NGS analysis. You actually contradict this "easily" right in the next sentence.
  
  l191: the first author of nQuire is not Weib, but Weiß. The same typo is in the reference list.
  
  l222 - 223 and l69 - 70: what a k-mer is gets explained twice.
  
  l266 - 267: This statement or the list does not contain references to publications sequencing the original genomes. I am not sure, but when possible, it is good to credit original authors for the sequencing efforts.
  
  l302: REF instead of a reference.
  
  l303: What is "important fraction"?
  
  l304: How can you make such a conclusion? Did you try to remove the contamination and redo the assembly step? Did the assembly improve? Not sure if it's so important for the manuscript, but I would tone down this statement ("could be caused by" sounds more appropriate).
  
  l310: "B9738 is haploid" - are you talking about the genome or the assembly? How could you tell the difference between a homozygous diploid and a haploid genome? If there is a biological reason why homozygous diploid is unlikely, it should be mentioned.
  
  l342: How does fig 7 show 3% heterozygosity? How was the heterozygosity measured? Also, the karyon plot actually shows that the majority of the genome is extremely homozygous and all heterozygosity is in windows with spuriously high coverage. What do you think is the haploid / diploid sequencing coverage in this case?
  
  l343 - 345: I don't think these statements are appropriately justified. The analysis presented did not convincingly show the genome is triploid or heterozygous diploid.
  
  l350: I think citing SRA is rather unnecessary.
  
  l358: what "model"? How could one reproduce the analysis / where could the model be found?
  
  l378 - 379: Does Karyon analyse ploidy variation "during" the assembly process? Although the process is integrated in a streamlined pipeline, there are loads of approaches to detect karyotype changes in assemblies, from nQuire which is used by Karyon, through all the sex-chromosome analyses, such as https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002078.
  
  Method/manual comments:
  
  Scaffold length plots have no label on the x axis. As the plots are called distributions, I would expect frequency or probability on the y axis and the scaffold length on the x. Furthermore, plotting my own data resulted in a linear plot with a very overscaled y-axis. The "Scaffold versus coverage" plot does not have axis labels either. I would also call it scaffold length vs coverage instead. I also found the position of the illustrating picture in the manual a bit confusing (it should probably be before the header of the next plot).
  
  Variation vs. coverage is the main plot. It does look like a useful visualisation idea. Do I understand correctly that it's just numbers of SNPs vs coverage? I am confused, as I thought the SNP calling is done on the reference individual, and in the description you talk about homozygous variants too; what are those? Mismapped reads? Misassembled references?
  
  I also wonder about "3. Diffuse cloud across both X and Y axes.", I would naturally imagine that collapsed paralogs would have a similar pattern to the plot that was shown as an example - a smear towards both higher coverage and SNP density. I guess this is a more general comment, would you expect any different signature of collapsed paralogs and higher ploidy levels? Should not paralogy be more explicitly considered as a factor?
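  To make the reviewer's reading of the "variation vs. coverage" plot concrete, here is a minimal sketch of how such points could be derived, assuming the plot is simply SNP count against mean coverage per fixed-size window. The function name, the window size and the toy data are illustrative assumptions, not Karyon's actual implementation.

```python
from collections import defaultdict

def window_stats(depth, snp_positions, window=1000):
    """Return (mean coverage, SNP count) for each fixed-size window of one scaffold.

    depth: list of per-base read depths; snp_positions: 0-based SNP coordinates.
    """
    snps_per_window = defaultdict(int)
    for pos in snp_positions:
        snps_per_window[pos // window] += 1
    stats = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_cov = sum(chunk) / len(chunk)
        stats.append((mean_cov, snps_per_window[start // window]))
    return stats

# Toy scaffold (hypothetical): 2 kb at 30x with no SNPs, then 1 kb at 60x with
# several SNPs, the kind of window a collapsed paralog might produce.
depth = [30] * 2000 + [60] * 1000
snps = [2010, 2100, 2250, 2400, 2900]
points = window_stats(depth, snps)
```

  In this toy case the third window moves towards both higher coverage and higher SNP density, which is the smear the reviewer expects from collapsed paralogs; coverage and SNP density alone would not separate that from a local increase in ploidy.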
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Recent tec
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows: Reviewer name: Michael F. Seidl
  
  The technical note 'Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other non-standard architectures in genome assemblies' by Naranjo-Ortiz and colleagues reports on the development and application of the Karyon framework. Karyon is a Python-based toolkit that utilizes several software tools developed by the authors and/or others with the overall aim to assess sequencing data and genome assemblies for potential assembly artefacts caused by a plethora of different features intrinsic to the analyzed species/strain. Karyon is publicly available from GitHub and as a Docker image.
  
  Genome assemblies are nowadays important tools to develop novel biological hypotheses. However, genome assemblies are often not ideal, i.e., they are highly fragmented and/or incomplete, which can significantly hamper their full exploitation. The genome assembly quality is impacted by different biological factors that can be, at least partially, discovered directly based on the raw sequencing data and from the genome assembly (e.g., allele frequency, k-mer profiles, coverage depth, etc.). There are already plenty of established computational tools available to perform these types of analyses (to name a few: KAT, GenomeScope, nQuire).
  
  Karyon will ease these analyses by providing a single computational framework that combines different and complex software tools and generates diagnostic figures to support biological interpretation. Karyon thus represents a valuable contribution to the scientific community. The Karyon toolkit is built around established software tools and the overall methodology is sound and suitable to assess genome qualities. The interpretation of the results of Karyon is left to the user, which still necessitates expert knowledge to correctly interpret signals.
  
  While examples are provided in the manual, the level of experience required will likely hamper the full exploitation of the pipeline by non-expert users. Furthermore, it can be anticipated that expert users already employ the separate software to study genome complexities, and thus might not be in full need of Karyon. Obviously, this is inherent to the problem at hand and cannot be easily addressed by the authors. However, I would like to encourage the authors to further improve the manual and the examples to guide the data interpretation, with the aim of making this software accessible to as many researchers as possible.
  
  I nevertheless also have some comments related to the data presented in the manuscript that the authors need to address. First, the introduction finishes by asserting that different biological factors are expected to impact published genome assemblies. Furthermore, the manuscript mentions that the quality of fungal genomes is often sub-optimal. However, no evidence for these statements is provided. To strengthen this point and to further highlight the urgency of methods to discover and ultimately address these problems, the authors need to provide a more systematic analysis, based on publicly available genome assemblies, of the occurrence of compromised genome assemblies. For example, a random subset of genome sequences for different eukaryotic phyla and / or classes, and a more systematic sampling throughout the fungi, would
  
  i) significantly substantiate the manuscript's message and
  
  ii) confirm the applicability of the authors' framework to most eukaryotes and not only to specific fungal groups (Mucorales).
  
  Second, the table mentions the diagnosis derived from Karyon but simply mentions 'unknown' for most entries. Based on the manuscript it seems that these are supposedly haploid with very little heterozygosity (L279), but table 1 nevertheless reports for most species/strains strikingly different genome size estimates between the original and the Karyon-derived genome assemblies (Karyon is consistently smaller). The authors need to explain in much more depth the nature of these differences for the reported genomes. For instance, it could be that publicly deposited assemblies have been generated by a combination of different sequencing libraries and technologies that are not fully exploited by Karyon. Third, one additional measure often applied to assess genome quality is genome completeness, as for instance assayed by BUSCO. Karyon should include a strategy such as BUSCO to
  
  i) assess the occurrence of marker genes in the genome assemblies and
  
  ii) the duplication level of these genes as this might reveal un-collapsed alleles etc. Especially the latter is important to interpret genome size differences between original and Karyon-derived genome assemblies.
  
  Further detailed comments and suggestions to improve the manuscript: L21: could the authors please specify what 'groups' they refer to? L22: there seems to be an extra space. L59: could the authors please specify what they mean by a 'poor assembly'? What is poor in terms of genome assembly? Contiguity or completeness, or unresolved haplotypes, or …, or a combination thereof? L63-: the authors only once refer explicitly to Fig 1 in this section. The manuscript would be clearer if they referred to specific panels as they describe factors impacting genome assembly quality. L66: could the authors please further substantiate their notion that most publicly available genome assemblies are formed from short-read sequencing data? This information should be readily available at NCBI and/or GOLD.
  
  L119: the manuscript mentions pan-genomics, but the relevance of aneuploidy in these studies is not explained. The manuscript should provide a brief explanation of the importance of aneuploidy (or any form of ploidy shift) for pan-genomics. L147: 'From' -> 'from'. L148: 'Symbiotic' -> 'symbiotic'. L232: the reference to nQuire should read Weiß et al. 2018. L302: the reference to blobtools is missing. L349: To initiate the pipeline, was a single sequencing library or a combination of multiple libraries used? Table 1: The table formatting, at least in the combined pdf, seems to be broken.
3. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Zhong Wang
  
  In this work, Naranjo-Ortiz et al. presented a software pipeline that is capable of de novo genome assembly, variant calling, and generating diagnostic plots. Applying this software to 35 publicly available, highly fragmented fungal genome assemblies revealed prevalent inconsistencies between the sequencing data and the assembly. I really appreciate the authors' effort to make their software, Karyon, easy to use by providing multiple ways to install it and a detailed software manual. I especially like the detailed explanation of how to use the diagnostic plots to infer the "nonstandard genome architectures".
  
  The manuscript is clearly written and very easy to follow. I have the following general comments:
  
  The relationship between the raw sequencing data and the assembly wasn't clear to me -- do they belong to the same isolate? If so, then the inconsistencies may reflect assembly errors in the fungal genome assembly. Have the authors ruled out this possibility? The fact that these genomes are highly fragmented suggests they likely contain many errors. If they were from different isolates, then I agree with the authors that the diagnostic plots could be examined carefully to detect structural variations. For that, have the authors used any alternative method to validate at least some of their findings? To establish the validity of their approach, it would be more convincing to obtain the same findings using independent approaches, including experimental ones.
  
  Given the raw WGS reads and assembled genome, another software, QUAST (http://quast.sourceforge.net/), automatically detects assembly errors and structural variations. It would be interesting to see a comparison between the findings via Karyon and via QUAST.
  
  3. This is an optional suggestion, as I realize it may not be easy to implement. The biggest limitation of Karyon is that it does not automatically detect these unusual genome organizations. This may be possible by comparing the de novo assemblies produced by Karyon to the reference genomes. At least such possibilities should be discussed.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.05.23.445324v1
www.biorxiv.org

A curated human cellular microRNAome based on 196 primary cell types

3
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Background
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Kevin Peterson
  
  Patil et al. report the expression profiles of miRNAs and a vast array of other, non-miRNA sequences, across 196 cell types based on thousands of publicly available data sets. Although this will be an outstanding contribution to the miRNA field, I am recommending rejection for now with the strong encouragement to resubmit once the authors have addressed my concerns. To be frank, the authors have two diametrically opposed research agendas here:
  
  1) What are the expression profiles of bona fide miRNAs (as determined by MirGeneDB); and
  
  2) What else might be expressed in human cell types that could be of interest to small RNA workers and clinicians? Because of these opposed goals, the paper is not only confusing to read and process, but it gives fodder to the numerous paper-mill products that continue to identify non-miRNAs as diagnostic, prognostic, and even mechanistic indicators into virtually every human malady under the sun. Let me try and highlight why the use of only MirGeneDB (MGDB) would be highly useful for this paper.
  
  1) miRBase (MB) is not consistent in its identification of both arms of a bona fide hairpin, resulting in the authors not reporting star reads for highly expressed miRNAs such as mir-206 and mir-184.
  
  Further, there are numerous examples where the authors do in fact report a "mature" versus "star" read without both arms annotated in MB with some included in MGDB (e.g., Mir-944, Mir-3909) and others not (e.g., mir-3615) raising the question of how these data were annotated.
  
  2) The authors write that, "the majority (46%) of the reads are mature miRNAs." But MB makes no attempt to distinguish mature from star arms. Hence, if they are annotating to MB, they cannot distinguish between these two processing products. This is not only confusing, but also very unfortunate, as one cannot get a sense of the expression of evolutionarily intended gene products versus processing products.
  
  3) The authors report on the use of 5p versus 3p strand dominance, but have no examples of "codominant" miRNAs (Fig. 1C) when, in fact, there are numerous examples in their data including Mir-324, Mir-300, Mir-339, Mir-361 etc. with some switching arms depending on the variant. All of this is available at MGDB; none at MB. 3) MB does not allow the identification of loop or offset reads separate from the arm reads, which would allow the authors to accurately report the amount of reads derived from the "hairpin" versus the arms (and how the authors reported this in Fig. 1B is not at all clear given that these sequences are not annotated as such at MB).
  
  4) The authors bias their genic origins of small RNA reads by filtering first using MB, and then identifying remaining reads as arising from other sources including tRNAs, rRNAs, mRNAs etc. However, numerous "miRNAs" in MB arise from these genic sources, including mir-484 (mRNA) and mir-3648 (rRNA). So if I understand the authors' pipeline, these sequences are mistakenly included in the "mature miRNA" column.
  
  5) The use of MGDB would allow the user to see the saturation of mature reads across the different cell types in Fig. 1E, and, if mature is distinguished from star, then one could also see the (near)-saturation of star reads as well. As it stands, their plot simply highlights the non-genic nature of much of MB. Further, because MGDB identifies the age of each miRNA, if the authors were interested, they could also test a long-standing pattern that evolutionarily older miRNAs are expressed at higher levels than younger miRNAs relative to specific cell types.
  
  6) The authors report the expression profiles of bona fide miRNAs in Figs 3 and 5, but report the expression profiles of non-miRNAs in Fig. 4. These include mir-3150b, mir-4298, mir-569, mir-934, mir-302f, and mir-663b. None of these supposed miRNAs have the requisite reads for miRNA annotation, and all but mir-3150b fail a structural examination as well. In fact, MGDB has no reads (which includes numerous data sets from the Halushka lab) for mir-302f, mir-4298, and mir-569, and only a few reads from one "arm" for mir-663b and mir-3150, highlighting the need to examine these supposed reads in detail. The inclusion of obvious non-miRNAs here is confusing and needlessly undermines the authors' study and conclusions.
  
  So, my strong recommendation is to potentially write two papers. The first (this one here) focuses only on the expression of miRNAs, emphasizing really interesting results (like what they report in Fig. 5), and providing to the miRNA field a robust cell-type expression profile for humans. This would eliminate the need for read/rpm cutoffs as they are simply reporting the read profiles for what is in MGDB. This would not hamper their attempts to include these data at UCSC as MGDB includes links to both MB as well as UCSC, and indeed, why report "miRNA" read data to a genome browser for well over a thousand non-miRNAs? This simply will lend credence to all of these non-miRNAs that already clutter the literature. A second paper could focus on potentially interesting or relevant small RNAs that show interesting patterns of expression in normal and/or diseased tissues, highlighting the structural and expression profiles of these genic elements, and possibly trying to identify what they might be (including potential false negatives in MGDB). As Corey and colleagues (2022, NAR) recently stressed, we as a field must focus on mechanism, as the identification of a "biomarker" in and of itself is of no real value if we don't understand what it is or where it comes from.
  
  Minor comments:
  
  1) The seed sequence is 7-8 nt in length, not 6 nt.
  
  2) miRNA reads - both mature and star - have a mean length of 22 nt, and no miRNA is less than 20 nt long (5p: median = 22, mean = 22.56, SD = 0.94, range = 20-27; 3p: median = 22, mean = 22.11, SD = 0.57, range = 20-26; all data from MGDB).
  
  3) It's misleading to write that miRNAs "block protein translation." Please rewrite.
  
  4) I don't believe our understanding of the expression profile of miRNAs is hampered by the numerical naming scheme. MB's nomenclature system obscures the evolution of miRNAs by erecting both paraphyletic (e.g., MIR-8, which includes mir-141) and polyphyletic groups. Why would distinct monophyletic families like MIR-142, MIR-143 and MIR-144 create confusion regarding their expression?
  
  5) The use of the term "leading strand" is confusing given its clear association with DNA replication (and not a term I've heard of associated with miRNAs).
  
  6) Please give cut-offs for things like "infrequent", "frequent" etc.
  
  7) I was surprised at the lack of co-expression for Burge's co-targeting miRNAs, especially in the brain. I think it would be worthwhile to examine more carefully these miRNAs and discuss in a bit of detail why they don't appear together in Fig. 2A.
  
  8) Fig. 6 should be moved to the supplemental figures as this is not readable and of no real value.
  
  9) The authors might want to reference Lu et al. (2005) for Mir-1 expression in the colon, as this is one of the obvious down-regulated miRNAs in diseased colon tissues.
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Ian MacRae
  
  In this study, Patil and co-workers have combined the largest set of publicly available small RNA-seq datasets to provide a comprehensive analysis of cell-type-specific miRNA distributions. Moreover, the authors made their results easily accessible to the public via Bioconductor and UCSC genome browser. This deeply curated resource is a valuable asset to biomedical research and will help researchers better understand and utilize the otherwise overwhelming number of small RNA-seq datasets currently available.
  
  Here are some minor points for the authors to address:
  
  1. In the background section, the first sentence reads, "microRNAs (miRNAs) are short, ~18-21 bp, critical regulatory elements that block protein translation". Mature miRNA is single-stranded, so it would be more appropriate to use 'nt' (nucleotides) instead of 'bp' (base pairs) to describe miRNA length. Additionally, many mature miRNAs have a length of 22 or 23 nt. Finally, "block protein translation" is not quite right, as mammalian miRNAs are believed to primarily function by promoting the degradation of targeted mRNAs.
  
  2. In Fig. 1C, is the "co-dominant" category bar missing? The 5p and 3p bars do not sum to 100%.
  
  3. In Fig. 1D and 1E, the y-axis label "Unique miRNA count" is misleading/confusing. Would a more appropriate label be "Unique miRNA species"?
  
  4. In the "DESeq2 VST provided superior normalization" section, the authors mentioned that "An HTML interactive UMAP with cell type information is available in the GitHub repository (https://github.com/mhalushka/miOme2/UMAP/Figures)." However, the provided link is not accessible.
3. GigaScience 17 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1 Reviewer name: Qinghua Cui
  
  In this paper, the authors reported a curated human cellular microRNAome based on 196 primary cell types. This could be a valuable resource. The following comments could improve this study.
  
  Euclidean distance may not be a good metric for clustering analysis. I wonder what the results would be when using other metrics, e.g., Spearman's correlation.
  
  More analyses are suggested, such as cell-specific miRNAs, functional set analysis, etc.
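  The reviewer's point about distance metrics can be illustrated with a small sketch. It compares Euclidean distance with a rank-based distance (one minus Spearman's correlation) on three made-up expression profiles; the profile values and function names are hypothetical and are not taken from the manuscript's data.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_distance(x, y):
    """1 - Spearman's rho: 0 for identical rank order, 2 for reversed order."""
    return 1.0 - pearson(ranks(x), ranks(y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression values for five miRNAs in three samples.
cell_a = [10, 200, 35, 5, 80]
cell_b = [100, 2000, 350, 50, 800]  # same relative profile as cell_a, 10x scale
cell_c = [80, 5, 35, 200, 10]       # similar scale to cell_a, reversed profile
```

  On these toy values Euclidean distance places cell_a closer to cell_c, while the Spearman-based distance places it closer to cell_b. How much this matters for the authors' data depends on the normalization applied before clustering.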

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.05.16.492160v2
www.biorxiv.org

Association Mapping Across a Multitude of Traits Collected in Diverse Environments Identifies Pleiotropic Loci in Maize

3
1. GigaScience 17 Feb 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Reviewer name: Roberto Pilu
  
  The manuscript "Association Mapping Across a Multitude of Traits Collected in Diverse Environments in Maize" by Ravi V. Mural et al. reported the application of high-density genetic marker data from two partially overlapping maize association panels, comprising 1,014 unique genotypes grown in seven US states, allowing the identification of 2,154 suggestive marker-trait associations and 697 confident associations and suggesting the possible application to study gene functions, pleiotropic effects of natural genetic variants and genotype by environment interaction.
  
  The background data are well documented; the experimental data are convincing, clearly presented and well discussed. The paper is suitable for publication in GigaScience in its present form.
2. GigaScience 17 Feb 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Reviewer name: Yingjie Xiao.
  
  The authors described a study integrating multiple published datasets for reanalysis. They combined previous community panel data and newly collected data in the present study, finally assembling 1014 accessions with 18M SNP markers and 162 traits in different environments. They used a resampling-based GWAS method to reanalyze this assembled dataset, and identified 2154 suggestive associations and 697 confident associations. They found genetic loci that were pleiotropic across multiple traits.
  
  As the authors mentioned, I acknowledge their efforts in collecting and assembling previous datasets from different sources, which should be useful for the maize community. However, regarding the manuscript per se, I feel the novelty and significance of the reported findings are not sufficiently demonstrated. The paper would be stronger if the authors could present novel results that previous studies could not obtain because of their limitations in population size, diversity, trait dimensions and environments. The authors seem to be trying to present the study this way, but this could be improved further.
  
  It is hard for me to see which novel findings were made possible by the merged large dataset. Moreover, I am not clear on what scientific questions the authors want to address with this assembled dataset. On the technical side, I wonder how the authors dealt with batch effects when merging phenotype datasets from different environments. Phenotypes of different accessions collected in different environments are not directly comparable, so it is hard to tell whether a phenotypic difference is caused by genotype, environment, or their interaction.
  
  The introduction section lacks a proper review of the project background, related progress, publications and findings.
3. GigaScience 17 Feb 2023
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Reviewer name: Yu Li
  
  Reviewer Comments to Author: Mural et al. reported a large-scale association analysis based on publicly published genotype and phenotype datasets and a meta-GWAS. This study provides a good example for mining community association panel data and further identifying candidate genes, pleiotropic loci and G x E. Actually, meta-analysis of GWAS has been used in humans and animals. However, I have some major concerns as follows.
  
  This study only used three association panels (MAP, SAM, and WiDiv). As far as I know, publicly available genotype and phenotype data could be obtained for other association panels, for example the association panel of 368 inbred lines (Li et al., 2013, Nat Genetics, 45(1):43-50. doi: 10.1038/ng.2484), which has been widely used in GWAS studies in maize. Could other association panels be integrated into this research? This would provide a rich genetic resource for maize research groups.
  
  For association analysis, a total of 1014 unique inbred lines and 162 distinct traits from different association panels were used, but these traits were not measured for each of the 1014 inbreds. For example, cellular-related traits were mainly measured in the SAM association panel. Hence, was the association analysis for cellular-related traits conducted in SAM or in all 1014 inbreds? If 1014 inbreds were used to perform association analysis for cellular-related traits, how did you analyze the phenotype data? Please describe the method of phenotype data analysis in the Method section.
  
  The authors used RMIP values to identify significant association signals; please add more details about the RMIP method. What are the advantages of the resampling-based genome-wide association strategy over other methods?
  
  Although some important functional genes could be identified, were any of the new candidate genes obtained in this study functionally verified by mutant or overexpression experiments?
  
  The authors identified pleiotropic loci based on categories of phenotypes associated with the same peak. For example, the phenotypes associated with the pleiotropic peak on chromosome 8 from 134,706,389 to 134,759,977 bp belong to the Flowering Time, Root and Vegetative categories, thus the locus was associated with different traits. Do you have any ideas on pleiotropic genes based on the results?
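  Since the reviewer asks for more detail on RMIP, a minimal sketch of the general idea may help: a resample model inclusion probability is the fraction of resampled GWAS runs in which a marker is selected. The function names, the 90% subsampling fraction, the 100 runs and the toy "GWAS" stand-in below are illustrative assumptions and do not describe the authors' exact procedure.

```python
import random
from collections import Counter

def rmip(sample_ids, select_markers, n_resamples=100, fraction=0.9, seed=0):
    """Fraction of resampled runs in which each marker is selected.

    select_markers is a stand-in for one GWAS run: it takes a list of
    sample IDs and returns the set of markers called significant.
    """
    rng = random.Random(seed)
    k = int(len(sample_ids) * fraction)
    counts = Counter()
    for _ in range(n_resamples):
        subset = rng.sample(sample_ids, k)  # subsample without replacement
        counts.update(select_markers(subset))
    return {marker: c / n_resamples for marker, c in counts.items()}

# Toy stand-in for a GWAS run (hypothetical): "snp_stable" is always found,
# "snp_fragile" only when one particular line is in the subsample.
def toy_gwas(subset):
    hits = {"snp_stable"}
    if "line_007" in subset:
        hits.add("snp_fragile")
    return hits

lines = [f"line_{i:03d}" for i in range(100)]
scores = rmip(lines, toy_gwas)
```

  A marker whose signal depends on a few individual lines gets a lower score than one detected in every subsample, which is the sense in which a resampling-based threshold is less sensitive to individual accessions than a single-run p-value cutoff.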

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.02.25.480753v1
www.biorxiv.org

Spacemake: processing and analysis of large-scale spatial transcriptomics data

4
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  at the same time
  
  Reviewer name: Ruben Dries (revision 1)
  
  The authors responded adequately to my original concerns and have adjusted their manuscript accordingly. I have no further questions or comments.
  
  Declaration of Competing Interests: I declare that I have no competing interests.
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  State
  
  Reviewer name: Ruben Dries
  
  In this article, the authors created a modular and scalable pipeline to process raw sequencing data from spatially resolved transcriptomic technologies. In contrast to other popular genomics technologies, such as (single-cell) RNA sequencing, there are virtually no existing public tools that allow users to quickly and efficiently process the raw spatial transcriptomic sequencing data that are generated through Illumina sequencing. This is largely due to the fact that each spatial transcriptomic workflow creates its own unique spatially barcoded reads and thus typically requires technology-specific tools or scripts to extract both the barcode and gene expression information. Here the authors created Spacemake which consists of multiple modules that are tied together using the popular workflow management system Snakemake. The innovative part of Spacemake comes from the creation of specific 'sample variables', such as the barcode-flavor, run-mode and puck, which allows them to create a flexible pipeline that in theory can be adapted to any type of spatial array-based sequencing technology. The authors use well-established tools for downstream quality control and data processing and provide useful additional modules to assess or improve spatial data quality. Finally, Spacemake is also directly linked to Squidpy for downstream analysis and creates a web-based report, which could certainly help to lower initial spatial data analysis barriers. Overall, the presentation of the tool and the methods used in the pipeline as described in their contents are comprehensive and the user manual is easy to understand. We appreciate the efforts to provide this tool to the spatial transcriptomics community and to make it open-source and flexible. However, we do have some suggestions and concerns regarding the manuscript and/or use of this tool. Major comments: 1. 
We managed to install the spacemake software on a Linux-based server but failed to install it on a macOS machine due to a compatibility issue with bcl2fastq2. Unfortunately, we also ran into an issue on our Linux server, which occurred during one of the reading steps from "/dev/stdin" in the middle of the spacemake workflow. More specifically, we encountered the following error: Job error: Job 7, TagReadWithGeneFunction Error message: [E::idx_find_and_load] Could not retrieve index file for '/dev/stdin'. Even with the help of our IT team we were unable to resolve this issue. To help with troubleshooting, it would be useful if the authors could provide the exact commands for the examples in the manuscript and show the expected output of each job in the Snakemake pipeline. As a result, we were unable to re-run any of the provided examples, which severely limited our reviewing options. 2. A major drawback of Spacemake is that it currently does not offer solutions for the integration of imaging information, which is typically an essential step in any spatial sequencing workflow. The authors do note this shortcoming in their discussion and, as a potential solution, they argue that Spacemake can be used with another tool called Optocoder, which is currently being developed in their lab. However, no information about Optocoder can be found anywhere. We found no bioRxiv or GitHub page in our search, and as such we were unable to test or assess this solution. At minimum, the authors should provide general guidelines on how users could integrate images with the spatial downstream results. Minor comments: 1. The figure labels and legends are not always clear. More specifically, it is sometimes hard to figure out which samples are being used for each figure or panel. This could be resolved simply by writing more informative legends that specifically state which sample was used to create each figure panel. 
According to the text, Seq-Scope was used to generate Figure 3; however, the legend of Figure 3 says Slide-seq … 2. Overall, the figures are pretty and informative; however, I would suggest starting with a general overview figure that highlights the spacemake pipeline and its innovative framework. Given the goal and content of the manuscript, this seems appropriate as a main figure. 3. The Drop-seq tools that Spacemake requires in order to initialize a project lack any introduction. Please provide a brief introduction and a link to the associated GitHub page to improve this step. 4. When configuring a spacemake project by adding a sample species, the pipeline does not allow compressed versions of genome files. This could easily be fixed, which would allow users to link directly to their, typically compressed, genome files. 5. More information is needed about the R1 and R2 arguments in the add sample function. For example, Seq-Scope has two separate libraries to be sequenced. Where each round of libraries should be loaded is not immediately clear from the tutorial the authors provided. 6. The downsampling and NovoSparc modules together might create an opportunity to identify the relative error that is introduced when NovoSparc is used to enhance spatial expression patterns, although this might be outside the scope of this paper. 7. As mentioned in the Major comments section, we were unable to successfully run an example script, but it would be of great interest to the large spatial community if this pipeline could easily be used with other downstream analysis tools, such as Giotto, Seurat, Bioconductor (SpatialExperiment class), etc. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. 
Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Spatial
  
  Reviewer name: Qianqian Song (revision 1)
  
  The revised version mostly addressed my concerns. Hopefully this tool can be widely used with the emerging spatial transcriptomics data. I declare that I have no competing interests.
4. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Qianqian Song
  
  This manuscript proposes a Python-based framework named spacemake to process and analyze spatial transcriptomics datasets. It offers functionalities including sample merging, saturation analysis and analysis of long reads as separate modules. Overall, this tool holds promise for spatial analysis, though the manuscript lacks details and explanations of methods and results. Specifically, I have the following concerns regarding this manuscript. 1) As shown in Table 1, it is noticeable that spacemake doesn't include H&E integration, which is more or less necessary for spatial data. I would recommend that the authors at least discuss the potential for including H&E images. 2) From the legend of Fig 2B, I didn't find the plot with Shannon entropy; please double check. 3) I don't understand the meaning of Fig 2D. The authors should explain how they calculate the Shannon entropy and string compression length of the sequenced barcodes, as well as how they define the expected theoretical distributions. More details are needed here. Though the authors mentioned that related details would be in the Methods (last line in the QC section), I didn't find any there. 4) In Fig 4A, the authors show the mapped scRNA-seq of mouse cortical layers. I think a complementary spatial plot with annotations is necessary, as there is a gap between Fig 4A and Fig 4B. 5) Fig 5C lacks annotations for the different colors. 6) On page 16, the authors cite a manuscript in preparation, which is not good practice. I suggest removing the citation. 7) Supplementary Fig 1 would be better placed as Fig 1, so that it shows the overall flow and functionality of spacemake. 8) Based on Supplementary Fig 1, the authors should add a section illustrating how they annotate the spatial data and the gene markers involved. 9) The paragraph "Spacemake can readily merge resequenced samples" lacks detailed explanation and results. 10) Though spacemake claims to be fast in processing data, Supplementary Fig 5 doesn't fully support that. Meanwhile, the authors should explain what the different colors represent. 11) In Supplementary Fig 2, the authors show very high correlation between spacemake and spaceranger, especially in the exon intron and exon sub-figures. It looks like the correlation is close to 1. I suggest the authors double check the results and explain their correlation analysis. I declare that I have no competing interests.
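  
  Comment 3 of this review asks how Shannon entropy and string compression length of barcodes are calculated. The sketch below shows one common way to compute both quantities for a single barcode. It is only an illustration of the two measures under assumed definitions (per-base nucleotide entropy, zlib-compressed byte length); the function names are hypothetical and the review does not state which definitions spacemake actually uses.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(barcode: str) -> float:
    """Shannon entropy (bits per base) of the nucleotide frequencies in one barcode."""
    counts = Counter(barcode)
    n = len(barcode)
    # Negating inside the sum avoids returning -0.0 for homopolymers.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def compression_length(barcode: str) -> int:
    """Length in bytes of the zlib-compressed barcode (assumed measure)."""
    return len(zlib.compress(barcode.encode("ascii")))


# Low-complexity barcodes score low on both measures.
for bc in ["AAAAAAAAAAAA", "ACGTACGTACGT", "ATCGGATTCAGC"]:
    print(bc, round(shannon_entropy(bc), 3), compression_length(bc))
```

  With four equally frequent bases the entropy reaches its maximum of 2 bits, while a homopolymer gives 0, so a distribution of these values over all barcodes can flag low-complexity sequences.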

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.11.07.467598v1
www.biorxiv.org www.biorxiv.org

Loop detection using Hi-C data with HiCExplorer

4
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  compute
  
  Reviewer name: Aleksandra Pakowska (revision 2)
  
  Thank you for the feedback and for including more analyses. Figure S5 is hard to read (it is unclear where the loops are), and in Figure S6 HiCExplorer in fact looks worse than HiCCUPS. Both tools have issues at noisy loci but seem to be calling the most relevant interactions. The authors decided not to address the issue of pixel merging and its impact on the analysis, which might have helped to understand the discrepancies between tools. Given that almost half of the loops detected by HiCExplorer are not detected by HiCCUPS, it would be interesting to check what these loops connect: convergent CTCF sites, or cis-regulatory elements to each other? This point could be addressed either in this or in another study. I declare that I have no competing interests.
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  are
  
  Reviewer name: Feng Yue (revision 1)
  
  My main concern for the revised manuscript is the additional benchmarking the authors performed with Fit-Hi-C and Peakachu. Since Fit-Hi-C is one of the first algorithms for Hi-C loop prediction (published in 2014) and Peakachu is the only method that uses a supervised machine learning approach for this purpose, I suggested that these two software tools should be recognized. If the authors can perform a fair benchmarking and find out where the differences come from, the results would be really interesting. The authors decided to test the aforementioned methods during the revision. Unfortunately, I believe there were some errors during the testing. For Peakachu: 1. Most importantly, the authors used the wrong form of normalized Hi-C files for Peakachu. The Peakachu model was trained on, and should be used with, ICE-normalized Hi-C matrices. However, based on page 8 in the supplementary file, the input file is gm12878_KR.cool. The data ranges for ICE and KR normalization are very different; therefore, a model trained on ICE-normalized files will not work with KR-normalized input and the predictions will be wrong. Consequently, all the following evaluations and descriptions of the Peakachu predictions are not accurate and need to be revised (such as Fig. 4, Table S1 ...). 2. In the response letter, there is another misunderstanding about merging. Because Fit-Hi-C predicted too many contacts, the authors of Peakachu merged "the top 140,000 interactions into 14,876 loops (Fig. 3a, b), with the same pooling algorithm used by Peakachu." The reason is that if multiple contiguous bins on a Hi-C map are all predicted as loops, the merging/filtering step will use the bin with the most significant P-value as the chromatin loop (local minimum). As the authors noted, Fit-Hi-C by default will generate "significant contacts in the 100,000-ends." Therefore, this merging/filtering step is necessary if we want to compare the loops predicted by each method. This is also what the authors did in this manuscript - I am quoting their own writing here, "This filtering step is necessary to address the candidate peak value as a singular outlier within the neighborhood." Therefore, I do not understand why the authors are "irritated" by such an approach. 3. The authors of Peakachu have released their predictions for 56 Hi-C datasets on their 3D Genome Browser website (http://3dgenome.fsm.northwestern.edu/publications.html), including the ones used in this manuscript. They used models trained at different sequencing depths for different datasets. Therefore, I would suggest that the authors use this dataset for a fair evaluation. Regarding Fit-Hi-C, what is the number of peaks before and after filtering? The authors also need to provide the loop locations so that reviewers can evaluate their claim independently. This information is critical. The following manuscript might be helpful for the authors to evaluate Fit-Hi-C (Arya Kaul et al. Nature Protocols 2020). Finally, the authors need to provide all the predicted chromatin loops in the cell lines, as well as the loops predicted by the other software used in this manuscript, as supplementary materials (loops in Supplementary Table 1). I declare that I have no competing interests.
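  
  The merging/filtering step described in point 2 of this review (keep, among touching candidate bins, the one with the most significant P-value) can be sketched as follows. This is a generic flood-fill clustering written for illustration only; it is not the pooling algorithm of Peakachu, Fit-Hi-C or HiCExplorer, and the function name and the 8-neighbourhood adjacency rule are assumptions.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # (row bin, column bin) in the contact matrix


def merge_adjacent_pixels(pvalues: Dict[Pixel, float]) -> List[Pixel]:
    """Group touching candidate pixels and keep the most significant one per group."""
    unvisited = set(pvalues)
    representatives = []
    while unvisited:
        # Flood-fill one cluster of pixels that touch (8-neighbourhood, assumed).
        start = unvisited.pop()
        cluster, stack = [start], [start]
        while stack:
            i, j = stack.pop()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    neighbour = (i + di, j + dj)
                    if neighbour in unvisited:
                        unvisited.remove(neighbour)
                        cluster.append(neighbour)
                        stack.append(neighbour)
        # The cluster is represented by its smallest p-value (local minimum).
        representatives.append(min(cluster, key=lambda p: pvalues[p]))
    return sorted(representatives)


# Three touching pixels collapse to one loop; the isolated pixel is kept as is.
candidates = {(10, 20): 1e-4, (10, 21): 1e-9, (11, 21): 1e-6, (40, 55): 1e-5}
print(merge_adjacent_pixels(candidates))
```

  Counting loops before and after such a step is what makes the outputs of tools that report every significant contact comparable with tools that report one pixel per loop.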
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Chromatin
  
  Reviewer name: Feng Yue
  
  This paper provides a loop detection method using a continuous negative binomial function combined with a donut approach. To test the performance of this method, the authors used the in-situ Hi-C data of Rao et al. 2014 in the GM12878, K562, IMR90, HUVEC, KBM7, NHEK and HMEC cell lines. The method showed results comparable with HiCCUPS and cooltools and better outputs than HOMER and chromosight. The significant advantage is the utilization of modern computational resources. The following are my comments: 1. The authors claimed advantages in utilizing computational resources. They need to clarify how their algorithm contributes to this advantage. 2. It will be helpful for users to know the performance of the software at various sequencing depths, which can be assessed by down-sampling the high-resolution datasets. 3. The authors need to compare (or at least discuss) Fit-Hi-C and Peakachu. A table showing the strengths and limitations of each method will be helpful. To be honest, I don't think any method is clearly better than the others. They are just different approaches. 4. It is better to use other types of orthogonal data, such as HiChIP and ChIA-PET, to evaluate the loops called by these methods. There are H3K27ac HiChIP, SMC1 HiChIP, CTCF ChIA-PET and RAD21 ChIA-PET data in GM12878. 5. Just a minor suggestion. There are a lot of tables in the manuscript, which makes it hard for readers to compare. It might be better to use figures instead. I declare that I have no competing interests.
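  
  The down-sampling suggested in comment 2 of this review can be sketched with binomial thinning, where every read pair in the raw contact matrix is kept independently with a fixed probability. The sketch assumes a small, symmetric matrix of integer raw counts held as nested lists; the function name is hypothetical, and real Hi-C matrices would need a sparse representation and a vectorised sampler.

```python
import random
from typing import List


def downsample_contacts(matrix: List[List[int]], fraction: float, seed: int = 0) -> List[List[int]]:
    """Keep each contact independently with probability `fraction` (binomial thinning)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(matrix)
    thinned = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # thin the upper triangle only, then mirror it
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            thinned[i][j] = thinned[j][i] = kept
    return thinned


raw = [[10, 4, 0], [4, 8, 2], [0, 2, 6]]
half_depth = downsample_contacts(raw, 0.5)
print(half_depth)
```

  Running a loop caller on matrices thinned to, say, 50%, 25% and 10% of the original depth would show how quickly the number and overlap of detected loops degrade.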
4. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Borbala Mifsud
  
  Wolff et al. present the Python version of HiCExplorer for loop detection. The algorithm is included in the Galaxy HiCExplorer webserver (Wolff et al. 2020), although the publication about the webserver did not describe the algorithm in detail. HiCExplorer uses the same donut approach as HiCCUPS (Rao et al. 2014) with a few notable differences. HiCExplorer selects candidate peaks based on the significance of the distance-corrected observed/expected ratio using a negative binomial model, and compares the peak's enrichment to that of its neighbourhood using a Wilcoxon rank-sum test. The method is appropriate for chromatin loop identification and it performs similarly to existing methods both in terms of computational requirements and specificity of the detected loops. However, the manuscript in its current format does not describe the method adequately, and the comparison with the other methods is limited and inconsistent. It would be good to describe each step of the method (filtering based on distance, candidate selection based on the negative binomial test, additional filtering options, local enrichment testing using different neighbourhoods in a Wilcoxon rank-sum test). The graphical representation currently included for the algorithm is not informative for most of these steps. For the scientific community, it would be more informative if this method's performance were analyzed further. Even though it is mentioned that the loop detection greatly depends on the initial parameters, the results do not show how the parameters influence it. The comparison of HiCExplorer with other existing methods is inconsistent. Finally, the text would need heavy editing for language, clarity and minor spelling mistakes. Specific comments: The background does not clearly lay out the motivation behind designing this algorithm. There are similar existing methods that are fast. Why is it expected to detect chromatin loops better? This is not a journal specialized in 3D genomics, therefore the text should introduce Hi-C and its challenges clearly. For example, the notion that genome properties and ligations affect Hi-C data analysis is mentioned in the methods section without further elaboration. It would be hard for readers to understand why the authors are normalizing for ligation events in their algorithm. The background introduces a few methods that are not aimed at detecting chromatin loops (e.g. GOTHiC) or not designed for Hi-C (e.g. cLoops) and are also not used in the comparison. It would be more useful to describe the algorithms of those methods that are comparable to HiCExplorer in terms of their goal and design. Figure 1, which represents the steps of the algorithm, does not make it clear what happens at each step, and some of the arrows seem to point to random pixels, e.g. in panel C. More elaboration on the use of the three different expected value calculation methods is needed. Which one is more appropriate for a mammalian vs. an insect Hi-C? Does it depend on the genome size, the sequencing depth or the sparsity of the data? The negative binomial distribution does model the read counts well in most high-throughput sequencing experiments, but the rationale given for choosing it is not appropriate. Also, citing a stackexchange discussion for the methods is not suitable. The numbers in most tables could be better appreciated if they were represented in a figure. What was the reason to increase the distance only to 8 Mb instead of using the full genome for comparison, especially given that some of the compared methods only work on the full genome? The bottom left neighbourhood in HiCCUPS is assessed because they only use the upper triangle of the Hi-C matrix, and the bottom left neighbourhood represents the shorter interactions. In Figure 2, the detected interactions are indicated on the bottom triangle, which is counterintuitive. Fig 2A shows the same data as Fig 2A in the Galaxy HiCExplorer publication (Wolff et al. 2020), but the detected loops indicated are different. What is the reason for that? The difference between the proportions of CTCF-bound loops for the different methods is probably not significant. It should be tested. I declare that I have no competing interests.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.03.05.979096v1
www.biorxiv.org www.biorxiv.org

ChemChaste: Simulating spatially inhomogenous biochemical reaction-diffusion systems for modelling cell-environment feedbacks

3
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Results
  
  Reviewer name: Lutz Brusch (revision 1)
  
  The revised version of the manuscript "ChemChaste: Simulating spatially inhomogenous biochemical reaction-diffusion systems for modelling cell-environment feedbacks" addresses all my previous comments and I would also like to thank the authors for their in-depth response. I declare that I have no competing interests.
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Motivation
  
  Reviewer name: Lutz Brusch
  
  The manuscript no. GIGA-D-21-00383, entitled "ChemChaste: Simulating spatially inhomogenous biochemical reaction-diffusion systems for modelling cell-environment feedbacks", addresses the important technical challenge of hybrid discrete-continuous models. The presented extension of the widely used Chaste software library, termed ChemChaste, now supports simulations of reaction-diffusion dynamics in a 2-dimensional environment bi-directionally coupled to motile and chemically active but point-like cells. Specifically, ChemChaste supports arbitrarily many spatial domains within the system, each with individual uniform diffusion coefficients. It supports arbitrarily many coupled reaction-diffusion equations and coupling via membrane reactions and transport reactions between bulk molecular species and intracellular species. Cells are coarsely represented as points on a cell-mesh that is distinct from the FE-mesh for solving the reaction-diffusion dynamics. The user interface is established through a tree of many small text and csv files that are human-readable. All these extensions to Chaste are valuable and their presentation is important for the large user base and beyond. The manuscript is clearly structured and well written. The source code is openly available under the permissive BSD 3-clause license at the provided GitHub link (https://github.com/OSS-Lab/ChemChaste) and includes all models, parameters and data as used in the present manuscript. As the motivation and title focus on "...modelling cell-environment feedbacks", the implications and limitations of the coarse cell representation in ChemChaste must also be clearly stated, see comments below. Major comments:
  
  Coarse spatial cell representation: Cells are represented by their node position in the cell-mesh and interact with the environment through a single node at the same position in the FE-mesh. Can this formalism properly account for transport reaction fluxes in strongly heterogeneous environments where the FE-mesh needs many nodes with differing field values in a spatial area equivalent to the size of a single cell (with the cell node inside this area)? For example, how does this formalism evaluate the uptake from an exponential concentration gradient (as is common for diffusion and degradation around a localized source). For such a field, the local concentration value at any single position is always smaller than the average over any symmetric interval around it. Hence a transport reaction flux calculated with the single concentration value at the cell center will systematically underestimate the flux that would result from averaging over the area equivalent to the size of the cell. Moreover, such systematic errors also occur for linear concentration gradients and can get amplified when transport or membrane reactions are nonlinear with for instance high Hill coefficient. For comparison, with a spatially more explicit cell representation with many paired cell-nodes and field-nodes, one could directly sum the flux contributions from these paired field-nodes. But with the single cell-node here, usability seems limited to weak gradients at the scale of cell size. Alternatively, can a spatial kernel or stencil function be used to average or sum over field values in the spatial area equivalent to the size of a cell?
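
  The underestimation described in this comment can be checked numerically. The following is a minimal sketch, not ChemChaste code: it assumes a one-dimensional field c(x) = c0*exp(-x/decay_length) and a cell occupying the symmetric interval [x0 - radius, x0 + radius]. The function name and all parameters are hypothetical.

```python
import numpy as np

def point_vs_cell_average(x0, radius, decay_length, c0=1.0, n=2000):
    """Compare the field value at the cell centre with its average over the cell.

    Assumes c(x) = c0 * exp(-x / decay_length) in one dimension
    (an illustrative field, not taken from ChemChaste).
    """
    # Midpoints of n equal sub-intervals covering [x0 - radius, x0 + radius].
    h = 2.0 * radius / n
    x = x0 - radius + (np.arange(n) + 0.5) * h
    average = np.mean(c0 * np.exp(-x / decay_length))  # midpoint-rule average
    point = c0 * np.exp(-x0 / decay_length)            # value at the cell centre
    return point, average

point, average = point_vs_cell_average(x0=3.0, radius=1.0, decay_length=1.0)
# Analytically the ratio is sinh(a) / a with a = radius / decay_length.
ratio = average / point
print(f"point={point:.4f} average={average:.4f} ratio={ratio:.4f}")
```

  For radius equal to the decay length, the cell-averaged concentration exceeds the centre value by the factor sinh(1), about 1.175, so a linear uptake law evaluated only at the centre would miss roughly 15% of the cell-averaged value in this illustrative setting.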
  
  Conservation of mass for transport: In biology, the number of molecules per time taken from the environment in a transport reaction has to equal the number of molecules per time added to the cell, and vice versa. So mass needs to be conserved and not concentration, whereas ChemChaste seems to add and subtract the concentration flux in the different spatial compartments (cf. page 7 of SI.S1.4). For example, if the FE-mesh needs to use multiple nodes in a spatial area equivalent to the size of a single cell (hence Ve<Vc) but the transport reaction only relates the concentration value at one of these nodes to the cell-node, then mass is not conserved and results will be wrong. One option may be to attach volume attributes to nodes in both meshes. A node i in the cell-mesh would store the current cell volume Vc_i and a node j in the FE-mesh would store that node's share of the volume in the environment Ve_j (doubling the number of nodes in the FE-mesh would on average halve each node's volume Ve_j). Then secretion of molecules with intracellular concentration u at rate k would reduce the intracellular concentration by a flux of molecule number per time and per volume, i.e. k*u*Vc/Vc=k*u, and increase the concentration at the environment node with flux k*u*Vc/Ve, which in general is and must be different from the intracellular concentration flux k*u. Likewise, if the FE-mesh is coarse (hence Ve>Vc) then the transport flux must get diluted like k*u*Vc/Ve < k*u. The factor Vc/Ve does not appear to be implemented and the equations on page 7 of SI.S1.4 omit this factor, limiting the usability to the special case Vc=Ve. This implies that the construction of the FE-mesh has to match the cell-mesh wherever cells are positioned and in their neighborhood. This limitation and the required construction of the FE-mesh must be described.
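
  The role of the factor Vc/Ve can be illustrated with a small sketch. It is an assumption-laden illustration and not the ChemChaste implementation: the function name, the explicit Euler update and all numerical values are hypothetical, and only a single cell node paired with a single environment node is considered.

```python
def secretion_step(u_cell, c_env, k, v_cell, v_env, dt):
    """One explicit Euler step of first-order secretion at rate k.

    The cell concentration falls by k*u*dt, the environment node rises by
    k*u*(v_cell/v_env)*dt, so the molecule number u*v_cell + c*v_env is kept.
    """
    molecules_moved = k * u_cell * v_cell * dt   # molecule number leaving the cell
    u_new = u_cell - molecules_moved / v_cell    # concentration change: -k*u*dt
    c_new = c_env + molecules_moved / v_env      # concentration change: +k*u*(Vc/Ve)*dt
    return u_new, c_new

# Hypothetical values with a fine FE-mesh, i.e. Ve < Vc.
u0, c0, v_cell, v_env = 2.0, 0.5, 1.0, 0.25
u1, c1 = secretion_step(u0, c0, k=0.1, v_cell=v_cell, v_env=v_env, dt=0.01)

molecules_before = u0 * v_cell + c0 * v_env
molecules_after = u1 * v_cell + c1 * v_env
print(molecules_before, molecules_after)
```

  With Vc/Ve = 4 in this example, the environment concentration rises four times as fast as the intracellular concentration falls, while the total molecule number stays constant.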
  
  Scaling of fluxes with cell surface area: In biology, membrane reactions and transport reactions occur at the molecular scale and yield a characteristic flux density per membrane area. The total flux per cell is then the integral of the flux density over the cell surface. Hence cells with larger surface area must be able to exchange more molecules with the environment. Since differently shaped cells will have different surface to volume ratios, it appears necessary to attach not only a cell volume Vc_i to each node i of the cell-mesh but also a surface area value Ac_i. The transport reaction fluxes from item 2. above then become k'*Ac*u*Vc/Vc = k'*Ac*u and k'*Ac*u*Vc/Ve, respectively, with a new rate constant k' with units [1/(area*time)]. The same argument applies to membrane reactions. Only if all cells have the same constant surface area does Ac not need to be attached to nodes, and k may then be used instead of k'*Ac.
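
  A short sketch of the area-scaled fluxes named in this comment follows. The function and the numerical values are hypothetical and serve only to show that the exchange scales linearly with the membrane area Ac while the molecule balance between cell and environment is preserved.

```python
def transport_fluxes(u_cell, k_prime, a_cell, v_cell, v_env):
    """Concentration fluxes for secretion with rate constant k' per membrane area.

    Returns (loss rate of the intracellular concentration,
             gain rate of the concentration at the environment node).
    """
    flux_cell = k_prime * a_cell * u_cell                   # k'*Ac*u
    flux_env = k_prime * a_cell * u_cell * v_cell / v_env   # k'*Ac*u*Vc/Ve
    return flux_cell, flux_env

# Two hypothetical cells with equal volume but different surface area.
small = transport_fluxes(u_cell=2.0, k_prime=0.05, a_cell=1.0, v_cell=1.0, v_env=0.5)
large = transport_fluxes(u_cell=2.0, k_prime=0.05, a_cell=2.0, v_cell=1.0, v_env=0.5)
print(small, large)
```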
  
  User interface and model format: To improve Interoperability according to FAIR,
  
  please explore and comment on how the files that are required for model definition in ChemChaste can or cannot be packaged in a COMBINE archive [Bergmann et al. (2014). COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC Bioinformatics 15:369. https://doi.org/10.1186/s12859-014-0369-z].
  
  please compare ChemChaste's declaration of the reaction-diffusion model in the environment to that of the SBML Level 3 Spatial Processes Package (SBML-spatial) [https://synonym.caltech.edu/documents/specifications/level-3/version-1/spatial/].
  
  please compare ChemChaste's declaration of the reactions to that of the Antimony model format as used in the Tellurium framework [Smith et al. (2009). Antimony: a modular model definition language. Bioinformatics 25:2452. https://doi.org/10.1093/bioinformatics/btp401].
  
  please discuss the necessary steps to convert model files available in SBML-spatial or Antimony to ChemChaste and vice versa.
  
  Numerical accuracy of the 3-fold operator splitting scheme for cell-environment coupling: As shown in Fig.1b, the three operators 1 (Cell dynamics), 2 (Environment dynamics), 3 (Cellular fluxes) are applied sequentially for a coupled cell-environment model. How is the numerical error controlled for this 3-fold operator splitting scheme? How are time steps chosen or adapted internally?
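
  The question about error control can be made concrete with a toy problem. The sketch below is an illustration under stated assumptions and does not reproduce the ChemChaste scheme: it splits a linear system du/dt = (A + B)u into two non-commuting operators rather than three, applies each sub-step exactly, and measures how the error at t = 1 changes when the time step is halved. The matrices are chosen so that all exponentials are known in closed form.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])  # nilpotent: exp(t*A) = I + t*A exactly
B = np.array([[0.0, 0.0], [1.0, 0.0]])  # nilpotent: exp(t*B) = I + t*B exactly
I2 = np.eye(2)

def exact_solution(u0, t):
    # (A + B)^2 = I, hence exp(t*(A + B)) = cosh(t)*I + sinh(t)*(A + B).
    return (np.cosh(t) * I2 + np.sinh(t) * (A + B)) @ u0

def lie_splitting(u0, t_end, n_steps):
    dt = t_end / n_steps
    step = (I2 + dt * B) @ (I2 + dt * A)  # sub-step with A first, then with B
    u = np.array(u0, dtype=float)
    for _ in range(n_steps):
        u = step @ u
    return u

def splitting_error(n_steps, t_end=1.0):
    u0 = np.array([1.0, 0.0])
    return np.linalg.norm(lie_splitting(u0, t_end, n_steps) - exact_solution(u0, t_end))

errors = {n: splitting_error(n) for n in (100, 200, 400)}
ratios = (errors[100] / errors[200], errors[200] / errors[400])
print(errors, ratios)
```

  Halving the time step roughly halves the error, the signature of a first-order sequential splitting. A comparable step-halving test on a coupled cell-environment model would be one way to report the splitting error asked about here.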
  
  Model equations for test case with cell-environment coupling: In SI, Figure S10.c (and file CellA/Srn.txt in the code repository) apparently all 5 reactions are defined as reversible with "<->" and each has a nonzero kr=1.0 but only two of these reactions are reversible in the reaction scheme in main Fig.4a. Probably the file in the repo and SI is wrong (as the reverse generation of Precursor directly from Biomass and Enzyme is not physiological) and possibly the simulation results in Fig.4b may change after correction of the file CellA/Srn.txt.
  
  Findability of repository: To improve Findability of ChemChaste according to FAIR, the code repo should be integrated with or referenced from the core project at https://github.com/Chaste/ . This integration should also facilitate future code maintenance and usability in a sustainable manner.

  Minor comments:
  
  Further tests may be easily implemented for the Schnakenberg model which was qualitatively simulated but not quantitatively compared to an analytical prediction (main text, lines 368-375). One (rough) quantitative comparison could be achieved for the dominant mode of the Fourier-transformed simulated pattern (Fig.3b; or some other measure of the spatial period of the pattern) versus the critical mode of the diffusion-driven instability (|k_cr|^2 = 1/(2D_U) * dR_U/dU + 1/(2D_V) * dR_V/dV). In addition, the instability threshold from eq. (25) in SI.S6 (page 27) can be tested in simulations along a one-parameter scan across the instability and the temporal oscillation period in Fig.3a can be (roughly) compared to the predicted period from the imaginary part of the eigenvalues of the steady state or computed by means of numerical continuation in AUTO (http://indy.cs.concordia.ca/auto).
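
  The rough comparison suggested here could be sketched as follows. This is a hypothetical illustration, not part of ChemChaste: it assumes a one-dimensional periodic profile sampled on a uniform grid, uses a synthetic cosine in place of simulation output, and evaluates the reviewer's expression for |k_cr|^2 with made-up Jacobian entries and diffusion coefficients.

```python
import numpy as np

def dominant_wavenumber(profile, domain_length):
    """Angular wavenumber of the strongest non-constant Fourier mode."""
    u = np.asarray(profile, dtype=float)
    spectrum = np.abs(np.fft.rfft(u - u.mean()))  # remove the mean (k = 0) mode
    k = 2.0 * np.pi * np.fft.rfftfreq(u.size, d=domain_length / u.size)
    return k[np.argmax(spectrum)]

def critical_wavenumber_sq(dRu_dU, dRv_dV, D_U, D_V):
    """|k_cr|^2 = 1/(2 D_U) * dR_U/dU + 1/(2 D_V) * dR_V/dV, as quoted in the comment."""
    return dRu_dU / (2.0 * D_U) + dRv_dV / (2.0 * D_V)

# Synthetic stand-in for a simulated pattern: 15 periods on a domain of length 20*pi.
length = 20.0 * np.pi
x = np.linspace(0.0, length, 512, endpoint=False)
pattern = 1.0 + 0.1 * np.cos(1.5 * x)

k_measured = dominant_wavenumber(pattern, length)
k_cr_sq = critical_wavenumber_sq(dRu_dU=1.0, dRv_dV=-2.0, D_U=0.5, D_V=4.0)
print(k_measured, k_cr_sq)
```

  In an actual test, the synthetic pattern would be replaced by a line scan through the simulated field and the Jacobian entries by the values at the steady state of the Schnakenberg model.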
  
  Main text, lines 460-463: "Thus...lead to a spatial segregation of the two cell types." This behavior may be subject to the slow or lacking active motility of the cells. Now, cell division alone seems to generate compact clones of the same cell type instead of emergent spatial segregation. Maybe comment if/how ChemChaste handles random walks of cells or even chemotaxis of cells towards ES. Then the interesting question of emergent spatial segregation can be studied with ChemChaste.
  
  Please clarify if/how ChemChaste allows to incorporate transport reactions directly between neighboring cells (like auxin or calcium transport in tissues)?
  
  Where are the membrane reactions involving a cell and the environment included in Fig.1b: in steps 1./2. or in step 3.? That is interesting for the numerical operator splitting scheme and may be added to the caption.
  
  In addition to item 7. above (which should ensure future usability), the reproducibility of the current model results as presented in this manuscript should be ensured by archiving the current software version from the ChemChaste code repo at Zenodo or a similar service and the DOI of that archive should be given in the manuscript. In addition, that archived code shall be given a version number on GitHub and that version number shall also be given in the manuscript.

  Figure improvements:
  
  Figure 2.b may have axes flipped or may have an unfortunate color scale with too little contrast for convergence scores between 0.4 and 0.5 to show the gradual change of score at the horizontal row with dt=0.1 (which is apparently used in Fig. 2.c and shows a change of accuracy there). Please check and improve the correspondence between panels b) and c) such that the data from panel c) helps to get a feeling for the L2 score changes in panel b).
  
  Figure 2.b: How can we understand the loss of convergence if the time step is reduced (say from 0.006 to 0.0002) at any fixed dx? With other solvers, one is used to a finer dt improving convergence, whereas this plot shows dark (high L2 score) areas on both sides of the light (low L2 score) areas at intermediate values of dt.
  
  Figure 2.c: The color code is not suited for so many curves. Either include line style or reduce the number of curves (preferred). It must become clear which curve belongs to which dx. The green curve with dx=0.8 seems to be hidden?
  
  Figure 3.a: The figure caption should explain the source of variation between nodes (e.g. by pointing to the noise terms in eqs. 13,14) and the color code for the two bands (dark and light) around each curve (1-sigma and 2-sigma or 1-sigma and min/max ?).
  
  Figure 4b: These two panels could be given more space. Suggestion: re-arrange part a) horizontally and then put both diagrams of b) at the bottom, left and right.
  
  Figure 5: The caption wrongly announces "and t=100" which is not shown. Also the words "towards the" in the first line seem to be linked to t=100.

  Text corrections:
  
  main text, line 61. The sentence "...centred on the role chemical coupling." seems to miss the preposition "of".
  
  main text, line 71. The phrase "cellular network reaction size" appears misleading, when it shall refer to "the size of the cellular reaction network".
  
  main text, lines 280, 284, 286: Since the subsections of the Results section are not numbered here, then the text pointers "(Section )" can be omitted.
  
  main text, one line below eq.(7): "reaction rate constants parameters" can drop the word "parameters"
  
  main text, lines 450 and 451: "a...concentrations" should be either singular or plural
  
  SI.S1, page 1, line 5 above eq. (1): text "exchange chemical concentrations" should read "exchange molecules" and, correspondingly, "controlling the chemical concentrations passing between the bulk and the cell" should read "controlling the flux of molecules between the bulk and the cell".
  
  SI.S1, page 2, line 2: "asssociated" has an "s" too much
  
  SI.S1, page 5, at the end of Fig.S1's caption: $k-p$ should be $k_p$
  
  SI.S2.2.1, page 14, eq. (11) has capital U_0 and V_0 as initial values while the sentence above has small u_0, v_0. These should be the same symbols.
  
  SI.S6, page 26, 1 line below eq. (19): "is a spatial case" should be "is a special case"

  Declaration of Competing Interests: I declare that I have no competing interests.
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Cheryl Sershen
  
  It would be nice to include the GitHub link for Chaste. I was able to use the software and reproduce the results presented in the paper. The software is easy to use and install. A broader discussion of what would be necessary to expand ChemChaste to three dimensions is necessary. In a follow-up paper, comparisons to actual experimental results would be useful and would encourage users to consider this software. Only proximity to the analytical solutions was presented here.

  Declaration of Competing Interests: I declare that I have no competing interests.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.10.21.465304v1
www.biorxiv.org www.biorxiv.org

https://biorxiv.org/cgi/content/10.1101/2021.07.05.451071

5
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  report
  
  Reviewer name: Yang Zhou (revision 1)
  
  The authors have resolved most of my comments. However, I am still confused about the gap in the Pilon step from the information in Table 1. In the table, I could read that the assembly length of "Flye + Pilon" is 2,383,228,608 bp, and the ungapped length is 2,383,226,373 bp, so the gap length is 2,383,228,608 - 2,383,226,373 = 2,235 bp. Because in the "Flye" version the assembly length is equal to the ungapped length, this means that gaps are introduced after Pilon correction.

  Declaration of Competing Interests: I declare that I have no competing interests.
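
  The gap arithmetic in this report can be reproduced with a few lines. The sketch below is illustrative only: the two lengths are the Table 1 values quoted by the reviewer, while the helper names are hypothetical. The second helper shows how the same quantity could be obtained directly from a sequence, assuming gaps are encoded as runs of N.

```python
def gap_length(assembly_length, ungapped_length):
    """Number of gap bases implied by total and ungapped assembly lengths."""
    if ungapped_length > assembly_length:
        raise ValueError("ungapped length cannot exceed assembly length")
    return assembly_length - ungapped_length

def count_gap_bases(sequence):
    """Count N/n characters in a sequence string (the usual gap encoding in FASTA)."""
    return sum(1 for base in sequence if base in "Nn")

# Values quoted from Table 1 for the "Flye + Pilon" assembly.
flye_pilon_gaps = gap_length(2_383_228_608, 2_383_226_373)
print(flye_pilon_gaps)
```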
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Findings
  
  *Reviewer name: Yang Zhou*
  
  The authors have resolved most of my comments. However, I am still confused about the gap in the Pilon step from the information in Table 1. In the table, I could read that the assembly length of "Flye + Pilon" is 2,383,228,608 bp, and the ungapped length is 2,383,226,373 bp, so the gap length is 2,383,228,608 - 2,383,226,373 = 2,235 bp. Because in the "Flye" version the assembly length is equal to the ungapped length, this means that gaps are introduced after Pilon correction.

  Declaration of Competing Interests: I declare that I have no competing interests.
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Syrian
  
  Reviewer name: Derek Bickhart (revision 2)
  
  The authors have addressed all of my remaining concerns.

  Declaration of Competing Interests: I declare that I have no competing interests.
4. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Background
  
  Reviewer name: Derek Bickhart (revision 1)
  
  Summary: In this revision, the authors have addressed most of my major concerns with the manuscript. More details must be provided in two sections of the manuscript based on new details provided by the authors. However, these concerns could feasibly be addressed in revision.

  Line 124: While the authors have provided an explanation for the sequencing of different target fragment length library preparations, I do not see any results that suggest that one particular preparation was more efficient than the others. This is particularly important given the prevalence of four experimental runs of varying dataset sizes that were uploaded to the cited Biosample accession on SRA. Currently, the metadata provided for that Biosample and its associated experiments is lacking, and one cannot easily distinguish which experiment resulted from different target length preparations. A discursive analysis is not required here, but a statement that provides limited data supporting the authors' preference for library prep is necessary.

  Line 301: I believe that the authors misinterpreted the comment on this section in my last review. I requested the proportion of sequence identity differences between assemblies due to INDELs, not assembly gaps. Residual INDELs are still a major problem in polished assemblies that may impact gene annotation.

  Figure 1 caption: Given the new k-mer genome size estimation analysis provided by the authors, it does not make sense to use the total length of the MesAur1.0 assembly here. I believe that the authors should choose a genome size estimate that seems most reasonable (from the two options provided) and then use that as the basis for NG50 comparisons. Otherwise, are they conceding that the MesAur1.0 assembly size is the full length of the Syrian Hamster sequence-accessible genome?

  Declaration of Competing Interests: I declare that I have no competing interests.
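
  The point about the basis for NG50 comparisons can be illustrated with a small sketch. It uses the common definition of NG50 (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of an estimated genome size) and toy contig lengths; none of the numbers come from the manuscript, and the function names are hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """Contig length at which the cumulative sum reaches half of genome_size.

    Returns None if the assembly covers less than half of genome_size.
    """
    threshold = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None

def n50(contig_lengths):
    """N50 is NG50 with the assembly's own total length as the reference size."""
    return ng50(contig_lengths, sum(contig_lengths))

toy_contigs = [80, 70, 50, 40, 30, 20]   # toy lengths, total 290
print(n50(toy_contigs), ng50(toy_contigs, 400), ng50(toy_contigs, 1000))
```

  The same contigs give different values depending on the reference size, which is why a single agreed genome size estimate is needed before NG50 values of different assemblies are compared.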
5. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Derek Bickhart
  
  Summary: In this manuscript, Harris et al. detail the methods they used to create a new reference genome for the Syrian hamster, which is an important model for respiratory disease pathogens. They used several different sequencing technologies to generate the contigs and scaffolds for their new assembly, and achieved a relatively continuous end product. The analysis is suitable for the "genome report" style format (with one omission detailed below in my comments); however, the manuscript suffers from some awkward phrasing and grammar errors in the results and methods. I list my comments below in the relative order in which I encountered them in the manuscript. Since the authors did not provide line numbers in their submission, I provide my comments as a block listing of questions/suggestions/critiques.

  Section titled "oxford nanopore long-read sequencing": The description of the shearing is awkward. I recommend revising the first sentence to state that the genomic DNA isolates were sheared to three lengths (without providing these lengths in the sentence). In subsequent sentences, provide the lengths in situ with the methods used to prepare them. Also, it is unclear why three different fragment lengths were used here for Oxford Nanopore sequencing. Given that these fragment lengths are relatively similar in size (e.g. not disparate lengths similar to recent ultra-long nanopore read preps of >100kb), it would be very helpful to the reader if justification was given for this approach.

  Section titled "Genome assembly": This entire paragraph is awkwardly phrased with numerous past- or present-tense changes. Additionally, the reference to the Pilon polisher needs to be cited, and details need to be provided on what settings were used for Pilon polishing (it is often recommended to correct only indels and to omit gap-filling) and how many iterations of polishing were used. Details are missing on how BioNano optical maps were generated, and what DNA was used as input in the process. Also, what software was used to compare BioNano optical maps, and with what settings? Finally, it appears that the RNA-seq data used by NCBI for annotation was used in another study. Citation to that study would be required so that the reader is aware that the data resulted from different individuals other than the reference individual sequenced in this analysis.

  Section titled "Assembly Comparisons": What is the expected C-value of the Syrian Hamster genome? Also, what is the karyotype count? Are any of the chromosomes metacentric or acrocentric? Were any satellite regions identified and annotated in this assembly? Finally, I would have preferred that assembly comparisons be conducted with feature response curves, such as those produced by the program "FRC_align" as this provides a useful metric to assess assembly "correctness" by length.

  Section titled "Transcript and protein alignments and annotation comparisons": How many INDELs were identified in the alignments of RNA-seq transcripts to the BCM_Maur_2.0 assembly? Was this count different from those discovered in the short read assembly?

  Section titled "Interferon type 1 alpha gene cluster": Were there any gaps that spanned the gene cluster or flanked it?

  Declaration of Competing Interests: I declare that I have no competing interests.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.07.05.451071v4
www.biorxiv.org www.biorxiv.org

A chromosome-level reference genome of Ensete glaucum gives insight into diversity, chromosomal and repetitive sequence evolution in the Musaceae

3
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Findings
  
  Reviewer name: Boas Pucker (revised version)
  
  The authors further improved the quality of this manuscript and responded to all my comments. My concerns were addressed and several comments were solved by extensive analyses (e.g. #7). Although some opportunities for further investigations were left for future studies, I still believe that this work is very important for the community. The quality of this Ensete glaucum assembly appears very high. I would like to congratulate the authors on this excellent work and recommend its publication in GigaScience. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. 
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Background
  
  Reviewer name: Ning Jiang
  
  In this study, the authors described the generation of a high-quality reference genome of Ensete glaucum, which is one of the most cold-hardy species in the Musaceae. It is also well known for its drought tolerance. The authors compared the expansion and contraction of gene families and the composition of repeats among related species. The genome assembly, analysis, and annotation are certainly useful for comparative genomic studies as well as future breeding practice. Everything seems to make sense to me. Certainly, the results are descriptive, but this is more than sufficient for a data note. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. 
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Boas Pucker
  
  Wang et al. generated a chromosome-scale genome sequence assembly of Ensete glaucum based on ONT long reads. This is a valuable resource for comparison against various Musaceae species. This assembly will certainly help to identify genes underlying agronomic traits in Musaceae. Important data sets are already well integrated into the banana genome hub and available to the community. The authors harnessed this highly contiguous assembly for analyses of synteny against Musa acuminata and for the investigation of repeats/TEs. Overall, the quality of this work is high and the manuscript is well written. I am not sure why this submission is classified as a data note, because it could also pass as a research article. I noticed a few issues and provided some specific comments that might be helpful to further improve the quality of this work:
  1) There are many numbers in the abstract. I would recommend reducing these to the most important ones. For example, the BUSCO results could be removed.
  2) There is only one short paragraph about existing genome sequences. I would recommend extending this and mentioning the banana genome hub as the central community resource.
  3) Please indicate if the coverage estimations are based on the haploid or diploid genome size (Table 1).
  4) Please provide additional details about the BUSCO results (C, S, D, F, M) in line 114 and/or in Table 2.
  5) I find the sentence in line 120/121 confusing when reading for the first time. This suggests to me that more sequence was anchored than present in the initial assembly. The sentence is correct, but it might be better to present the total assembly size first and to describe the anchored proportion in a separate sentence.
  6) It would be helpful to clearly distinguish between the genome (DNA) and the genome sequence (the assembly). That would make it easier to understand the discussion of differences between both (e.g. collapsed repeats).
  7) Genome size estimation is always tricky. 
  I would recommend running several tools and providing the estimated range (findGSE, gce, MGSE, GenomeScope, …). It is also important to run the k-mer-based approaches with different k-mer sizes. Apparently, GenomeScope was used for the heterozygosity analysis, but not for the genome size estimation. That is surprising.
  8) Statistics about the pseudochromosomes in Table 2 could be removed. For example, it is not necessary to say that the L50 number of 9 chromosomes is 5.
  9) Please explain the difference in BUSCO results between predicted genes and BUSCO run in genome mode. Which genes are missing in the annotation? Table S3 suggests that the automatic BUSCO annotation (genome mode) is superior to the annotation generated in this study (analyzed in transcriptome mode).
  10) Some statements about the CENs and telomeres would be interesting. These could give a good impression of the assembly results. Estimating their copy numbers could help to explain the difference between assembly size and estimated genome size.
  11) Are there any genetic markers that could be used to check the assembly accuracy?
  12) In my opinion, the section "Gene distribution and whole-genome duplication analysis" could be removed. Genes are never equally distributed across a genome and repeats/TEs are usually clustered around the centromeres. Therefore, this part does not add any novel insights. The second paragraph comes to the conclusion that all Musaceae share the same WGDs. This seems obvious to me. Was there a different expectation?
  13) Orthogroup identification could be complemented with a synteny analysis. A comparison to Musa acuminata (https://doi.org/10.1038/s42003-021-02559-3) could help to check the accuracy of the orthogroups.
  14) The statement "Genes with Ka/Ks > 1 were under positive selection (Supplementary Table S6)." does not fit well with the rest of this paragraph. Given that there are >35k genes, some would show values >1 by chance. 
  Some statistical test would be needed to find out which genes are actually under positive selection. What is the conclusion from the identification of such genes? Any enrichment of particular functions?
  15) The statement about the sugar transporters is interesting. This would be a good chance to connect these comparative genomics results with the transcriptome analyses.
  16) Transcription factor families are mentioned, but not discussed. It is not surprising that MYBs are the largest TF gene family. However, it would be interesting to know if there are any striking differences compared to M. acuminata (https://doi.org/10.1371/journal.pone.0239275). Some MYBs like the anthocyanin regulators respond to sugar treatments. Is there a connection to the large number of sugar transporters? Any duplications/deletions compared to M. acuminata? This could be another opportunity to better connect different aspects of this study.
  17) It is interesting to read that head-to-head and tail-to-tail repeats appeared collapsed. Previous studies identified that these arrangements of repeats are associated with low local read quality (e.g. https://doi.org/10.1093/nar/gkaa206, https://doi.org/10.1186/s12864-021-07877-8). I would not expect that both strands of the DNA molecules are sequenced. The authors might want to check this and provide additional explanation.
  18) I am surprised that TEs were the most abundant class of repeats. Could this be caused by treating all the different TEs as one group? CENs should appear with a much higher copy number than individual TEs or TE families.
  19) The centromeric patterns could be compared to the situation in Arabidopsis thaliana: https://www.science.org/doi/10.1126/science.abi7489.
  20) Are SSRs less frequent around the centromeres and on the NOR chromosome arm or is this just a lack of detection in these regions?
  21) Why is AG/CT more abundant than other SSRs? This could be compared to other species. 
  22) References for the length of the 45S rDNA in other species are missing.
  23) How many 45S rDNA copies can be inferred from the ONT reads? The coverage is much higher, thus this estimation should be more reliable.
  24) The NOR chromosome arm is depleted of protein-encoding genes, but there should be plenty of rRNA genes. Please specify this in the sentence.
  25) The synteny section is lengthy. The statements in context of previous studies are good, but removing some purely descriptive parts might make it more interesting. The corresponding figures show everything and could stand on their own.
  26) What is the value of genotyping-by-sequencing if not combined with GWAS?
  27) Which ONT flow cell type? Which Guppy version?
  28) It does not become clear how the Hi-C library was prepared (line 562). What is the improvement? Please explain this here.
  29) Please add the detailed parameters of the assembly and polishing.
  30) The BWA reference is missing. Why was BWA not used for the mapping of the Hi-C reads?
  31) The statement in line 592/593 suggests that Hi-C was used for validation. However, it was also used for correction in the previous step. Anyway, this result should be moved from the methods to the results section.
  32) The Trinity assembly and PASA steps lack details.
  33) Parameters of the STAR mapping and gene prediction steps are missing.
  34) There is some discrepancy concerning the Musa acuminata genome assembly versions. It seems that v2 is used in some cases and v4 in others. Please check this.
  35) Please make the customized script available via GitHub (line 732) if this is different from the one mentioned in line 737.
  36) Are the TE results consistent if different 2 Gb subsets of the Illumina data are analyzed?
  37) How were the centromere positions determined? I think that I have missed that in the methods section. It must be connected to the CEN repeats, but the precise approach could be explained in more detail. 
  38) The read data sets are not released, thus I cannot check if all raw data sets were included. It would be particularly important to have the FAST5 files of the ONT data to study base modifications in the future.
  39) The link to the banana genome hub appears to be broken in the data availability statement. The data sets on the genome hub look fine.
  40) The terms "core" and "pseudo-core" in Fig. 3 are not frequently used in the literature. These genes seem to have different degrees of dispensability and might be conditionally dispensable (https://pubmed.ncbi.nlm.nih.gov/24548794/; https://doi.org/10.1186/s13007-021-00718-5).
  41) There seems to be some variation in the genome size estimation. I would recommend presenting the results of multiple k-mer sizes (e.g. 17-25). The distribution of the resulting values might help to estimate the true genome size.
    JellyFish (k=17): 563Mb
    findGSE (k=21): 589Mb
    GenomeScope (k=21): 489Mb (this is smaller than the actual assembly size)
  42) The presented sugar transporters are not among the top enriched GO terms (S2). Therefore, I am afraid that this analysis is not very informative. Could it be that the "enriched" GOs are just a "random" set?
  43) Why is E. glaucum not presented as S5C? A direct comparison would make more sense.
  44) S10: I would recommend identifying the precise break points. Next, it would be good to validate the accuracy of the assembly by finding individual reads that actually support the situation in E. glaucum. This would help to exclude an assembly artifact as the reason for the difference.
  45) It might be better to use a three-letter abbreviation of the species ("Egl" instead of "Eg") in the gene IDs to avoid ambiguities in future genome sequencing projects.
  46) The method section states that short DNA fragments below 12kb were removed. S11 suggests that two libraries were sequenced: one with depletion of the short fragments and one without it. Please check this. 
  Generally, I would recommend trying a different gDNA extraction protocol and using SRE instead of BluePippin.
  47) The north of eg06 looks suspicious in the Hi-C analysis (S12). There is also no substantial synteny with any of the Musa chromosomes (S8). Could this be an indication that there are errors in the assembly?
  48) Table S1: What is the point in showing that all contigs are larger than 1, 2, and 5kb?
  49) 445 bHLHs in M. acuminata is almost twice the number of bHLHs detected in E. glaucum. Some other TF families also show this large difference, but other families show almost equal numbers. It could be interesting to further investigate this. The HB-KNOX value of M. acuminata is missing.
  Minor comments:
  line 70/71: Some countries are named multiple times. Please change this.
  line 113: chromosomes > pseudochromosomes
  line 273/274: Please check this sentence.
  line 428: Please rephrase "translated proteins" and SynVisio should only be named in the method section.
  line 436: "protein-coding genomes" ?
  line 464: "second (right)" … should be replaced by north/south or q/p nomenclature. This also affects some following sentences.
  line 625: "Musa acuminata" is a species name
  line 639: blast > BLAST
  line 731: of of > of
  line 811: RNA-sequencing > RNA-seq (I have not seen a section about RNA sequencing)
  S10: "E glaucum" > "E. glaucum"
  Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? 
 Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
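Comment 14 in the report above notes that, with more than 35k genes, some Ka/Ks values will exceed 1 by chance and that a statistical test is needed. The report does not name a test or a correction procedure, so the following is only a minimal sketch of one common option: assuming a per-gene p-value for Ka/Ks > 1 is already available (the values below are hypothetical), a Benjamini-Hochberg correction controls the false discovery rate across all genes.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a test is significant at FDR alpha."""
    m = len(p_values)
    # Indices of the tests, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values for the hypothesis Ka/Ks > 1 (not from the manuscript).
p_values = [0.0001, 0.004, 0.03, 0.04, 0.2, 0.6]
naive = [p < 0.05 for p in p_values]        # uncorrected threshold
corrected = benjamini_hochberg(p_values)    # FDR-controlled calls
print(sum(naive), "genes pass p < 0.05;", sum(corrected), "remain after correction")
```

In this toy input, genes that pass an uncorrected threshold can drop out once the number of tests is taken into account, which is the concern raised in the comment.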

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.11.23.469474v2
www.biorxiv.org www.biorxiv.org

Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification

4
1. GigaScience 15 Feb 2023
  
  in GigaScience
  
  ML
  
  Reviewer name: Gael Varoquaux (revision 1)
  
  I would like to thank the authors for the work done on their manuscript, in particular adding the experiments that enable linking to sparse-recovery theory. In my opinion, the manuscript brings a lot of value to the application community and is pretty much complete. A few details come to my mind that could help its message be most accurate. Because of my suggestions, the authors have used an l1 penalty in the SVC. This worked well in terms of prediction. However, it is not the default. I think that the authors should stress this and be precise about the penalty each time they mention the SVC. In addition, I think that there would be value in performing an additional experiment with an l2 penalty (which is the default) to stress the importance of the l1 penalty. The message should stress that the penalty (l1 vs l2) is important, and the loss (log reg vs SVC) less so. As a minor detail, I would invert the color scale of one of the plots in figures S12 and S13, to stress the parallel between the two. Finally, I think that it is important to stress in the conclusion that all the results build on the fact that the predictive information is sparse (maybe putting this in words more familiar to the application community). Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Choose an item. Choose an item. Statistics Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. 
Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
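The distinction drawn in the report above between the penalty (l1 vs l2) and the loss (logistic regression vs SVC) can be written out in a generic form. This is a sketch in standard notation, not taken from the manuscript: the symbols w, b, C, x_i, y_i and the exact form of the hinge loss are assumptions, and implementations differ in details such as using a squared hinge.

```latex
% Generic penalized linear classifier: penalty \Omega plus loss \ell
\min_{w,\,b} \;\; \Omega(w) \;+\; C \sum_{i=1}^{n} \ell\bigl(y_i,\; w^\top x_i + b\bigr),
\qquad y_i \in \{-1, +1\}

% Choice of penalty (the part the report calls important)
\Omega_{\ell_1}(w) = \lVert w \rVert_1 = \sum_j \lvert w_j \rvert
\qquad\text{vs.}\qquad
\Omega_{\ell_2}(w) = \tfrac{1}{2}\lVert w \rVert_2^2 = \tfrac{1}{2}\sum_j w_j^2

% Choice of loss (the part the report calls less important)
\ell_{\text{hinge}}(y, z) = \max(0,\; 1 - y z)
\qquad\text{vs.}\qquad
\ell_{\text{logistic}}(y, z) = \log\bigl(1 + e^{-y z}\bigr)
```

The l1 penalty drives many coefficients exactly to zero, which matches a setting where the predictive information is sparse; the l2 penalty shrinks coefficients without zeroing them.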
2. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Results
  
  Reviewer name: Filippo Castiglione
  
  The article "Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification" by Kanduri et al. describes the construction of suitable reference benchmark datasets to guide new AIRR ML classification methods. The article is interesting and potentially useful in defining benchmark datasets and criteria for constructing specialized AIRR benchmark datasets for the community of researchers interested in AIRR. The authors, following previous indications about model reproducibility and availability, also provide a Docker container which includes all data and procedures to reproduce the study. The article is sufficiently well written, although at times a bit full of details which perhaps could be synthesised further (this has already been done in pictures and tables). I don't have major concerns. Only a couple of notes. It would be good to have a figure or diagram showing an example of bags containing receptors and associated witnesses. It could help the reader not familiar with multiple instance learning. It would be good to have command lines for the generation of data sets (in the case, for instance, of the use of Olga). I understand these are inside the Docker container, but the reader who is not interested in the whole container might find it useful to have access to pieces of the pipeline so as to use this or that tool (be it in immuneML, in Olga, etc.). Curiosity: why have the authors used Olga and not the mate Igor? Why is the performance metric in model training the accuracy and not, for instance, the F1-score? Any particular reason? Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Choose an item. Choose an item. 
Statistics Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
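The question in the report above about accuracy versus the F1-score can be illustrated with a small sketch. The labels below are hypothetical and deliberately imbalanced; nothing here is taken from the manuscript, whose simulated datasets may well be balanced, in which case the two metrics would tell a similar story.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical cohort: 10 positive and 90 negative repertoires.
y_true = [1] * 10 + [0] * 90
# A trivial classifier that always predicts the negative class.
y_pred = [0] * 100

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

On such an imbalanced input, a classifier that never predicts the positive class still reaches high accuracy while its F1-score is zero, which is the usual reason for reporting F1 alongside accuracy.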
3. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Background
  
  Reviewer name: Enkelejda Miho
  
  General opinion: approved with minor changes.
  Comments: The manuscript profiles machine learning methods for AIRR T-cell receptor dataset immune state label prediction to establish the baseline performance of such methods across a diverse set of challenges. Simulated datasets with variable properties are used to provide a large amount of benchmarking datasets with known immune state signals while reflecting the natural complexity of experimental datasets. Their results provide insights on the current limits posed by basic dataset properties to baseline ML models and establish a frontier of improvement of AIRR ML research. The manuscript is understandable and well structured in the approach to comparisons as well as solid conclusions. The graphics are clear and consistent and support the manuscript. Very interesting insight into the importance of single individual variable parameters such as sample size or witness rate on the general accuracy. The advantage of the results to the scientific community is that it offers an evaluation of classical ML methods, provides large and specialized AIRR benchmark datasets, and allows further development and benchmarking of more sophisticated ML methods. The manuscript is overall well-written and we endorse it with minor changes:
  In paragraph Impact of noise on classification performance (page 14) the sentence "but enriched above a baseline in positive class examples" should be corrected to "but being enriched above a baseline in positive class examples".
  In paragraph Machine learning models (methods section, page 21) "lasso" should be corrected to "Lasso".
  In paragraph Machine learning models (methods section, page 21) " '- ' " should be corrected to "'-'" and "𝑋jdenoting» to "𝑋j denoting». 
  In the discussion the sentence "which aligns with the observations that that the majority of the possible contacts between TCR and peptide" should be corrected to "which aligns with the observations that the majority of the possible contacts between TCR and peptide".
  Keep comparisons like size>500 and size > 500 concise.
  Check for missing whitespace as in the description of the figure 1(b): …(5 x 105 % of sequence.. Same in cases like ≈90% | ≈ 90 % or n=60 | n = 60.
  Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Choose an item. Choose an item. Statistics Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests? 
  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. Enkelejda Miho owns shares in aiNET GmbH. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
4. GigaScience 15 Feb 2023
  
  in GigaScience
  
  Abstract
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Gael Varoquaux
  
  The manuscript by Kanduri et al. benchmarks baseline machine-learning methods on simulated sequencing data of adaptive immune receptors to predict the immune states of individuals by detecting antigen-specific signatures. Given that there is a large volume of publications using a wide variety of machine learning techniques with the promise of clinical diagnostics on such data, the goal of the study is to set baseline expectations. From an application standpoint, I believe that the study is well motivated and useful to the community. From a signal processing standpoint, many aspects of the study are trivial consequences of the simulation choices: sparse estimators are good for prediction when the signal is generated from sparse coefficients. Though I do not know this application community well, it seems to me that the manuscript is valuable because it casts this knowledge in a specific application setting; however, it should discuss a bit more the fundamental statistical reasons that underlie the empirical findings. I give below some major and minor comments to help make the study more solid.
  
  1. Plausibility of the simulations. The validity of the findings relies crucially on the simulations, in particular the hypotheses of extreme sparsity. These hypotheses need to be discussed in more detail, with references to back them. The amount of sparsity, as detailed in table 1, is huge, which strongly favors sparse models.
  
  2. Another baseline, natural given the sparsity. I do realize that the goal of this study is not to do an exhaustive comparison of all machine learning methods (an impossible task). However, for someone knowledgeable about sparse signal processing, the study begs the question of whether univariate tests on appropriate k-mers can be enough, an avenue suggested by the authors on page 7. This option should be studied empirically, as it would provide important practical methods.
  
  3. Link to sparse model theory. A vast variety of theoretical results state that a sparse model will be successful for n proportional to s log(p), where n here would be the number of samples in the minority class and s would be the number of non-zero coefficients. A good summary of these results can be found in the book "Statistical learning with sparsity: the lasso and generalizations", T Hastie, R Tibshirani, M Wainwright, 2019. It would be interesting to see how these theoretical scalings match the results, for instance those in figure 3.
  
  4. Accuracy and class imbalance. It seems to me that in parts of the manuscript (fig 4.a for instance) accuracy is compared across different scenarios with varying class imbalance. However, accuracy is not comparable when class imbalance varies: for instance, with a 90% positive class, a classifier that always chooses the positive label will have .9 accuracy. In this light, I don't understand fig 4.a, in which accuracy goes to .5 even for large class imbalance. In addition, the typical good practice is to use a metric for which chance-level decisions are not affected by class imbalance, such as the area under the ROC curve.
  
  5. Comparison with SVC. The manuscript mentions that a Support Vector Classifier is also benchmarked; however, it does not give details on which specific SVC is used. A crucial point is the kernel used: with a linear kernel, the SVC is a linear model, while with another kernel (an RBF kernel, for instance), the SVC is a much more complex model and is not expected to behave well in large p, small n problems. Also, I suspect that the SVC is used with l2 regularization. A linear SVC with l1 regularization would likely have similar performance to the l1-penalized logistic regression, as it is a model of the same nature. These details should be added; ideally, if the model benchmarked is not a linear SVC, a linear SVC should be benchmarked to give a baseline (though the default l2 regularization can be used, to stick to common practices).
  
  6. Wording in the conclusion. The conclusion starts with "To help the scientific community in avoiding futile efforts of developing...". The word futile is too strong and the phrasing will not encourage healthy scientific discussion.
  
  I try to sign my reviews as much as possible. Gaël Varoquaux
  
  Declaration of Competing Interests: I have no competing interests. I agree to the open peer review policy of the journal.
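The point about accuracy and class imbalance (comment 4 above) can be illustrated with a minimal sketch. It is not taken from the manuscript or the review: the helper functions `accuracy` and `roc_auc` and the 90/10 toy labels are assumptions chosen to mirror the reviewer's example of a classifier that always predicts the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 90% positive class, 10% negative class.
y_true = [1] * 90 + [0] * 10

# An uninformative classifier that always answers "positive".
always_positive = [1] * len(y_true)
constant_scores = [1.0] * len(y_true)

acc = accuracy(y_true, always_positive)   # rewarded by the imbalance
auc = roc_auc(y_true, constant_scores)    # stays at chance level

print(f"accuracy = {acc:.2f}, ROC AUC = {auc:.2f}")
```

Under these assumptions the accuracy reflects only the class proportions, whereas the ROC AUC of the same uninformative classifier stays at chance level regardless of the imbalance, which is why the latter is comparable across scenarios.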

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.05.23.445346v2
www.biorxiv.org www.biorxiv.org

Data Note: Monash DaCRA fPET-fMRI: A DAtaset for Comparison of Radiotracer Administration for high temporal resolution functional FDG-PET

3
1. GigaScience 14 Feb 2023
  
  in GigaScience
  
  Functional
  
  Reviewer 3: Chris Armit
  
  This Data Note describes an Open CC0 neuroimaging dataset of 15 subjects (young adults) who underwent simultaneous BOLD-fMRI and FDG-fPET imaging. FDG-fPET ([18F]-fluorodeoxyglucose positron emission tomography) measures glucose uptake in the human brain, whereas BOLD-fMRI (blood oxygenation level dependent functional magnetic resonance imaging) captures the cerebrovascular haemodynamic response. FDG-PET data were acquired using three different radiotracer administration protocols - bolus, constant infusion, and 50% bolus + 50% infusion - and each administration protocol was applied to 5 subjects. BOLD-fMRI and FDG-PET data were acquired while participants viewed a checkerboard stimulation, which was used to trigger dynamic changes in brain glucose metabolism.
  
  This neuroimaging dataset allows researchers to explore the complexity of energetic dynamics in the brain using multimodal imaging data analysis. In addition, this neuroimaging dataset includes structural MRI data for each of the subjects, including T1 and T2 FLAIR, enabling neuroanatomical correlations to be explored. The neuroimaging data are available from OpenNeuro [http://doi.org/10.18112/openneuro.ds003397.v1.1.1] and the authors are to be commended for ascribing a CC0 Public Domain Dedication to this dataset. Importantly, the authors highlight that consent was obtained from participants to release de-identified data. I downloaded a small number of image files from this dataset and I confirm that the de-identified NIfTI (Neuroimaging Informatics Technology Initiative) format files can be opened using Fiji / ImageJ.
  
  This neuroimaging dataset has immense reuse potential and I recommend this Data Note for publication in GigaScience.
2. GigaScience 14 Feb 2023
  
  in GigaScience
  
  Background
  
  Reviewer 2: Nicolas Costes
  
  Jamadar et al. present a database of limited size, but of a rarity which amply justifies its interest. This is a combined dynamic FDG PET (fPET) and fMRI study performed in three groups of 5 subjects for whom 3 different modes of FDG administration were used: bolus, infusion and bolus + infusion. The statistical analysis resulting from this study is also of limited scope due to the low residual degrees of freedom of the design, but it nevertheless makes it possible to confirm the expected characteristics of the shape of PET kinetics; it confirms the superiority of the bolus + infusion protocol, which ensures maximum sensitivity for highlighting the neural circuits involved in the visual flickering task performed during acquisition. The interest of the study lies in the free provision of the whole dataset, which can be used, as is argued, as a demonstrator for the development of methods for correcting, processing and analyzing data. A multivariate analysis combining PET and fMRI and taking advantage of the simultaneous recording is not carried out: a simple voxel-to-voxel GLM analysis makes it possible to expose notable differences between the 3 methods of administration of FDG. However, the provision of data opens the field for future exploitation. The fact that raw data before PET reconstruction are provided is relatively new and opens up the possibility of extending the field of their exploitation to methods of correction and reconstruction. Respecting the BIDS description format as much as possible is also a plus. These data are of undeniable interest to the community, and therefore the description of their content and the exhaustive provision of all the demographic and physical parameters of their realization deserve publication. The following remarks should be considered before publication.
  
  p7. [18F]-FDG: 18 should be in superscript.
  
  p9: raw PET data are in the original format exported from the Siemens console: is there a distinction between list-mode files exceeding 4 Gb, as is the case on the Siemens console? In which format will the raw data be provided?
  
  Results: Figure 2: A. Please specify if plasma curves are corrected for 18F radioactivity decay at the time of injection.
  
  Figure 3. Which correction was applied for Zcorr? FWE? FDR?
  
  Figure 4. How exactly is « percent final change » computed: is it an average of the active periods compared to the rest period? Is it computed from the beta regressor or directly on signal change? In the latter case, on which interval?
  
  Figure 5. As the average across all protocols is provided in Fig3.D to serve as a reference, could you also provide the average across
  
  References: Please review references: check for incomplete references (2., 8., 21. for example) and uniformity of format, and provide DOIs as is already done for the majority of them.
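The question about decay correction of the plasma curves (Figure 2A) can be made concrete with a short sketch. It assumes simple exponential decay and a physical half-life for 18F of about 109.77 minutes; the function name `decay_correct` and the example activity value are illustrative and not taken from the manuscript, whose actual correction procedure is not described in this review.

```python
import math

# Physical half-life of fluorine-18 in minutes (approximate literature value).
F18_HALF_LIFE_MIN = 109.77


def decay_correct(activity, minutes_since_injection,
                  half_life_min=F18_HALF_LIFE_MIN):
    """Return the activity referred back to the time of injection.

    Radioactive decay follows A(t) = A0 * exp(-lambda * t) with
    lambda = ln(2) / half-life, so A0 = A(t) * exp(lambda * t).
    """
    decay_constant = math.log(2) / half_life_min
    return activity * math.exp(decay_constant * minutes_since_injection)


# Hypothetical plasma sample measured exactly one half-life after injection.
measured = 50.0  # arbitrary units, e.g. kBq/mL
corrected = decay_correct(measured, F18_HALF_LIFE_MIN)

print(f"measured = {measured:.1f}, decay-corrected = {corrected:.1f}")
```

A sample measured one half-life after injection is scaled by a factor of two, so whether or not this correction was applied changes the apparent shape of a plasma curve over a long acquisition.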
3. GigaScience 14 Feb 2023
  
  in GigaScience
  
  Abstract
  
  Reviewer 1: Antoine Verger
  
  Review on "Data Note: Monash DaCRA fPET-fMRI: A DAtaset for Comparison of Radiotracer Administration for high temporal resolution functional FDG-PET" This article is an important contribution in its field. This study is an open access dataset, Monash DaCRA fPET-fMRI, which contrasts three radiotracer administration protocols for FDG-fPET: bolus, constant infusion and hybrid bolus/infusion. The Monash DaCRA fPET-fMRI dataset is the only publicly available dataset that allows comparison of radiotracer administration protocols for fPET-fMRI. Even if the provided dataset is useful for the scientific community, the validation part needs some explanations.
  
  Comments:
  - It is a shame that this dataset is not also available for resting-state fPET-fMRI images. Indeed, many studies are also performed at rest (connectivity in neurodegenerative disorders, for example) and would need some controls. Please discuss the opportunity to provide such databases.
  - Was the administered FDG dose the same for all patients or adapted to body weight? Please detail.
  - The authors should discuss the gender variability across the 3 groups. Metabolism and radiotracer uptake are dependent on gender. The authors should at least include this covariate in their group analyses.
  - Of course, raw data are available. I have nonetheless one question: what is the interest of using PSF followed by a Gaussian filter in reconstructed images? Why use PSF in dynamic (noisy) PET images? Can the authors justify the 16 s frames used for reconstruction of their images? Was this justified by any optimization?
  - The authors further applied a filter of FWHM 12 mm after having previously reconstructed their images with a Gaussian filter? They should choose one of these two filters. Otherwise, the smoothing of the PET images is too strong.
  - For the validation set at the group level, is the PET intensity normalization based on proportional scaling? It is particularly important to understand how the authors have obtained the grey matter mean signal.
  - How was the grey matter mean signal obtained? From a grey matter MRI mask?
  - Could the authors elaborate on how to access open-access reconstruction algorithms, particularly since the images were obtained with a Siemens Biograph? They mention STIR and SIRF: please elaborate. Are they usable by anyone who has no access to a Siemens reconstruction algorithm? Is a specific PSF reconstruction for Siemens implemented?
  - "there has not yet is not yet agreement in the best way to manage" : please rephrase.
  - Figure 1: Please include the conventional MRI sequences at the beginning of the acquisition.
  - Figure 2: Please provide units for signal intensity. It would also be more comfortable to provide elements to distinguish the tasks from the rest periods.
  - Figure 2: is the grey matter signal obtained for all the grey matter or only for the occipital cortex? Should the authors discuss the higher variability observed between patients for the methods with bolus? Is it linked to the different sex ratio between the protocols? Please discuss why one patient in the infusion protocol has a truncated time-activity curve.
  - Figure 3: the authors should explain the variability of fMRI patterns in the GLM although the same protocol was performed. Is there an influence of the coupled glycolytic metabolism?
  - Figure 3: how do the authors explain the absence of correlation with the task in the infusion protocol? (This was not observed in the 3 phases of the protocols for infusion in Figure 5.)
  - Figure 4: please define how the increase in signal percentage was calculated. How was the grey matter normalized at the group level? Proportional scaling can be a source of false positive abnormalities.
  - Figure 5: Can the authors display the changes in connectivity of the occipital area between the 3 phases for each protocol? (By adding a supplemental part at the bottom of the Figure.)
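The concern about applying two smoothing steps in succession can be quantified with a small sketch. For two Gaussian kernels applied one after the other, the variances add, so the effective FWHM is the root of the sum of squares. The 12 mm value comes from the comment above; the 3 mm reconstruction filter width and the function name `combined_fwhm` are purely hypothetical, since the review does not state the width of the reconstruction filter.

```python
import math


def combined_fwhm(*fwhms_mm):
    """Effective FWHM of several Gaussian filters applied in sequence.

    Convolving Gaussians adds their variances, and FWHM is proportional to
    the standard deviation, so the FWHMs combine in quadrature.
    """
    return math.sqrt(sum(f * f for f in fwhms_mm))


recon_fwhm_mm = 3.0   # hypothetical reconstruction filter (not from the source)
post_fwhm_mm = 12.0   # post-reconstruction filter quoted in the review

effective = combined_fwhm(recon_fwhm_mm, post_fwhm_mm)
print(f"effective smoothing FWHM = {effective:.2f} mm")
```

Under this assumption the larger kernel dominates the result, but the total smoothing is always at least as wide as the widest single filter, which is the basis for asking the authors to justify using both.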

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.08.02.454708v1
Jan 2023
www.biorxiv.org www.biorxiv.org

Nanopore-Based Enrichment of Antimicrobial Resistance Genes – A Case-Based Study

1
1. GigaScience 25 Jan 2023
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.75) and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Ned Peel
  
  Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?
  
  Scripts have been made publicly available on GitHub (https://www.github.com/phiweger/adaptive) under an OSI-approved BSD-3-Clause license.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No.
  
  Is the code executable?
  
  Unable to test
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  Not applicable.
  
  Additional Comments: Sent authors accompanying file with comments
  
  https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT0zNjkmZmlsZT0xMzcmdHlwZT1nZW5lcmljJnZpZXc9ZmFsc2U~
  
  Reviewer 2. Julian Sommer
  
  Is the code executable?
  
  The code used for analysis of the data has been published on the corresponding GitHub page. However, a link on this page for downloading data from a public database did not work at the time of testing (resource deleted). Also, although most parts of the code are executable, the data and figures generated by the code do not reproduce the figures from the publication.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Yes. The code placed in the GitHub repository can mostly be executed, but requires basic knowledge of coding in the programming languages used. However, for the data presented in this work, I do not see the need for more detailed instructions.
  
  Is the documentation provided clear and user friendly?
  
  Only partly
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  Only partly. However, I do not see the need for further instructions.
  
  Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?
  
  The data is available from the stated accession numbers, but an additional data link on the github page does not work and might be necessary to test the complete code.
  
  Additional Comments:
  
  The study compared three methods of Oxford Nanopore-based long-read sequencing for the detection of antibiotic-resistant bacterial pathogens. To this end, the authors used cultivation-based detection of carbapenem-resistant bacteria from a rectal swab and subsequent single-isolate sequencing. This technique was compared to an adaptive sequencing approach that uses a database of antibiotic resistance genes for adaptive sequence enrichment during the sequencing run, facilitated by Oxford Nanopore sequencing. The underlying technology is a unique approach, made possible by the Oxford Nanopore real-time sequencing technology, and is of great interest for future applications in clinical microbiology diagnostics. Therefore, this study is of great importance for this field in general. As an additional method, the authors performed metagenome sequencing of the rectal swab without culture, which is a completely different technique with unique advantages and drawbacks compared to culture-based sequencing methods. This study is important for the development of real-time sequencing and adaptive sequencing for the detection of antibiotic resistance genes and, in future, potentially other genes. It focuses on the adaptive sequencing approach, analysing in detail the factors influencing the performance of this new approach. The number of experiments is limited, as stated by the authors, but the data are nevertheless valuable for future projects. For further improvement, I have some suggestions for the manuscript.
  
  1. The comparison of the three methods is quite complex and is one of the main goals of this paper, illustrating that low-cost sequencing devices (Flongle) can be used for the detection of antibiotic resistance genes by applying adaptive sequencing. Therefore, the description of this comparison and figure 1C are essential for understanding the data. However, figure 1C is hard to read and the represented data are not easily accessible. To clarify, I suggest including additional information. Do the “Set size” and “Intersection Size” describe the absolute number of detected antibiotic resistance genes? This information could be included. To strengthen the connection with the legend of figure 1C, the absolute numbers of detected genes could be included in the text, supplementing the already stated relative detection numbers (lines 51-54, 137-142). Since this part of the figure is essential for understanding, a larger version of this representation would be nice.
  
  2. Figure 2 is essential for interpretation of the presented data on variables influencing the adaptive sequencing performance.
  a. Figure 2A is not easily accessible; in fact, I am not sure what information about the data is represented in this part of the figure (data throughput?). The figure legend does not explain what is shown. I suggest clarification or, if applicable, deletion of this subfigure, for increased readability of figure 1B-D.
  b. Figure 2D: The meaning of the “log median read length” is not explained in the text or the figure legend and should be clarified.
  c. Figure 2E: Same as for Figure 2D. In line 119, the absolute read length (3 kb) is stated, but this number is not visualised in this figure. I suggest adding additional information to the text to make the representation of the data in the figure easily discoverable.
  
  3. Discussion: In my opinion, the discussion part has some potential for improvement.
  a. Lines 158-162: The authors argue that selective cultivation and subsequent adaptive sequencing for antibiotic resistance genes leads to rapid results, important for public health responses. Metagenomic sequencing, on the other hand, needs at least equal time and is not cost-effective. However, might the combination of metagenomic sequencing without culture and adaptive sequencing decrease the turnaround time even more without significantly higher costs? Although experiments on this are not in the scope of this study, the authors could discuss this for future applications.
  b. Line 165: “[…] reads were detected for all resistance genes known to be present […]”. This result does not match the results stated in line 141, “57.9 % of the resistance genes found”, and line 184, “nearly two-thirds of all resistance genes”. This should be clarified, or the corresponding data should be referenced in the discussion for readability.
  c. Line 169: Since the identity between sequencing results and database hits is important for detection and for the overall performance of the adaptive sequencing approach, I suggest discussing whether future improvements in sequencing accuracy (basecalling algorithm, pore design) might influence the performance of this approach, as is only briefly mentioned in line 190.
  d. Line 190: “variable sequencing yield of this new flow cell type”: This aspect is introduced only in the conclusion and should be mentioned and discussed beforehand.
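The question in comment 1 about what “Set size” and “Intersection Size” count can be illustrated with a minimal sketch of how these two quantities are commonly derived for an UpSet-style plot. Everything here is an assumption for illustration: the method names, the placeholder gene identifiers and the choice of exclusive intersections are not taken from the manuscript or its figure 1C.

```python
from itertools import combinations

# Hypothetical resistance genes detected by each method (placeholder names).
detected = {
    "isolate": {"geneA", "geneB", "geneC", "geneD"},
    "adaptive": {"geneA", "geneB", "geneC"},
    "metagenome": {"geneA", "geneE"},
}

# "Set size": absolute number of genes detected by each method on its own.
set_sizes = {name: len(genes) for name, genes in detected.items()}


def exclusive_intersections(sets):
    """Count genes found by exactly the methods in each combination.

    Each gene is counted once, in the combination of methods that detected
    it, so the counts add up to the number of distinct genes.
    """
    names = sorted(sets)
    result = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            inside = set.intersection(*(sets[n] for n in combo))
            outside = set().union(*(sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                result[combo] = len(exclusive)
    return result


intersection_sizes = exclusive_intersections(detected)

print(set_sizes)
print(intersection_sizes)
```

Reporting these absolute counts next to the relative detection rates in the text would let a reader check the figure against the numbers directly.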
  
  Minor comments: 1. Figure 1 description: “[…] carrying nine plasmids and four carbapenemases genes […]”. In line 12, the Raoultella isolate is described as carrying three carbapenemases. The OXA-1 beta-lactamase pictured in figure 1A is not a carbapenemase. The correct number should be three carbapenemases. 2. Line 67: Flongle flow cells were introduced in 2019. I suggest deleting “recently introduced”. 3. Line 210: The link is not correct. 4. Line 244: “Community standards”: It would be nice to add an additional reference. 5. Line 255: Reference is missing. 6. Line 283: This step

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.08.29.458107v4
Nov 2022
www.biorxiv.org www.biorxiv.org

Long-read and chromosome-scale assembly of the hexaploid wheat genome achieves high resolution for research and breeding

3
1. GigaScience 27 Nov 2022
  
  in GigaScience
  
  sequencing
  
  Reviewer 3. Murukarthick Jayakodi
  
  Aury et al. have assembled the French bread wheat cv. Renan using Oxford Nanopore long-read technology, an optical map and Hi-C. They achieved a decent N50 of 2.2 Mb and constructed pseudomolecules with a reference-guided approach. The assembly was corrected with a Hi-C map. They annotated ~84% of the genome as repeats and projected gene models from the previously assembled Chinese Spring reference genome. The assembly quality was validated with standard approaches. The Renan assembly showed good collinearity with existing short-read wheat assemblies and pinpointed some large (> 1 Mb) inversions. There is potential to catalogue structural variants, i.e. large INDELs. However, many false positives are expected when long- and short-read assemblies are compared. Nevertheless, they compared a complex tandem repeat region. They used appropriate tools for assembly and downstream analysis. This is an improved additional genome resource for the wheat community.
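  Since the review refers to the contig N50, a minimal sketch of how this statistic is usually defined may help readers less familiar with assembly metrics. The function name `n50` and the toy contig lengths are illustrative assumptions and are unrelated to the Renan assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly (hypothetical lengths, arbitrary units): total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

  Because N50 summarises contiguity only, it says nothing about base-level accuracy, which is why it is usually reported alongside measures such as BUSCO completeness.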
2. GigaScience 27 Nov 2022
  
  in GigaScience
  
  The
  
  Reviewer 2. Gabriel Keeble-Gagnere
  
  The authors report on a new assembly of a French wheat variety, Renan, using Oxford Nanopore sequencing technology combined with short read polishing, Bionano optical maps and Hi-C to validate chromosome-level ordering after anchoring to IWGSC RefSeq v2.1. This is the first study I know of to use Oxford Nanopore to assemble a complete wheat genome, and the results demonstrate that this technology (together with short read polishing, Bionano, Hi-C, etc) can be successfully applied to such a complex genome. Evidence is presented to support the quality of the assembly, but it is mostly at the global statistics level (eg: contig N50, total size of gaps) or macro-scale (whole chromosome dotplots). One detailed comparison between Renan and Chinese Spring of a biologically important region is presented. The assembly is clearly of a high standard and is a valuable addition to the growing set of wheat varieties assembled to chromosome-scale. However, given the high quality of the IWGSC RefSeq v2.1 assembly (Zhu et al. (2021)), the claim that this assembly "achieves higher resolution for research and breeding" is quite strong and needs to be supported by more evidence. Given what is presented here, a more accurate statement might be "achieves higher contiguity and local completeness". The high contig N50 of 2.2Mb is highlighted but I feel that more work is needed to demonstrate that the sequence is free of artefacts. The authors show in Figure 2 that this assembly has the lowest (though only slightly) complete BUSCO score out of the wheat genomes they compare with. Is it possible that some regions cause problems for the Oxford Nanopore technology and are either fragmented or completely absent from the assembly? Bionano maps were used but no evidence is presented to show the level of agreement with the assembled sequence and Bionano maps, as is done in Zhu et al. (2021).
  
  In summary I think there are two key things to address: 1) More evidence supporting that the assembly is locally accurate, especially validation with alignment to Bionano maps; 2) Some results presented to relate this assembly to the existing chromosome-scale assemblies of wheat genomes.
  
  To address these points, I think the following would greatly enhance the paper:
  
  a) Using any method (eg: the method in Brinton et al. (2020)), identify identical-by-state haplotypes between Renan and Chinese Spring and the chromosome-scale assemblies from Walkowiak et al. (2020). This analysis would essentially produce a table which would be valuable supplementary data. A figure similar to Figure 3 (b) from Walkowiak et al. (2020) for a single chromosome, showing the regions of the existing wheat genomes sharing haplotypes with Renan would help place this genome into context.
  
  b) This then defines large regions of the Renan assembly that can be directly compared at the base level to other assemblies. Select 2 or 3 examples to show how the Renan sequence compares to the equivalent region in other assemblies, and show the Bionano validation of Renan sequence together with presence of genes and gaps in each assembly being compared. Since the sequences being compared here should be the same (based on the previous step above), the genes from the Renan annotation can be mapped across and directly compared. This would provide direct evidence for the higher quality assembly being claimed. Figure 5 is a good comparison of a biologically important region, but it is unclear if the region in Chinese Spring and Renan is the same haplotype or not. This needs to be clarified at the start of this section. If the same, then the comparison is of two regions expected to be basically identical (and could be one of the examples used in the proposed comparison analysis above); if different, then that needs to frame the discussion since the region in Chinese Spring could theoretically contain different genes or more repeats, for example.
  
  Centromeres are not mentioned, though it is known to be a particularly difficult region in wheat genome assemblies. How do the centromeres look in this assembly and how do they compare to previous wheat assemblies? Do the Bionano maps validate the assembly in the centromere region? The analysis in point a) above would identify centromeres in common with other assemblies. Likewise, the distal ends of chromosome arms, including the telomere sequences, are known to cause problems for Hi-C ordering and orientation. Again, the Bionano alignments demonstrating correct ordering would be particularly valuable.
  
  Figure 2 should be a supplementary figure.
3. GigaScience 27 Nov 2022
  
  in GigaScience
  
  Abstract
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac034) and has published the reviews under the same license. These reviews were as follows.
  
  Reviewer 1. Sean Walkowiak
  
  First review: Comment 1: The authors could more clearly and accurately present and discuss sequencing and assembly approaches, including the advantages and limitations of the ONT assembly presented here
  
  While the standards of 'quality' for assemblies are evolving, there are standard sets of 'science-based' criteria for considering the quality of a genome, such as the 14 criteria listed in the manuscript here: https://www.nature.com/articles/s41586-021-03451-0#Tab1. Many of these criteria are ambitious, particularly for wheat due to its size and complexity, and many criteria are not met using previous assembly approaches, or the approaches used in this study. It is true that CS and 10+ Wheat Genomes do not use long reads; however, these assemblies are valuable and have been rigorously validated using 10X Genomics, Hi-C, and long read data. They also perform well for TE content, BUSCO (as outlined by Tables 1 and 2 and Fig 3 in this manuscript), and they were actually used in this MS as a reference for guiding the ONT assembly. I would also expect that they have a better base pair accuracy than the assembly presented here. I therefore suggest that the authors revise their statement "these assemblies have been produced using short-read technologies and are therefore not up to the quality standard of current genome assemblies". If the authors wish to discuss assembly quality, which I recommend they should, I suggest focusing on advantages and limitations of each technology and assembly approach in a constructive way, perhaps with a stronger focus on the value of the ONT resource developed here. In regards to base pair accuracy, ONT is at a disadvantage to short reads or to PacBio. This is particularly true in the context of HiFi reads, which have increased accuracy over ONT and Illumina and have greater lengths than Illumina, but PacBio and HiFi are not discussed. This is not to say that PacBio is superior in every way, the reads from ONT are longer and these hold a significant value. 
As an example of differences between PacBio and ONT that might provide useful context to describe the differences between ONT and PacBio approaches, please see: https://pubmed.ncbi.nlm.nih.gov/33319909/; for differences between short read (TriTex) and PacBio, please see https://www.nature.com/articles/s41586-020-2947-8 . All of these approaches are valuable but have both advantages and limitations, with ONT also having many clear advantages and disadvantages. But these need to be clearly communicated and supported, either through the results of this study or through the literature. For example, in the discussion, the authors state that "ONT devices HAVE a real advantage over other long-read technologies". There is only one other long read sequencing technology, so if you are saying that ONT HAS a 'real advantage' over PacBio based on read length, this is valid, but can be stated more explicitly and with examples of the read lengths from this study and the literature. It is then stated that the "error rate is drastically reduced for nanopore"; again, this is valuable and a great advancement in regards to ONT, but it would be wise not to dismiss that this error rate is still higher than PacBio HiFi, which again can be stated explicitly with support from the literature. While both of these concepts are important, after they are stated, they are not actually discussed or framed to highlight the work from this study. The true advantage of ONT, even over PacBio HiFi, is that the long reads can resolve more complex regions that span greater distances, which are abundant in wheat (see reference from above). The authors are presenting an exciting and valuable resource with this genome assembly and this assembly has advantages due to the application of ONT, for the reasons mentioned above regarding long complex regions, but these are not fully highlighted and the authors do not take full advantage of what this assembly has to offer. 
I think the authors should provide additional context and support related to the value and drawbacks of their ONT assembly. The advantages are discussed superficially at the gene level through a couple of examples (Fig 5), though none of these examples are supported with any significant biological data or validation analysis. There are many interesting features of genomes that are captured by ONT that are not captured well by short reads or PacBio, and it is unfortunate that these are not explored in any significant depth in the manuscript.
  
  Comment 2: Some of the 'highlighted features' in the manuscript could be better selected/executed
  
  This comment relates to the previous comment on having little detail on what the ONT genome is uniquely capable of providing over other approaches. Instead, the authors focus on some anomalies in the D genome as well as differences in the nanopore software for base calling. It is unclear to me what the objective is of the report on the D genome. I suspect that this may be due to differences in repeat content between D and the other subgenomes, or an artifact of the tools and analyses used. Page 6, Figures S1 and S2, may be a consequence of poor read filtering for reads that align ambiguously, i.e., perhaps reads from A and B may crossmap at a greater likelihood than those from D due to differences/similarities in repeat content between subgenomes. Once reads are aligned, the alignments should be properly filtered using standard 'best practices for NGS'. I do not see that any filtering or analysis of cross mapping was performed, but I may have missed it. Once the alignments are filtered, read coverage dips and peaks can then be assessed statistically using tools such as CNVnator and cn.mops, which are designed specifically for comparative read depth analysis since depth may not be normally distributed, rather than arbitrarily looking at 2 times the median. There may be differences between genes and intergenic regions in terms of mapping accuracy, so it may be ideal to interrogate read depth for those separately. The increase in gaps is also interesting and I wonder if this could be due to the read accuracy of ONT and read mapping and assembly biases when having similar subgenomes.
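
As a rough illustration of the filtering and depth-outlier steps suggested above, the following Python sketch drops ambiguously mapped reads and then flags windows whose depth deviates strongly from the median. It is not the authors' pipeline and it does not reproduce CNVnator or cn.mops; the record layout (`mapq`, `secondary`), the MAPQ cut-off of 30, and the median-absolute-deviation rule are all assumptions chosen for the example.

```python
from statistics import median


def filter_alignments(alignments, min_mapq=30):
    """Keep primary alignments with mapping quality >= min_mapq.

    Each alignment is a hypothetical dict with 'mapq' and 'secondary' keys.
    """
    return [a for a in alignments
            if a["mapq"] >= min_mapq and not a["secondary"]]


def flag_depth_outliers(depths, k=3.0):
    """Return indices of windows whose depth lies more than k robust
    standard deviations (1.4826 * MAD) away from the median depth."""
    med = median(depths)
    mad = median([abs(d - med) for d in depths])
    scale = 1.4826 * mad
    if scale == 0:
        return []  # MAD is zero, so this rule cannot rank deviations
    return [i for i, d in enumerate(depths) if abs(d - med) > k * scale]


# Toy data: three reads, two of which are removed before depth is counted.
reads = [
    {"mapq": 60, "secondary": False},
    {"mapq": 3, "secondary": False},   # ambiguous placement, e.g. cross-mapping
    {"mapq": 60, "secondary": True},   # secondary alignment
]
kept = filter_alignments(reads)

# Toy per-window depths, assumed to be computed after filtering.
depths = [10, 11, 9, 10, 12, 10, 40, 0]
outliers = flag_depth_outliers(depths)
```

A robust scale is used here only because, as the comment notes, depth may not be normally distributed; a dedicated read-depth tool would model this more carefully.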
  
  Nevertheless, the results and discussion on the D genome are interesting but distracting and likely reflect that the authors should take more time to explore their data and its biases before presenting this information. In summary, I believe that additional work is needed to bring value to the read depth and D genome analysis should the authors choose to include this in the manuscript. While I agree that it would be useful to communicate that a significant gain was observed when basecalling using the more accurate basecaller, the emphasis on this is disproportionate to its value in the manuscript. The observation of a better assembly when using reads from a more advanced basecaller is not something new. As for the error rate of the ONT between organisms (yeast and wheat), with a sample size of 2, I do not think that this is worth presenting or discussing in any detail. While this may just be an artifact of the DNA quality itself from two experiments, I suspect that this may be a valid result from the manuscript and due to sequencing repeats, which are more abundant in wheat, in combination with how these basecallers self train to be more accurate. While this is certainly valid, it is not novel or interesting. This result comparing species was not tested with sufficient scientific rigor/evidence, it distracts from the central result of the manuscript, and just reaffirms something that we already know about the basecalling software, the challenges of sequencing homopolymers, and the importance of getting accurate reads using the more advanced basecalling methods.
  
  Comment 3: Why Renan? This comment relates to the other two comments on the selected areas of focus. The biological story, which was on gliadins, was of some value and highlighted some of the advantages of an ONT assembly, but this was not supported by any significant biological data. Renan is a well-known cultivar with abundant genomic data, mapping populations, trait data for diseases, etc. It is unfortunate that the authors could not use the genome to dig deeper to more thoroughly demonstrate the value of this assembly specifically in the context of ONT and genomics of wheat or the biology of wheat and Renan, specifically. With abundant QTL data available specifically for Renan, these could have easily been anchored to the assembly to highlight novel transcripts from the RNAseq from this study, just as an example. Even the comparisons of the Renan assembly to other available assemblies were mostly superficial and did not highlight in significant detail the value of having an ONT assembly or the value of having data specifically for Renan. While a detailed 'biological story' may be beyond the scope of this manuscript, there was minimal effort to highlight the value of the assembly, and this comment is more of a larger reflection that more could have been done to highlight the value of the genome to support the authors' vague claims that the genome "will benefit the wheat community and help breeding programs".
  
  Minor Comments: The absence of numbered lines made it difficult to provide more detailed feedback, but there are minor items throughout, so I suggest numbering the lines and also giving the manuscript a thorough review. I appreciate that the authors present and suggest methods for future assembly of complex genomes using ONT but, contrary to the abstract's statement that 'we also provide the methodological standards to generate high-quality assemblies of complex genomes', I would argue that the standards used for ONT assembly are known and are not established here. I would also suggest caution when stating that the methods here should be considered the 'standard' for the reasons indicated in Comment 1 regarding other approaches used to assemble complex genomes, such as PacBio/HiFi, and the lack of a proper investigation/discussion/comparison of assembly quality.
  
  Page 2: last line - what is the abbreviation ca. ? Table 1: BUSCO is presented twice with different values. Table 1 and 2 use different versions of RefSeq, I would stick to one version. It is unclear to me what trend or result the authors are trying to present in figure 1, which I would say is common for circos plots. Presenting data 'for the sake of presenting it' is not terribly valuable and I would encourage the authors to use the figures to present a trend or result that is impactful. In addition, the data that is presented is not presented clearly, and is cryptic. The roman numerals in the figure caption for Figure 1 are not actually in the figure. The caption also indicates that the dots indicate lower and higher values, but not of what - perhaps density of gaps? The color scales are not presented for each track. Two of the color scale palettes also look similar.
  
  Page 6: 62% of exons were identical, which means 48% had SNPs, so the authors argue that SNPs are therefore rare at 48% of exons? I do not think that 48% of exons having SNPs is rare; I think that this would mean that nearly half of exons have SNPs, so this is therefore common. Perhaps this statistic is misleading and the focus should instead be on the 0.7% divergence. How does this value compare with other within species comparisons of gene content and could this be an artifact of ONT accuracy? This question relates to a general comment that the authors could do better at bringing in relevant comparisons or parallels from the literature throughout the manuscript to bring value to any findings or insights they are presenting, particularly in the context of other ONT assemblies.
  
  Page 7, capitalize the T for technology, it is part of the name of the company and is a proper noun. This is repeated elsewhere.
  
  Page 7: 'on wheat'? This statement could be written more clearly. The way that the text is worded, it sounds like the basis for selecting the SmartDenovo assembly was the number of unknown bases, when I suspect it was actually a multitude of factors (BUSCO, gene or TE content, assembly stats, etc). While I do not question the selection of the assembly, I do suggest a clearer presentation of the information. I appreciate that the authors presented the data from multiple assemblers; one of the concerns with ONT is that the read accuracy is low and this may lead to issues in assembly of complex polyploids with similar subgenomes. I suspect that based on this study, it is clear that this is a valid concern for some assemblers, but may have been overcome in others, though none of this is explored or discussed. Again, is there any information in the literature contrasting assemblers that could provide insights into what you observed?
  
  Searches at 90% identity and coverage for genes and TEs are not strict, especially with genomes that have highly identical subgenomes. If you reduce your thresholds enough, all features will map to your genome...
  
  The choice of language is often subjective or not representative of the results. For example, the 'extremely' similar TE content between Renan and CS. Why not say it is similar and actually report a value or a % difference? This would be more concise and informative than using vague and overzealous language. Page 8, short reads (dash or no dash?) The font sizes in Figure 2 are very small.
  
  The RNAseq is not really presented at all, except in the Materials and Methods. I thought the genes were ab initio predicted until I saw RNAseq in the materials and methods. I suggest at least making a note of RNAseq in the results and/or discussion since this additional effort does bring added value to the annotations and the manuscript. The discussion says de novo annotations, but I suggest explicitly stating that RNAseq was performed.
  
  Figure 3 C and D do not have horizontal axis labels, the top should be labelled as subgenome, bottom as chromosome, and the vertical axis (not the top) should be labelled as number of gaps and chromosome length. Same comment for labelling of vertical axis for panels A and B, horizontal axis should be labelled as genome assemblies, which are reflected in the palette/legend. Note that many of the colours in this palette are similar and difficult to differentiate, it may actually take less space to label the bars with each wheat line to make it less cryptic.
  
  How were the dotplots in figure 4 generated? Perhaps I missed it in the materials and methods. Also, none of the axes have labels or units, etc.
  
  Much of the text in Figure 5 is too small and illegible.
  
  Page 10: The discussion is superficial and vague and should provide an accurate and pragmatic discussion of the results in the context of the literature. For example, the manuscript boasts a 'higher resolution'... but of what? Perhaps 'complex repetitive regions'? To reiterate my previous comment on the lack of literature support throughout the manuscript - Were these 'higher resolutions' of <complex repetitive regions> comparable to what was observed in the literature when ONT was applied to other systems? Again, these advantages of ONT and the assembly could be more thoroughly
  
  Re-review:
  
  The revised manuscript addresses the major concerns/comments. The assembly and its report are an exciting new resource for the wheat community. I only have one very minor comment below:
  
  When writing variety names in text and figures, it is important to be exact because there are many varieties with similar names internationally. CDC Stanley, not "Stanley"; CDC Landmark, not "Landmark"; "LongReach Lancer", not "Lancer", not "LongRead Lancer" - typo on line 308. I suggest performing a thorough check throughout.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.08.24.457458v2
www.biorxiv.org

The first complete mitochondrial genome of Diadema antillarum (Diadematoida, Diadematidae)

1
1. GigaScience 27 Nov 2022
  
  in GigaByte
  
  ABSTRACT
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.73), and the reviews have been published under the same license. There is also a Spanish language version of this preprint available in SciELO preprints:
  
  Description
  
  The peer reviews are as follows.
  
  Reviewer 1. Joseph Lopez
  
  This is a review for the manuscript “The first complete mitochondrial genome of Diadema antillarum (Diadematoida, Diadematidae)”, Majeske et al., DRR-202205-01. This is a very interesting topic. The methods and results are clearly explained. The original figures are very good and descriptive. The authors have competently analyzed the data and written a succinct manuscript. Marine biologists understand the legacy and impact of the Diadema epidemic from the 1980s. Therefore, it is important to help bring this species back from the brink, if not to dominance, in the Caribbean again. This could possibly happen with more systematic and molecular genomic characterizations such as this study. Was this project part of a larger project to sequence the whole Diadema genome? If so, the authors could state this and not be penalized. Due to the large number of mtDNA molecules, assembling the mitochondrial genome is commonly done in whole genome projects. Having the mtDNA properly assembled is now a great asset for conservation and population genetics.
  
  Reviewer 2. Remi N. Ketchum
  
  Are all data available and do they match the descriptions in the paper?
  
  Yes. The GitHub is up to date but I cannot yet access the NCBI databases although numbers are provided (likely submitted but not publicly available).
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. I would suggest that the authors also make their alignments available to the public.
  
  For additional comments see: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMzQzL1Jldmlld19HaWdhQnl0ZS5kb2N4
  
  Reviewer 3. Andreas Kroh
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The data was not provided together with the manuscript, so I am unable to check this. The manuscript, however, states that the data will be deposited in GenBank
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. Locality details missing, Voucher specimen number missing, Repository institution for voucher specimen not identified.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. See details below.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. See details below.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. Unclear - some detail is missing in the methods section to allow judgement - see details below.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. See above and below - voucher specimen number is missing, some methodological information is missing, references to original papers providing sequences used in the analysis are missing, etc. - see details below.
  
  Additional comments.
  
  The manuscript by Majeske et al. on the mitogenome of Diadema antillarum is an interesting contribution to the phylogeny of Echinoidea. There are, however, a number of issues which should be addressed in a revised version, in my opinion.
  1) Please provide coordinates for the sampling site (and a locality name) instead of a general region.
  2) Please provide the repository number and institution where the voucher specimen has been deposited.
  3) Did you verify the identification and make sure that this is D. antillarum rather than D. africanum (which allegedly has repopulated some D. antillarum habitats in the Caribbean and GoM)? For a morphological comparison see: Rodríguez, A., Hernández, J. C., Clemente, S. & Coppard, S. E. 2013. A new species of Diadema (Echinodermata: Echinoidea: Diadematidae) from the eastern Atlantic Ocean and a neotype designation of Diadema antillarum (Philippi, 1845). Zootaxa 3636, 144-170.
  4) Please report the insert size that has been targeted during library prep (typically either 350 bp or 550 bp for the kit mentioned).
  5) Explain why the S. purpuratus mitogenome was used to map the reads rather than one of the diadematid mitogenomes.
  6) Please explain why the custom assembly pipeline was used rather than one of the well-established assemblers like SPAdes, Abyss, Velvet, etc.
  7) Please provide a coverage graph.
  8) The position of the non-coding region is given in # bp, but without information on which feature is considered as zero in a linearized version of the circular sequence the position is useless.
  9) Please explain what exactly was used for the analysis: the full nucleotide sequence including non-coding regions, just the CDS of the protein coding genes, or …?
  10) Please add the reference to the original papers that published the sequences you use in the tree.
  11) Please explain the choice of the model used in the analysis. Was some Modeltest run?
  12) Please provide the fasta file together with a revised version to allow checking the quality of the annotation etc.
  13) Fig. 1: please provide some information on the photo shown. Is this the specimen that was sampled? Add this info and the locality in the caption.
  14) Fig. 2: add the accession numbers in the tree and highlight the new sequence.
  15) Please see additional minor comments in the annotated version, which is attached.
  Summing up, I recommend acceptance after major revision. Kind regards Andreas Kroh, NHM Vienna, 10/7/2022
  
  See following file. https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMzQzL2d4LURSLTE2NTM2ODM4NTRfQUsucGRm
  
  Re-review: The revised manuscript of Majeske et al is much improved in comparison to the initial submission. Some of the questions raised in the previous review, however, remain open and other new aspects have appeared.
  Open issues:
  2) Please provide the repository number and institution where the voucher specimen has been deposited --> this issue has not been addressed in the revised version; it is unclear if a voucher specimen has been deposited or not, where it is stored and which inventory number it has; if the specimen has not been retained, this is unfortunate, but not a huge issue - it still needs to be clearly/openly stated
  3) Did you verify the identification and make sure that this is D. antillarum rather than D. africanum (which allegedly has repopulated some D. antillarum habitats in the Caribbean and GoM) --> this issue too has not been addressed; at the very least I would expect a statement that the authors were aware of this second Atlantic Diadema species and how they made sure they really had D. antillarum
  7) Please provide a coverage graph --> the coverage graph is mentioned in the text, but not provided in the paper
  9) Please explain what exactly was used for the analysis - the full nucleotide sequence including non-coding regions, just the CDS of the protein coding genes, or ? --> this is still unclearly formulated in the paper - I assume the whole mitogenome sequence was used, but the wording is very ambiguous; this needs to be very clearly stated in the material and methods section
  11) Please explain the choice of the model used in the analysis - was some Modeltest run? --> this information is still lacking
  
  New issues:
  A) The description of the assembly process is still rather unclear - this needs to be better explained. For example, was any kind of preprocessing (read trimming etc.) done? Which parameters were chosen for the various programs employed? How did the two-stage read extraction process really work? The wording in the manuscript is very unclear regarding this aspect.
  B) The raw data need to be deposited in the GenBank Short Read Archive (SRA); in the Github repository only the extracted mitochondrial reads are available - this is insufficient to repeat the assembly process and analyses carried out in the present manuscript.
  C) The fasta file included in the Github repository has 23 positions that are redundant (overlapping with the start of the sequence) - they need to be removed before submission.
  D) There is some inconsistency on the length of the mitogenome: the text says 15,708, the figure says 15,707 - the latter, judging from the files in your Github repository, is correct --> please make sure the information given is consistent.
  E) No information is given on the reason for choosing the particular evolutionary model that has been used in the phylogenetic analysis.
  F) The phylogenetic analysis has been done by NJ methods, which are fast but can be subject to a lot of problems; it would be better to use Maximum Likelihood (or Bayesian) methods.
  G) The authors have made an important discovery in relation to the mitogenome deposited as "Echinothrix diadema" in GenBank. Rather than speculating on the reasons why it is the sister of D. antillarum in their analysis, the authors should simply test which of their hypotheses (AT-bias vs. misidentification) is correct. All the tools that are needed are already available in GenBank! There is an extensive dataset of three mitochondrial markers (12S, ATP6, ATP8; https://www.ncbi.nlm.nih.gov/popset/?term=MW329515 etc.) available for Echinothrix, which includes hundreds of sequences and encompasses material from the complete geographical range of the genus (Coppard et al. 2021 https://www.nature.com/articles/s41598-021-95872-0). In addition, there are 16S sequences available for D. savignyi, the suspected candidate of the misidentification. I have downloaded these sequences and run preliminary analyses with a subset of the sequences. These clearly show that the "E. diadema" mitogenome has nothing to do with true E. diadema and that it is a Diadema. While the data basis for Diadema is less extensive than for Echinothrix, there are 16S sequences of D. savignyi (GenBank PopSet: 673458050) that are identical to part of the 16S sequence of the alleged "E. diadema" mitogenome. Thus I am convinced that the second hypothesis (misidentification) of the authors is correct. This is an important finding that should be discussed in depth in the manuscript. I am including the alignments and trees that I made in the attachment - similar analyses and trees should be included in the manuscript. Link to download the attachments: https://we.tl/t-y7ypbnZYPQ
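
The redundancy described in issue C (a linearized circular sequence whose end repeats its start) can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function name and the toy sequence are hypothetical, and it looks for an exact suffix/prefix match, which would miss an overlap containing sequencing errors.

```python
def trim_circular_overlap(seq, min_overlap=1):
    """Remove the longest suffix of `seq` that exactly repeats its prefix.

    Returns (trimmed_sequence, overlap_length). Overlaps shorter than
    `min_overlap` are ignored, because very short matches can occur by chance.
    """
    max_k = len(seq) // 2
    for k in range(max_k, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:-k], k
    return seq, 0


# Toy example (not the deposited mitogenome): the last 3 bases repeat the first 3.
toy = "ACGTTGCAGGACG"
trimmed, overlap = trim_circular_overlap(toy, min_overlap=3)
```

For a real mitogenome, a larger `min_overlap` would be a sensible guard against spurious matches in low-complexity regions.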
  
  Summing up, I recommend acceptance after major revision. Kind regards Andreas Kroh, NHM Vienna, 11/9/2022

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.10.05.510842v1
www.biorxiv.org

PhysiPKPD: A pharmacokinetics and pharmacodynamics module for PhysiCell

1
1. GigaScience 10 Nov 2022
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.72), and the reviews have been published under the same license. These are as follows.
  
  Reviewer 1. Jeffrey West
  
  This is a very nice & useful extension to PhysiCell, in order to model PK/PD dynamics in agent-based simulations. Overall, the description of the software is good and easy to follow, but I offer a few suggestions for clarity:
  
  In "Statement of Need" -- the phrase "how much gets to the cells and what they then do to the cells" is vague and casual -- maybe use standard terms like drug exposure & response to describe PK/PD relationships
  
  Final sentence in "Statement of Need" that says "Substrates can target any cell type with PD dynamics" -- can you elaborate? Does this indicate that every cell type can have unique PD dynamics?
  
  In "Implementation" authors refer to Figure 2A and 2B but figure 2 only has one panel -- perhaps this should be figure 1A/B?
  
  In "Pharmacodynamics" -- "the list of PK substrates and the list of PD substrates need not have any relationship" -- this is slightly confusing. I assume that every substrate can have associated PK dynamics without having a PD dynamic, but is the opposite true? If so, what is the drug dispersal / decay rate?
  
  Finally, the discussion section is focused mainly on future steps. I think it would be helpful for the discussion to focus more on current advantages and functionality. This is the publication record for this software, and as is often the case, future steps may be subject to change.
  
  Reviewer 2. Boris Aguilar
  
  Is the code executable?
  
  This code cannot be in an executable form, as it is an extension to PhysiCell.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  I am not familiar with running PhysiCell
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  The authors claim this is the first time a PKPD module has been added to PhysiCell.
  
  I think there is a mistake in calling Figure 1 in the Installation section; it should be Figure 1.
  
  Reference to PhysiBoSS missing
  
  Figure 1 - I think there is a mistake in calling Figure 1 in the Installation section; it should be Figure 1.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.09.12.507681v1
Oct 2022
www.biorxiv.org

Message in a Bottle – Metabarcoding Enables Biodiversity Comparisons Across Ecoregions

4
1. GigaScience 10 Oct 2022
  
  in GigaScience
  
  biomonitoring
  
  Reviewer 4. Christina Lynggaard
  
  This manuscript assesses the variation in arthropod communities in three ecoregions in Canada. The study is well done, and the sampling was very thorough with a big sampling effort. I only have minor comments. Specifically, I consider that the aim can be focused on the ecoregions instead of the feasibility of the method, as this has already been shown. In addition, it would be nice to have more details in certain sections in the data analyses and in the results. I have addressed these comments below.
  - I am not sure why the title is "Message in a bottle".
  - Line 65: Could you specify which indicator species have been targeted? Or cite studies that target those species?
  - Line 96: Based on the limitations of the ecoregions, it is not clear why ecoregions are an obvious candidate.
  - In line 104 it seems that your aim is to demonstrate how feasible it is to use metabarcoding for large-scale monitoring and that you use the ecoregions to prove that. However, showing the feasibility of this method for large-scale studies has already been done (e.g. Svenningsen et al 2021, Detecting flying insects using car nets and DNA metabarcoding; Bush et al 2020, DNA metabarcoding reveals metacommunity dynamics in a threatened boreal wetland wilderness). I suggest keeping it focused on the need to apply this method in different ecoregions.
  - In the Data description section, you mention that you examined phylogenetic diversity, but in the Analyses section you only vaguely mention it. The phylogenetic diversity findings are discussed later on, but it is difficult to follow the discussion when the results were not presented previously. In addition, the authors use the findings in phylogenetic diversity to support the idea of a structure in the ecoregions, so I suggest placing more emphasis on this in the results section.
  - Line 189: I agree that the higher number of BINs could be due to eDNA, but couldn't another reason be that the BINs were oversplit during data analysis?
  - Line 215-217: Has this been found previously in other studies using Malaise traps? If so, please reference those findings.
  - Line 222: This is a brief discussion about temporal turnover. However, these results are not presented previously, or at least not clearly enough.
  - Line 266-267: Yes, you showed compositional shifts using metabarcoding in bulk arthropod samples, but the way this sentence is structured it sounds like you are the first to show this. Compositional shifts in arthropods have been shown previously in other studies using metabarcoding.
  - Line 321: Did you have negative PCR controls? In line 326 you mention negative controls, but I assume you refer to the extraction negative controls.
  - Line 340: It is not clear why you queried the data against a bacterial library.
  - Line 348: What was the reason for choosing "at least three reads"? And the same for line 350, where you cluster sequences with a minimum of 5 reads per cluster.
  - Line 357: If you see tag switching in your negative controls, that means that most likely you have it in the rest of the data. How did you ensure that the rest of the data did not have that? You may have tag switching in sequences not found in the negative controls but found in your samples.
  - Line 369: As you used the Bray-Curtis index in this metabarcoding data, did you convert your data to presence/absence? It is known that for metabarcoding data the use of read numbers for community analysis is not adequate (see Nichols et al 2018 "Minimizing polymerase biases in metabarcoding").
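
To make the last point concrete, the sketch below contrasts Bray-Curtis dissimilarity computed on raw read counts with the same index computed after conversion to presence/absence. It is a minimal Python illustration with invented read counts, not the authors' analysis; the function names are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equally long count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def to_presence_absence(counts):
    """Replace every positive read count with 1 and every zero with 0."""
    return [1 if c > 0 else 0 for c in counts]


# Invented read counts for four BINs in two samples.
sample_a = [100, 0, 5, 0]
sample_b = [1, 1, 0, 0]

bc_reads = bray_curtis(sample_a, sample_b)
bc_presence = bray_curtis(to_presence_absence(sample_a),
                          to_presence_absence(sample_b))
```

With these toy numbers the read-based value is dominated by the single BIN with 100 reads, whereas the presence/absence value reflects only which BINs are shared.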
2. GigaScience 10 Oct 2022
  
  in GigaScience
  
  Traditional
  
  Reviewer 3. Kingsly Beng
  
  Steinke et al used DNA metabarcoding of malaise trap samples from 52 protected areas spanning three Canadian ecoregions to assess the spatial patterns of arthropod biodiversity. The research question is relevant and interesting, the study is well designed, data collected are comprehensive, and manuscript is well written and easy to follow. I enjoyed reading it and would like to thank the authors for such a great contribution. My main concern is that the temporal aspect of the study was not explored even though it was mentioned as part of the research objective. Specific comments L60-62: These reductions are not only for abundance but also for diversity, at least based on the fourth reference cited here. I would therefore include "diversity" or "richness" in this statement. L63 & L105: The authors use biosurveillance in some places in the text and bio-surveillance in others. Isn't it better to stick to the same spelling all through, at least for consistency? L132: I am a bit confused here. Are these "Analyses" or "Results"? The whole subsection from L133-L176 read like results to me. L329: "of" omitted! Five samples were available from each of the other 22 sites... L332-334: The first "following" in this sentence can be either omitted or that part of the sentence completed using "manufacturer's instructions" L345-346: "Reads were trimmed 30 bp from their 5' terminus with a set trim length of 450 bp". Perhaps this needs more clarification. The amplified length was 463 bp, trimming 30 bp gives 433 bp. How then can set trim length be 450 bp? L348-349: What was the criterion for using "at least three reads matched an OTU in the reference database"? I mean why not at least two or at least four reads? If this was arbitrary please clarify. L349-350: Same question as above, why use "a minimum of five reads per cluster"? It would be nice to indicate if any benchmarking was applied a priori or if this was set arbitrarily. 
L346-349: Since the authors were mostly interested in arthropods, were reads that matched sequences from bacteria (SYS-CRLBACTERIA), chordates (SYS-CRLCHORDATA) and non-arthropod invertebrates (SYS-CRLNONARTHINVERT) discarded or retained? This should be mentioned here, and estimates of the number of reads, BINs or OTUs matching each of these categories should be provided.
L149-153: These are interesting results. It would be nice to present them graphically, at least in the supplementary.
The aim of the study was "to assess spatial and temporal variation in species richness and diversity in arthropod communities from 52 protected areas spanning three Canadian ecoregions", but the temporal aspect of the study was not fully explored. Although it is stated that "trap catches were harvested every second week from early May through September", this information has not been used in the analysis. Should the aim of the study be redefined and restricted to just spatial patterns then?
L152-153: Without any table or figure to support these results, why not provide the actual number, proportion or percentage of BINs for each arthropod order in the text?
L157-158: Please add some symbols (e.g. asterisks *, **, *** or letters a, b, c) to Figure 3b to represent significant differences. Looking at the present figure without referring to the text does not tell the reader whether the differences are significant. Besides, the authors only report a single p value (p < 0.003), which probably means at least one of the groups is different from the others, but they failed to report the pairwise multiple comparison tests that tell the reader which pairs or groups (e.g. ECF vs EGL, ECF vs SGL, EGL vs SGL) are significantly different.
L159: Are the patterns similar if you control for the total number of sites per ecoregion? For example, taking 12 sites per ecoregion and resampling them 100 or 1000 times, similar to the approach used for beta diversity.
It could be that one site is driving this pattern, as shown in Figure 2b and reported in L141: "...with more than a third (9,301) found at only one site (Figure 2b)".
L164-166: Please provide the full PERMANOVA results in a table in the text or supplementary and reference it here. It is not clear what "decreased site elevation (R2 = 0.035, P = 0.03)" means.
L168-171: Do these patterns change or remain the same if the same number of sites per ecoregion is used? This needs to be tested given that one site (probably from ECF or EGL?) is disproportionately species-rich and SGL has the lowest number of sites.
L173-176: What about levels of turnover across time? Were there any temporal trends in alpha and beta diversity? Was the temporal aspect dropped from the study objective, and if so, why?
L221-223: Same question as above: were temporal changes in species composition considered? Which results, tables or figures point to this, or how did the authors arrive at these statements?
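
The equal-effort resampling that Reviewer 3 suggests for L159 and L168-171 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the function name `resampled_richness`, the input layout (one set of BIN identifiers per site, grouped by ecoregion) and the default of 12 sites and 1000 iterations are hypothetical choices made only for this example.

```python
import random


def resampled_richness(sites_by_region, n_sites=12, n_iter=1000, seed=0):
    """Mean pooled BIN richness per ecoregion over equal-size site subsamples.

    sites_by_region maps an ecoregion label to a list of sets, one set of
    BIN identifiers per site (hypothetical input layout).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = {}
    for region, sites in sites_by_region.items():
        if len(sites) < n_sites:
            raise ValueError(f"{region} has fewer than {n_sites} sites")
        richness = []
        for _ in range(n_iter):
            # Draw n_sites sites without replacement and pool their BINs.
            chosen = rng.sample(sites, n_sites)
            richness.append(len(set().union(*chosen)))
        means[region] = sum(richness) / len(richness)
    return means
```

Comparing the returned means across ecoregions shows whether richness differences persist once every ecoregion contributes the same number of sites. Dropping the single most BIN-rich site from the input and rerunning is one way to check whether that site drives the pattern.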
3. GigaScience 10 Oct 2022
  
  in GigaScience
  
  Background
  
  Reviewer 2. Shanlin Liu
  
  Steinke et al. used a metabarcoding method to investigate the species compositions of 410 insect bulk samples collected in 3 ecoregions. The manuscript is well written, and all the materials and methods are clearly described. I think the manuscript should be accepted for publication after addressing several minor issues as follows:
  1. Line 126: as Ion Torrent is not widely used nowadays, could the authors add some words regarding its sequencing length, error rate, throughput, etc.?
  2. Please unify the format of chao 1 (or chao-1).
  3. A rarefaction curve for each sample may be needed to check whether the species diversity is well represented by its raw reads.
  4. Lines 187-191: This BIN number inflation may also boil down to sequence errors introduced during PCR amplification or sequencing.
  5. Please pay attention to the citation format. For example, in line 202, reference #40 should follow the first author's name.
  6. Lines 226-227: please add some words to better explain the speculation of "passively transported by wind".
4. GigaScience 10 Oct 2022
  
  in GigaScience
  
  Abstract
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac040), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Camila Duarte Ritter
  
  The manuscript is very well written and a great contribution to the field. However, some analytical aspects need to be better described. Also, it would be great if the authors provided their R script in the supplementary material. My comments are below.
  Line 166: R2 = 0.035 is very low; this needs to be considered more carefully.
  Lines 168-171: Was the alpha diversity comparison based only on visual inspection, or was a test performed?
  Lines 173-176: Was there any test of significance? It needs to be reported.
  Lines 213-219: This is a nice discussion about local versus regional diversity, but it is very speculative and needs at least some citations to support it.
  Lines 357-358: It reduces background contamination; you can never remove all of it.
  Lines 365-367: How were the distances controlled? Was there any analysis of spatial correlation?
  Lines 367-370: Was the NMDS based on abundance or presence/absence data? If it was abundance, was any correction applied?
  Lines 374-376: How did the authors check the quality of the tree, given that it was built from a very short fragment? Did the black-box tool set all the parameters of the model?
  Line 382: Was any correction applied to the BIN table (rarefaction, Shannon entropy)? This is essential for metabarcoding data. Also, why only BIN richness? Other diversity measures could be included, such as Shannon or Fisher diversity in phyloseq, or the effective number of BINs with entropart.
  Figure 1 needs a reference to Canada to better understand where the region is.
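
The Shannon entropy and effective number of BINs mentioned under Line 382 follow the standard definitions H = -Σ p_i ln p_i and exp(H). The sketch below only illustrates those formulas. It does not reproduce phyloseq or entropart, the function names are hypothetical, and treating per-BIN read counts as abundances is an assumption that the reviews themselves debate.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy H = -sum(p_i * ln p_i) over non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def effective_number(counts):
    """Effective number of BINs (Hill number of order 1): exp(H)."""
    return math.exp(shannon_entropy(counts))
```

For a sample in which all BINs are equally abundant, the effective number equals the BIN richness. The more uneven the counts, the further it falls below richness.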
  
  Re-review:
  
  The study is very well designed and written, with good and clear results. The authors have considered all my previous comments; some additional minor comments are below.
  Lines 118-119: Species (BIN) richness is a measure of alpha diversity, and change in community composition is a measure of beta diversity.
  Lines 112-122: Malaise traps collect some local non-flying insects at random. While it is fine to argue that the catch represents the local population, I miss a discussion of this random sampling and of the fact that the absence of such insects from the samples does not necessarily mean that they are not present.
  Lines 243-246: The sentence "Although current metabarcoding protocols cannot estimate the abundance of each species" is not completely right. Many metabarcoding studies now estimate the abundance/biomass of species, so some discussion of this is necessary. Some examples (among several others):
  
  Elbrecht, V., & Leese, F. (2015). Can DNA-based ecosystem assessments quantify species abundance? Testing primer bias and biomass sequence relationships with an innovative metabarcoding protocol. PLoS ONE, 10(7), e0130324.
  Thomas, A. C., Deagle, B. E., Eveson, J. P., Harsch, C. H., & Trites, A. W. (2016). Quantitative DNA metabarcoding: improved estimates of species proportional biomass using correction factors derived from control material. Molecular Ecology Resources, 16(3), 714-726.
  Di Muri, C., Lawson Handley, L., Bean, C. W., Li, J., Peirson, G., Sellers, G. S., ... & Hänfling, B. (2020). Read counts from environmental DNA (eDNA) metabarcoding reflect fish abundance and biomass in drained ponds. Metabarcoding and Metagenomics, 4, 97-112.
  Ershova, E. A., Wangensteen, O. S., Descoteaux, R., Barth-Jensen, C., & Præbel, K. (2021). Metabarcoding as a quantitative tool for estimating biodiversity and relative biomass of marine zooplankton. ICES Journal of Marine Science, 78(9), 3342-3355.
  
  For the figures comparing the ecoregions, as there are just three, I would recommend a color-blind-safe palette; orange, yellow and green is not a good choice.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.07.05.451165v1

Abstract

This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.71), and has published the reviews under the same license. These are as follows.

Reviewer 1. John Hamilton

Are all data available and do they match the descriptions in the paper?

Yes. Downloaded and checked from the Gigabyte FTP site

Are the data and metadata consistent with relevant minimum information or reporting standards?

Yes. I was unable to check sequence data deposited in the SRA.

Is the data acquisition clear, complete and methodologically sound?

Yes, but summary tables are missing for the Illumina WGS and RNA-Seq sequencing in the manuscript.

Is there sufficient detail in the methods and data-processing steps to allow reproduction?

No. Some parts of the manuscript are very good in this respect and some parts (esp. annotation) are missing parameters and core details.

Is there sufficient data validation and statistical analyses of data quality?

No. Especially the analysis of the quality/completeness of the genome annotation.

Is the validation suitable for this type of data?

Yes. Where it is not missing, it is suitable.

Additional Comments:

In this manuscript, Canales et al. present the long-read based assembly and annotation of the genome of fever tree (Cinchona pubescens), well known as the source of quinine alkaloids traditionally used to treat malaria. This will be a genome of interest and a welcome resource for the community. I enjoyed reading this manuscript about this interesting species and I have several comments:
1. There is no summary table for the Illumina WGS and the three RNA-Seq libraries. This should be added.
2. Since you have Illumina WGS short reads, it would be informative to add a Genomescope kmer plot (http://qb.cshl.edu/genomescope/) as an additional estimate of genome size and heterozygosity to section 1.3.
3. The BUSCO metrics for the assembly are lower than expected. I believe this is due to the lack of sufficient genome polishing. Refer to the Solanum pennellii genome paper (https://doi.org/10.1105/tpc.17.00521), where they used a similar assembly strategy and discuss the need for adequate polishing (see “Prior to Polishing, Genome Error Rate Is Substantial”).
4. Section 1.6: It is noted that PASA describes transcript evidence as ESTs, which is a legacy from the time it was developed, but then the RNA-seq transcript assemblies are also described as ESTs later in the section, which is incorrect and confusing.
5. There is no assessment of the annotation, just a statement of the number of CDSs predicted. This is an issue, as the number of CDSs is far higher than reported in related species. There is no discussion of repeat masking the genome assembly, so I am assuming AUGUSTUS was run on the unmasked assembly with no downstream filtering or refinement. Doing this increases the number of TE-related gene models and annotation artifacts. As this is a data note/data release, there should really be at a minimum:
a. A table summarizing the annotation in the manuscript
b. An analysis to identify models with evidence support
c. BUSCO results for the annotation

Re-review: I’ve read the authors’ responses to all the reviewer comments and read the updated manuscript, and I am satisfied with the changes made.

Reviewer 2. Bing Bing Liu

I was very pleased to read your article on the fever tree genome, and I think it is a very valuable foundational work. The assembled genome recovered ~85% (903M or 904M, Table 1) of the estimated genome size (1.1 Gb/1C) with an N50 = 2802128 bp; 72,305 CDSs were annotated and 83% (or 87.6%, line 207) of BUSCOs were recovered, but there is a lack of clarity around these statistics in the study. It is also necessary to provide the repeat annotations, functional annotations and non-coding RNA annotations. In addition, the proportion of BUSCOs recovered is no more than 90%; you should explain this.

Minor comments
Lines 34-43: You should add the plastid genome results here.
Line 38: Check the genome size. Maybe 904M?
Lines 41 and 207: You gave two different percentages of BUSCOs; please check.
Lines 144, 145 and 149: The numbers of reads and bases do not correspond, given that the read length is 150 bp; please check.
Line 172: I have doubts about the overall mapping rate (7.34%).
Line 198: You should add a description of the genome produced by RACON.
Lines 204 and 223: Why did you use different versions of the BUSCO software?
Lines 206-207: Why did you not give the mapping ratio?
Line 243: You should provide the BUSCO result for the proteins (CDSs).
Lines 257 and 280: Why did you use different versions of MAFFT?
Line 289: Check the sentence ‘had a BS of 100%’.

Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.04.25.489452v2

www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov

svaRetro and svaNUMT: modular packages for annotating retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data

2
1. GigaScience 08 Oct 2022
  
  in Public
  
  interactive form created by the code and frictionless data package presented alongside this work [40: Dong R, Cameron D, Bedo J. Supporting data for “svaRetro and svaNUMT: modular packages for annotating retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data”. GigaScience Database, 2022; http://dx.doi.org/10.5524/102318].
  
  See GigaBlog for more: http://gigasciencejournal.com/blog/frictionless-data-interactive-figures/
2. GigaScience 08 Oct 2022
  
  in Public
  
  The remainder unreported events either had unmapped insSeqs, or undetected bps. In the online version of this paper this is presented in an interactive form created by the code and frictionless data package presented alongside this work
  
  See more in GigaBlog on how these were created: http://gigasciencejournal.com/blog/frictionless-data-interactive-figures/
Visit annotations in context

Annotators

GigaScience

URL

ncbi.nlm.nih.gov/pmc/articles/PMC9694029/
www.biorxiv.org www.biorxiv.org

svaRetro and svaNUMT: Modular packages for annotation of retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data

1
1. GigaScience 08 Oct 2022
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.70), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Surajit Bhattacharya.
  
  The authors of the manuscript have tried to address a significant problem in genomics studies, i.e. annotation of non-coding elements of the genome. The authors have built two R tools to capture two non-coding regulatory elements, retroposed transcripts and nuclear mitochondrial integrations (NUMTs). The authors have illustrated the efficiency of the tools with examples using 2 datasets, and have also benchmarked the tools against other available tools. Although the authors have performed validations, there seem to be some points that still need to be clearly elucidated.
  
  Minor Points:
  1. On line 125, "BEDPE and Pairs [28]" should be written as "BEDPE [28] and pairs".
  2. Although the authors benchmark the two tools, can they briefly compare the time taken to run their tools against the tools they are benchmarking against? For example, compare the run time of svaRetro with GRIPper and of svaNUMT with dinumt.
  3. This is not a question but more of a comment: is it possible to verify some of the novel variants identified by svaRetro and svaNUMT using PCR or any other method? This would strengthen the point that svaRetro and svaNUMT are better than the other tools.
  
  Reviewer 2. Gargi Dayama
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  Yes. Although additional clarification on features of svaRetro can be helpful
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  Yes. Additionally, it might be useful to state in the description on GitHub the R version required to install the tool (it doesn’t work with versions older than 4.1).
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  No. 1) The authors also need to benchmark their tools against other previously developed tools that they used for comparison (dinumt and GRIPper) using the simulated data. 2) Authors state they found calls that were not found by the other tool. This needs to be further tested to show the results were true positive. In fact, there is no test done to look at the false positives. Therefore, doing a test on their entire results for false positive/ true positive is essential.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  Yes. But there is a discrepancy for svaNUMT. The following command on GitHub, “NUMT <- svaNUMT::numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)”, doesn’t work. Instead, this worked: “NUMT <- svaNUMT::numtDetect(gr, max_ins_dist = 20)”.
  
  Additional comments sent in an annotated file to the author.
  
  Re-review: I feel the authors have addressed my comments. I just have one small comment about their statement in the conclusion section, lines 359-360. They state that “svaRetro and svaNUMT demonstrated good performance on simulation and human cell line datasets similar to - or in some instances outperforming - other methods without re-analysis of alignment and the use of specialized detectors”. While this statement might be all right for simulated data, based on their results in lines 309-319 in cell lines, svaNUMT seems to have an almost 50% false positive annotation rate (although with low confidence). I feel this should be addressed as a caveat in the conclusion and stated a bit more clearly as false positives in the results. Other than that, I do not have any additional comments.
  
  Reviewer 3. Raniere Gaia Costa da Silva.
  
  See the CODECHECK Certificate of independent execution https://doi.org/10.5281/zenodo.7084333
  
  See more in GigaBlog: http://gigasciencejournal.com/blog/frictionless-data-interactive-figures/
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2021.08.18.456578v1
www.biorxiv.org www.biorxiv.org

A phased, chromosome-scale genome of ‘Honeycrisp’ apple (Malus domestica)

1
1. GigaScience 07 Oct 2022
  
  in GigaByte
  
  Abstract
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.69), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Dr.Liyi Zhang
  
  ‘Honeycrisp’ is known for its exceptionally crisp and juicy texture and as a source of interesting genetic diversity in apple breeding programs worldwide. In addition, high-quality genomes are required for us to understand the genetic characteristics of a core cultivar. This study presents a fully phased, chromosome-level, high-quality apple genome with higher contiguity and completeness than previously sequenced apple genomes, and also reveals 121 ‘Honeycrisp’-specific orthogroups with a large data set, which provides a toolbox for apple genetic research and breeding.
  
  The paper is well written and the data are convincing. I therefore recommend publishing this paper ASAP.
  
  Reviewer 2. Luca Bianco
  
  Are all data available and do they match the descriptions in the paper?
  
  I could not access the BioProject data nor see the results files (i.e. fasta, gff,...) but I am confident they will be available once the paper is accepted.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  The only exception is what I mentioned regarding the haplotype separation (see general comments below).
  
  General comments: This paper describes the genome sequence of Honeycrisp, an important apple cultivar, produced with the latest sequencing technologies and assembled into phased chromosomes. In my opinion, the manuscript is well written, very interesting and certainly worth publication. There are only a few points that I would like to see addressed:
  
  1) How can you be sure that the two haplomes are a good representation of each chromosome and not a mix of the two haplotypes? In other words, have you checked that the whole sequence of each chromosome represents one phase only? It would be great if you could provide some data (e.g. SNPs,...) to support this and discuss the results obtained in this regard.
  
  2) Some additional stats regarding the obtained sequence could be added to table 2 and/or table 5 (e.g. number of Ns in the genome, how many telomeres were assembled in each chromosome, if not all telomeres were identified).
  
  3) The gene family analysis among the different apple genomes is quite interesting but rather superficial. It would be nice to dig deeper into the function of the orthogroups that are unique to Honeycrisp, describe what pathways they are involved in and so on...
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.24.505160v1

GigaScience

Annotations: 878

Joined: September 13, 2019
