930 Matching Annotations
  1. Last 7 days
    1. Editors Assessment:

In the Democratic Republic of Congo (DRC), Aedes mosquitoes are principal vectors of the arboviruses that cause yellow fever, chikungunya and dengue in the human population. However, systematic surveillance data on these species remain limited, hindering entomological and modelling research and control strategies. This paper is one of a series of Data Release papers in GigaByte, supported by TDR and the WHO, describing datasets hosted in GBIF to tackle these gaps in data on vectors of human disease. To address this data deficiency, this paper presents a geo-referenced dataset of 6,577 entomological occurrence records collected in 2024 throughout urban and peri-urban areas of Kinshasa in the Democratic Republic of Congo. The data were collected using larval dipping, human landing catches, Prokopack aspirators, and BG-Sentinel traps. Data auditing and peer review found the data well validated, but requested some additional fields and methodological details. This work and the extremely useful data provided represent an important step towards building a pan-African resource for Aedes mosquito data collection.

      This evaluation refers to version 1 of the preprint

2. Abstract: In the Democratic Republic of Congo (DRC), Aedes mosquitoes are principal vectors of medically important arboviruses, with major implications for yellow fever, chikungunya and dengue. However, systematic surveillance of these species remains limited, constrained by competing public health priorities such as malaria and other neglected tropical diseases. This gap in surveillance prevents the rapid detection of changes in their distribution, abundance and behaviour, particularly in rapidly urbanizing environments where breeding habitats are proliferating and ecological conditions are favourable for the establishment of these vectors. To address this gap, spatially explicit, small-scale data on Aedes populations in urban and peri-urban areas are needed to accurately assess transmission risk and develop targeted, evidence-based vector control strategies. Here, we present a geo-referenced dataset of 6,577 entomological occurrence records collected in 2024 throughout urban and peri-urban areas of Kinshasa city, DRC, using larval dipping, human landing catches, Prokopack aspirators, and BG-Sentinel traps. Records include Aedes albopictus (n = 2,694), Aedes aegypti (n = 1,939), Aedes vittatus (n = 2), and Aedes spp. (n = 1,942), each annotated with species, sex, life stage, reproductive status, and spatial coordinates. The dataset is published as a Darwin Core archive in the Global Biodiversity Information Facility (GBIF), and represents the most detailed, spatially explicit record of Aedes mosquito occurrence in Kinshasa to date, providing a robust foundation for entomological and modelling research to support data-driven arbovirus vector control strategies in the DRC.

      Reviewer 1. Bastien Molcrette

      Are all data available and do they match the descriptions in the paper?

      Correction needed in manuscript Table 1: row ‘Ae. spp (*unid)’ column ‘total’ should be 1942 (instead of 1932). Additional Comments: Aedes vittatus has only been observed and characterized twice in a full year, among 6577 samples: how confident are you that these samples have been correctly classified? Are there any other references for the observation of Aedes vittatus around Kinshasa?

      The full data review and audit is here: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZV9pZD02NDAmZmlsZT0yODAmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ==

      Reviewer 2. Paul Taconet

      Is the data acquisition clear, complete and methodologically sound?

      No. See attached.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. See below.

      Additional Comments: This data paper presents a valuable contribution, and the effort invested in publishing such a dataset is both commendable and highly appreciated. It represents an important step towards building a pan-African resource for Aedes mosquito data collection.

      Overall, the paper and dataset are highly promising, but clarifying the sampling design and improving metadata consistency will significantly enhance their usability and scientific value.

      Major comments:

      The main point of confusion concerns the geographical definition of the sampling sites. In the manuscript, it is stated that “within each area, two sampling sites were selected.” This suggests a total of four sampling sites (2 areas × 2 sites each). However, elsewhere the text mentions “adults collected from different households for each of the three sampling techniques,” which implies three households per area (i.e., three sites).

      In contrast, the dataset appears to include only two sampling points (one per area), each with extremely precise geographic coordinates (six decimal places, implying sub-meter accuracy). This suggests that collections were made at identical locations, contradicting the description in the paper (two sites, multiple households, etc.).

To resolve this inconsistency, clarification is needed both in the paper and in the dataset.

      Minor comments (manuscript):

      • In the “Mosquito collection” section, please provide more detail about the sampling schedule (e.g., total number of sessions for each technique, average sampling frequency, etc.).
• In Table 2, define precisely how the dry and rainy seasons were determined (e.g., based on calendar months, rainfall thresholds, or other criteria).
      • The dataset contains information on mosquito sex and feeding status, yet the paper does not describe how these were determined. Please add methodological details.
      • Indicate how far apart the sampled households were located, since simultaneous sampling at nearby sites could bias results.
      • Typographical corrections:
• Introduction: “entomological occurrence records collected in 20224”: the year should read “2024”.
• Introduction: “spatially explicit record of Aedes mosquito occurrence in Kinshasa to data”: “to data” should read “to date”.
      • Methods: “Water from each breeding sites was using with a ladle...” → revise wording for clarity.

      Comments on the dataset:

      • For completeness, the event table could include additional fields such as habitat, samplingEffort (especially relevant for adult collection), sampleSizeValue, and sampleSizeUnit. These details are already provided in the paper and could easily be added to the GBIF dataset.
      • In the occurrence table, the entries under ScientificName are currently generic (e.g., “Aedes albopictus” should be written as Aedes albopictus (Skuse, 1895)). Consider renaming the current column as genericName and adding a proper ScientificName column with complete taxonomic names.
• The use of MaterialSample as the basisOfRecord seems questionable. According to community discussions (e.g., https://discourse.gbif.org/t/understanding-basis-of-record/5857), HumanObservation would be more appropriate in this case; a sketch of a record combining these suggestions follows this list.
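As an illustration of these suggestions, below is a minimal Python sketch of one occurrence row. The Darwin Core term names (basisOfRecord, scientificName, genericName, habitat, samplingEffort, sampleSizeValue, sampleSizeUnit) come from the published standard, while all values are invented placeholders rather than entries from the actual dataset.

```python
import csv

# One illustrative occurrence row carrying the fields suggested in the review.
# Term names follow Darwin Core; the values are hypothetical examples only.
row = {
    "basisOfRecord": "HumanObservation",                  # rather than MaterialSample
    "scientificName": "Aedes albopictus (Skuse, 1895)",   # full name with authorship
    "genericName": "Aedes",                               # genus-level name kept separately
    "habitat": "peri-urban household",                    # illustrative value
    "samplingEffort": "1 BG-Sentinel trap for 24 h",      # illustrative value
    "sampleSizeValue": "1",
    "sampleSizeUnit": "trap-day",
}

with open("occurrence_example.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=row.keys())
    writer.writeheader()
    writer.writerow(row)
```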
  2. Sep 2025
    1. Editors Assessment:

This paper presents the genome sequencing of the house sparrow (Passer domesticus), carrying out genome assembly and annotation using in silico approaches, with tools that could provide a valuable resource for understanding passerine evolution, biology, ethology, geography, and demography. The final genome assembly was generated using short-read sequencing and a computational workflow that included Shovill, SPAdes, MaSuRCA, and BUSCO benchmarking, producing a 922 Mb reference genome with 24,152 genes. The first draft was significantly smaller than this, but peer review provided suggestions on how to improve the assembly quality, and after a few attempts an assembly with a reasonable size and BUSCO score was achieved. These openly available data can potentially serve as a valuable resource for studying the adaptation, divergence, and speciation of birds.

      This evaluation refers to version 2 of the preprint

2. Abstract: The common house sparrow, Passer domesticus, is a small bird belonging to the family Passeridae. Here, we provide high-quality whole genome sequence data along with an assembly for the house sparrow. The final genome was assembled using a Shovill/SPAdes/MaSuRCA/BUSCO workflow, consisting of contigs spanning 268,193 bases and coalescing around a 922 Mb reference genome. We employed rigorous statistical thresholds to check the coverage, as the Passer genome showed considerable similarity to the Gallus gallus (chicken) and Taeniopygia guttata (zebra finch) genomes, and we also provide a functional annotation. This new annotated genome assembly will be a valuable resource as a reference for comparative and population genomic analyses of passerine, avian, and vertebrate evolution. Significance: Avian evolution has been of great interest in the context of extinction. Annotating the genomes of taxa such as passerines would be of significant interest, as we could understand their behaviour/foraging traits and further explore their evolutionary landscape. In this work, we provide a full genome sequence of the Indian house sparrow, viz. Passer domesticus, which will serve as a useful resource in understanding adaptability, evolution, geography, Allee effects and circadian rhythms.

This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.161), and the reviews are published under the same license.

      Reviewer 1. Gang Wang

Is the language of sufficient quality? Yes, although there are many small errors in the article, such as citation format and spelling, e.g. “[Supplementary Table 3a, 3b, 3c)” should read “(Supplementary Table 3a, 3b, 3c)”. The citation format also needs to be adjusted to the journal requirements.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. A previous reviewer mentioned that RagTag could be used to improve the quality of genome assembly. I suggest you seriously consider this.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No

Overall Comments: The article is logically clear and the analysis is complete. The description of both sample collection and sequencing is relatively clear, and the analysis process shown in Figure 1 is also very reasonable. However, as described by the previous reviewer, I suggest that you remove the claim of "high-quality". There are many small errors in the article, such as citation format and spelling, e.g. “[Supplementary Table 3a, 3b, 3c)” should read “(Supplementary Table 3a, 3b, 3c)”; the citation format also needs to be adjusted to the journal requirements. In Figure 2, the lettering of panels a and b is too different; please unify it. Figure 4 is completely unclear; please increase the font size. A previous reviewer mentioned that RagTag could be used to improve the quality of the genome assembly; I suggest you seriously consider this. Re-review: The authors used FCS-GX to exclude contaminating sequences in the genome, so I agree that this paper should be published.

      Reviewer 2. Agustin Ariel Baricalla

Are all data available and do they match the descriptions in the paper? No. Matching data: NCBI project with access to the NCBI-SRA deposited raw data. Non-matching data: Oxford Nanopore data: the authors' reply to a previously submitted manuscript argues that these data were not used, but Fig. 1 refers to Nanopore MinION data. The manuscript body and the additional data section do not include the QUAST and BUSCO reports or their corresponding plots.

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No. GigaByte suggests a checklist including the genome, CDS, and proteins in FASTA format, as well as the annotations in GFF format; however, these items are not available for evaluation.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. The FastP step for raw data processing is mentioned in the results section but is not detailed in the methods section.

Is there sufficient data validation and statistical analyses of data quality? No. The authors have not included the BUSCO results. The OrthoDB database for 'passeriformes_odb12' contains over 10,000 curated genes, representing approximately 50-60% of the total genes in a typical passeriform genome; therefore, the BUSCO report for the new assembly should be provided. The authors mention that "The gene completeness for Passer was assessed through Benchmarking Universal Single-Copy Orthologs (BUSCO version 5.5.0) [26] by using the orthologous genes in the Gallus gallus [chicken] genome", but BUSCO runs against OrthoDB datasets, so I do not understand what this phrase refers to.
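For context, BUSCO is normally pointed at an OrthoDB lineage dataset rather than a single reference species. Below is a minimal sketch of such an invocation, assuming BUSCO is on the PATH; the input and output names are placeholders, not the authors' actual command.

```python
import subprocess

# Illustrative BUSCO run against an OrthoDB lineage dataset (not a chicken genome).
# File names and the output label are hypothetical placeholders.
subprocess.run(
    [
        "busco",
        "-i", "passer_domesticus_assembly.fasta",  # assembled genome (assumed name)
        "-l", "passeriformes_odb10",               # OrthoDB lineage (odb12 with newer releases)
        "-m", "genome",                            # genome mode
        "-o", "busco_passer",                      # output directory name
    ],
    check=True,
)
```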

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes. All the procedures are consistent and the programs or pipelines are well-known and well documented in the bioinformatic and genomic fields.

Additional Comments: The inclusion of the mitochondrial genome represents a significant improvement in this manuscript. I recommend presenting all nuclear results together first, followed by a separate and clear description of the mitochondrial analysis and findings, to enhance clarity. The data are interesting for analyzing the genetic dynamics behind Passer domesticus adaptation and evolution, and could show differences from the previously available genomes based on a European reference sample, but this is not presented in this work. As of this revision, NCBI's Passer domesticus resources include two European reference genomes, both classified at 'chromosome' assembly level (NCBI: GCF_036417665.1 and GCA_001700915.1). These genomes can be utilized in two distinct ways: (1) performing a 'genome-guided assembly' with MaSuRCA, using one of these genomes alongside the Illumina data, or (2) conducting genome scaffolding by employing one of these genomes as a reference and the assembled genome from raw reads as a query, using tools like RagTag or the chromosome scaffolder available in MaSuRCA. Both approaches could potentially lead to improvements in scaffold number and contiguity metrics, such as N50, N90, and the largest scaffold.
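To make the second route concrete, here is a minimal sketch of reference-guided scaffolding with RagTag, assuming the European reference assembly has been downloaded locally; the file names are placeholders, not the authors' actual files.

```python
import subprocess

# Illustrative reference-guided scaffolding with RagTag.
# Both input file names are hypothetical placeholders.
subprocess.run(
    [
        "ragtag.py", "scaffold",
        "GCF_036417665.1_reference.fasta",  # European reference assembly (assumed local file)
        "passer_draft_contigs.fasta",       # draft assembly built from the Illumina reads
        "-o", "ragtag_scaffolds",           # output directory
    ],
    check=True,
)
```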

Re-review: The authors have subtly improved the previously presented version, but have not managed to meet the minimum standards established by the publisher for publication in the journal. Easily achievable changes were requested to complement the analysis previously made, and these have been ignored. Requests have not been answered, figures that generate confusion between themselves and the text have not been fixed, and no relevant improvement between the previous and current versions has been shown.

  3. Aug 2025
1. Abstract: Rice (Oryza sativa) is one of the most important staple food crops worldwide, and its wild relatives serve as an important gene pool in its breeding. Compared with cultivated rice species, African wild rice (Oryza longistaminata) has several advantageous traits, such as increased biomass production, clonal propagation via rhizomes, and resistance to biotic stresses. However, previous O. longistaminata genome assemblies have been hampered by gaps and incompleteness, restricting detailed investigations into their genomes. To streamline breeding endeavors and facilitate functional genomics studies, we generated a 343-Mb telomere-to-telomere (T2T) genome assembly for this species, covering all telomeres and centromeres across the 12 chromosomes. This newly assembled genome is markedly improved over previous versions. Comparative analysis revealed a high degree of synteny with previously published genomes. A large number of structural variations were identified between O. longistaminata and O. sativa. A total of 2,466 segmentally duplicated genes were identified and found to be enriched in cellular amino acid metabolic processes. We detected a slight expansion of some subfamilies of resistance genes and transcription factors. This newly assembled T2T genome of O. longistaminata provides a valuable resource for the exploration and exploitation of beneficial alleles present in wild relatives of cultivated rice.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf074), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Chengzhi Liang

      The authors generated a 343-Mb telomere-to-telomere (T2T) genome assembly for an African wild rice (Oryza longistaminata), covering all telomeres and centromeres across the 12 chromosomes, and performed genome annotation and analyses on structural variations and NLR genes. While the manuscript has provided a valuable genome sequence, several problems should be addressed before the manuscript can be published.

Major issues:

1. The authors estimated that the genome heterozygosity is 1.27%, which is quite high, so I am wondering how large the assembled genome size is using only HiFi data, which could reflect the actual heterozygosity rate of the genome, particularly by comparing it with the final genome size of the 12 chromosomes. If there was only one gap in the initial Hifiasm assembly (a total of 13 contigs), it is unlikely that the genome has such a high heterozygosity. In Table 1, the total size of the assembled genome was 331,045,917 bp. If this is the summed size of the 12 chromosomes, it should be used as the final genome size in the main text. Please clarify. Also, what is the base accuracy of the ultra-long CycloneSEQ data? This would be useful to readers, as it is a new sequencing technology.

2. For SV detection, considering that the assembled genome in the manuscript (does it have an accession ID or name?) is an African wild rice, it is rather strange that the authors did not compare it with an O. glaberrima genome, but with an O. sativa genome. Meanwhile, the names of the genomes should be mentioned, since there are so many different genomes in each species, all with different SVs between them.

3. The conclusion that "This distribution suggests that chromosomes 1, 4, 3, and 2 might have contributed to the evolution of rice in previously unrecognized ways (Table S8)" is purely speculative, and thus should be removed from the manuscript, or the authors should provide more evidence to support it.

4. The authors claimed that "Compared with other Oryza species, O. longistaminata has many fewer NBS-LRR domain genes, which reflects a contraction of resistance genes in this species." Please give specific gene numbers for each species. Meanwhile, the conclusion does not look right here, since it appears that O. longistaminata had more NBS-LRR genes than other species.

Minor issues:

1. What is "quartets"?

2. The authors used "11 Oryza species", which included O. indica; please clarify what this species is.

2. Abstract: Rice (Oryza sativa) is one of the most important staple food crops worldwide, and its wild relatives serve as an important gene pool in its breeding. Compared with cultivated rice species, African wild rice (Oryza longistaminata) has several advantageous traits, such as increased biomass production, clonal propagation via rhizomes, and resistance to biotic stresses. However, previous O. longistaminata genome assemblies have been hampered by gaps and incompleteness, restricting detailed investigations into their genomes. To streamline breeding endeavors and facilitate functional genomics studies, we generated a 343-Mb telomere-to-telomere (T2T) genome assembly for this species, covering all telomeres and centromeres across the 12 chromosomes. This newly assembled genome is markedly improved over previous versions. Comparative analysis revealed a high degree of synteny with previously published genomes. A large number of structural variations were identified between O. longistaminata and O. sativa. A total of 2,466 segmentally duplicated genes were identified and found to be enriched in cellular amino acid metabolic processes. We detected a slight expansion of some subfamilies of resistance genes and transcription factors. This newly assembled T2T genome of O. longistaminata provides a valuable resource for the exploration and exploitation of beneficial alleles present in wild relatives of cultivated rice.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf074), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Francois Sabot

The manuscript from Guang et al. deals with a T2T assembly for the wild perennial African rice Oryza longistaminata. Using the latest technologies and approaches, the authors provide a high-quality assembly for this wild species, rendering it a valuable resource for understanding rice evolution. While the assembly results are of high quality, the interpretation of some biological results, in particular about the NBS-LRRs, is quite weird, in my opinion, and needs to be more refined. That is why I think the manuscript should be published, but after major corrections.

In detail:

- Introduction: I am not sure the exceptional biomass is a good selling point for O. longistaminata, as this plant has a very high silicon content, making its biomass complex to use.
- Methods: We do not have access to most of the command options and command lines; please provide them at least as a text file in the supplementary data. In addition, some of the references for tools are missing. Finally, please provide the accession number of the assembled plant.
- Assembly itself: O. longistaminata is an outcrossing, heterozygous organism. Did you obtain the two haplotypes?
- Comparison with the previous O. longistaminata genome: is the inversion in the middle of Chr6 specific, or due to an error in the previous assembly?
- Table 1: what do you mean by "Total size of assembled genomes (bp) 331,045,917"? What is the residual percentage of N?
- Figure 1 and others: please set the legends apart in some other way; here they may be mixed up with the main text. In addition, check the legends for spelling and the size of the figures (e.g. 3b) for legibility.
- SyRI/MUMmer analysis: did you set the minimum size at 1 kb? What was the order of query vs. reference? Can we have a BED file with the positions?
- SDs: is there a statistical link between chromosome size and number of SDs? It could explain why the first four chromosomes have more SDs. In general, the data are missing statistics.
- GO in SDs: any statistical validation?
- Genome comparisons: please provide the accession numbers of the genomes you used for comparison.
- NBS-LRRs: the O. longistaminata genome has 215 such genes versus 116 to 289 for other Oryza, so I cannot see any contraction or expansion. In addition, the text here is odd, starting with contraction and then moving to expansion.
- TF analysis: the African assemblies are quite poor, I think, which explains the discrepancy. For O. glaberrima, did you check the assembly from Tranchant-Dubreuil et al., 2023?

1. Abstract: Background: The central bearded dragon (Pogona vitticeps) is widely distributed in central eastern Australia and adapts readily to captivity. Among other attributes, it is distinctive because it undergoes sex reversal from ZZ genotypic males to phenotypic females at high incubation temperatures. Here, we report an annotated telomere-to-telomere phased assembly of the genome of a female ZW central bearded dragon. Results: The genome assembly length is 1.75 Gbp with a scaffold N50 of 266.2 Mbp, N90 of 28.1 Mbp, 26 gaps and 42.2% GC content. Most (99.6%) of the reference assembly is scaffolded into 6 macrochromosomes and 10 microchromosomes, including the Z and W microchromosomes, corresponding to the karyotype. The genome assembly exceeds the standards recommended by the Earth Biogenome Project (6CQ40): 0.003% collapsed sequence, 0.03% false expansions, 99.8% k-mer completeness, 97.9% complete single copy BUSCO genes and an average of 93.5% of transcriptome data mappable back to the genome assembly. The mitochondrial genome (16,731 bp) and the model rDNA repeat unit (length 9.5 kbp) were assembled. The male vertebrate sex genes Amh and Amhr2 were discovered as copies in the small non-recombining region of the Z chromosome, absent from the W chromosome. This, coupled with the prior discovery of differential Z and W transcriptional isoform composition arising from the pseudoautosomal sex gene Nr5a1, suggests that complex interactions between these genes, their autosomal copies and their resultant transcription factors and intermediaries determine sex in the bearded dragon. Conclusion: This high-quality assembly will serve as a resource to enable and accelerate research into the unusual reproductive attributes of this species and for comparative studies across the Agamidae and reptiles more generally. Species Taxonomy: Eukaryota; Animalia; Chordata; Reptilia; Squamata; Iguania; Agamidae; Amphibolurinae; Pogona; Pogona vitticeps (Ahl, 1926) (NCBI:txid103695).

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf085), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Yuan Li

      The authors de novo assembled a telomere to telomere phased genome assembly of the Australian central bearded dragon Pogona vitticeps, using PacBio HiFi, ONT, HiC, and Illumina sequencing platforms. The assembly achieves remarkable contiguity (scaffold N50: 266.2 Mb) and completeness (97.9% BUSCO score), surpassing Earth Biogenome Project standards. The phased assembly of sex chromosomes (Z/W) and identification of candidate sex-determining genes (Amh, Amhr2, and Nr5a1) provide valuable insights into reptilian sex determination. Overall, the study is well-executed and provides a valuable resource for comparative genomics and reproductive biology.

Major concerns:

1. The description of read depth has errors at lines 401-402, such as "60.6x". In addition, "4 x promethION" and "2x150 bp" should be revised; please check and revise all similar descriptions in the manuscript.

2. There are errors in the citation format of the journal references, such as the absence of the punctuation mark "." between the title and the journal name at lines 1005-1009, and mixing of abbreviations (e.g., "PNAS" vs. "Proceedings of the National Academy of Sciences USA") (lines 988-990, 1005-1009). Please check carefully the format of all references.

3. The scripts "calculateGC.py" and "processtrftelo.py" (lines 242 and 245) are mentioned without code availability or parameter details. Provide working links or repository access.

4. The inconsistent use of "Gb" and "Gbp" is observed; it is recommended to adopt a unified notation.

5. Units are missing in multiple places in Tables 1 and 2, such as the units for "Total Bases" and "Assembly length"; please include them.

6. At lines 683-687, the conclusion that Amh/Amhr2 are sex-determining genes relies solely on positional evidence. Discuss the need for functional studies (e.g., CRISPR knockouts) to strengthen the claims.

7. There are errors in "Vasimuddin et al. 2019" (line 238) and "Danecek et al. 2021" (line 239). Please check the format of all other references.

8. At lines 476-481, BAC mappings are cited as validation but lack visual evidence (e.g., alignment plots in figures or supplements). Please verify the accuracy of Figure 7 at line 478, as it does not correspond with the description.

2. Abstract: Background: The central bearded dragon (Pogona vitticeps) is widely distributed in central eastern Australia and adapts readily to captivity. Among other attributes, it is distinctive because it undergoes sex reversal from ZZ genotypic males to phenotypic females at high incubation temperatures. Here, we report an annotated telomere-to-telomere phased assembly of the genome of a female ZW central bearded dragon. Results: The genome assembly length is 1.75 Gbp with a scaffold N50 of 266.2 Mbp, N90 of 28.1 Mbp, 26 gaps and 42.2% GC content. Most (99.6%) of the reference assembly is scaffolded into 6 macrochromosomes and 10 microchromosomes, including the Z and W microchromosomes, corresponding to the karyotype. The genome assembly exceeds the standards recommended by the Earth Biogenome Project (6CQ40): 0.003% collapsed sequence, 0.03% false expansions, 99.8% k-mer completeness, 97.9% complete single copy BUSCO genes and an average of 93.5% of transcriptome data mappable back to the genome assembly. The mitochondrial genome (16,731 bp) and the model rDNA repeat unit (length 9.5 kbp) were assembled. The male vertebrate sex genes Amh and Amhr2 were discovered as copies in the small non-recombining region of the Z chromosome, absent from the W chromosome. This, coupled with the prior discovery of differential Z and W transcriptional isoform composition arising from the pseudoautosomal sex gene Nr5a1, suggests that complex interactions between these genes, their autosomal copies and their resultant transcription factors and intermediaries determine sex in the bearded dragon. Conclusion: This high-quality assembly will serve as a resource to enable and accelerate research into the unusual reproductive attributes of this species and for comparative studies across the Agamidae and reptiles more generally. Species Taxonomy: Eukaryota; Animalia; Chordata; Reptilia; Squamata; Iguania; Agamidae; Amphibolurinae; Pogona; Pogona vitticeps.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf085), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Heiner Kuhl

Patel et al. present a genome assembly of the bearded dragon Pogona vitticeps, a lizard species that is widely kept as a pet and known for its interesting sex determination, which may switch from genetic sex determination (ZW) to temperature-dependent sex reversal. The methods chosen to assemble the genome are state-of-the-art, including HiFi and ONT long reads, Hi-C, and suitable bioinformatic tools.

I have to admit that I have recently been reviewing a similar manuscript for GigaScience (https://www.biorxiv.org/content/10.1101/2024.09.05.611321v1), in which a female ZZ P. vitticeps had been sequenced and assembled from long-read data of a different nanopore technology, and analysis of the ZW chromosomes was done by short-read coverage analysis. One of my major comments was that this approach lacked a true assembly of the W chromosome. Thus, I am happy to see that the assembly of the W-specific region has been achieved here, and the sequencing technologies used might even improve the assembly quality over the ZZ assembly in terms of phasing, consensus accuracy, etc. The two manuscripts are highly complementary and I think they should be published, if possible, in the very same issue of GigaScience. Surely both groups have invested a lot of effort. (Reading L. 685, I have just realized that this seems to be the intention of the journal, and I very much support this idea.)

      Still there are some minor points that need improvement for the current manuscript:

Why do you leave the Z and W split into PAR, Z-specific and W-specific scaffolds and not assemble the full-length chromosomes (L. 676)? Would the Hi-C data not support that?

Mitochondrial assembly: from ONT data only (L. 307); please do a consensus correction with Illumina data, or at least show that the MT assembly has a high consensus accuracy (Q40-Q50).

Genome annotation: show BUSCO scores for the annotated proteins (do they agree with BUSCO run on the whole genome?). If possible, compare to the results of the NCBI RefSeq annotation (is it already available?). In this regard, please explain the relatively low mapping rates (L. 647) of RNA-seq to the annotated sequences.

Could you provide some expression data for the Z-specific Amh and Amhr2? Are they differentially expressed in testis/ovary (after correction for copy number)?

Table 1: could you show results for the two different ONT library types (ligation vs. ultra-long kit)? It seems the overall yield was low (5 cells -> 100 Gb); any speculation as to why?

I think the assembly statistics (Table 2) should also contain the contig N50 length as an additional value to show the high contiguity of the assembly.

L. 488: "48.36 (1 error in 146kb)": I think something is wrong here, as Q48.36 would be 1 error in 68.5 kb. I would suggest re-checking these values and incorporating them in Table 2. The high consensus accuracy is one selling point compared to the competitor's assembly.
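The reviewer's arithmetic follows directly from the Phred scale, where quality Q implies an error probability of 10^(-Q/10); a quick Python check:

```python
import math

# Phred quality Q corresponds to an error probability p = 10**(-Q/10),
# i.e. one error every 1/p bases on average.
def bases_per_error(q: float) -> float:
    return 10 ** (q / 10)

print(bases_per_error(48.36))        # ~68,500 bases per error, as the reviewer notes
print(10 * math.log10(146_000))      # the Q that "1 error in 146 kb" implies: ~51.6
```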

      L. 490: "Individual haplotypes were 85.5% complete…". Explain why you are confident that the haplotypes are more complete than the Merqury results suggest (just one sentence).

1. Abstract: Background: The agamid dragon lizard Pogona vitticeps is one of the most popular domesticated reptiles kept as pets worldwide. The capacity to breed in captivity also makes it an emerging model species for a range of scientific research, especially for studies of sex chromosome origin and sex determination mechanisms. Results: By leveraging the CycloneSEQ and DNBSEQ sequencing technologies, we conducted whole genome and long-range sequencing of a captive-bred ZZ male to construct a chromosome-scale reference genome for P. vitticeps. The new reference genome is ∼1.8 Gb in length, with a contig N50 of 202.5 Mb and all contigs anchored onto 16 chromosomes. Genome annotation assisted by long-read RNA sequencing greatly expanded the P. vitticeps lncRNA catalog. With the chromosome-scale genome, we were able to characterize the whole Z sex chromosome for the first time. We found that over 80% of the Z chromosome remains a pseudo-autosomal region (PAR) where recombination is not suppressed. The sexually differentiated region (SDR) is small and occupied mostly by transposons, yet it aggregates genes involved in male development, such as AMH, AMHR2 and BMPR1A. Finally, by tracking the evolutionary origin and developmental expression of the SDR genes, we propose a model for the origin of the P. vitticeps sex chromosomes which considers the Z-linked AMH as the master sex-determining gene. Conclusions: Our study provides novel insights into the sex chromosome origin and sex determination of this model lizard. The near-complete P. vitticeps reference genome will also benefit future studies of amniote evolution and may facilitate genome-assisted breeding. Competing Interest Statement: The authors have declared no competing interests.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf079), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Heiner Kuhl

Guo et al. present a new reference genome for Pogona vitticeps, a widespread reptile model organism that is also common as a domestic animal worldwide. The genome assembly shows much improvement over an older assembly from 2017. Two points make this manuscript stand out from common genome assembly papers:

1. The authors find a new sex-determination locus in this species.
2. The authors use a new nanopore sequencing technology ("CycloneSEQ"), which has so far only been described in a preprint (https://www.biorxiv.org/content/10.1101/2024.08.19.608720v1).

In my opinion this deserves publication in GigaScience, but both points must be brought into sharper focus in a revised manuscript.

      Major comments:

1) The authors have sequenced a male individual (ZZ), which means the long-read reference assembly is missing the W chromosome. PAR and SDR regions are deduced from the Z sequence by analysis of sequencing coverage of only a few sexed samples (2 females and 4 males). It is unclear whether these individuals are from the same family, which could mean that the newly found SD region is just a family-specific variation. To make the whole story more intriguing and statistically sound, the authors should at least test 15 males and 15 females from different P. vitticeps populations for W-specific markers near the proposed AMH deletion. The authors should also show that the previously proposed SD locus (nr5a1) does not carry W-specific mutations in these 15+15 individuals. Furthermore, a phased assembly of a female (ZW) Pogona vitticeps individual could enable the assembly of the missing W chromosome and should be included; it would even improve the analysis of W-specific sequences in the proposed additional individuals.
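As one way to make such a marker survey statistically sound, presence/absence of a candidate W-specific marker in sexed individuals could be tested with Fisher's exact test; the counts below are invented purely for illustration, not real data.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for a candidate W-specific marker.
# Rows: marker present / absent; columns: females (ZW) / males (ZZ).
table = [[15, 0],   # marker detected in all 15 females, no males (idealized outcome)
         [0, 15]]   # marker absent in all 15 males
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact p = {p_value:.2e}")  # ~1e-8 for this idealized outcome
```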

2) A technology-aware reader would like to see more information on the specifics of CycloneSEQ data quality and handling, and perhaps a comparison to competing technologies. Which enzymes and buffers were used to prepare the library? In the methods sections there are only superficial descriptions (DNA repair buffer/enzyme, DNA clean beads, wash buffer for long fragments). Is it a kit, or were the enzymes and buffers purchased individually? I cannot find the procedure for preparation and sequencing of the long-read cDNA libraries. How many flow cells were needed to generate the different datasets? What do the read-length distributions look like (statistics over all reads, not only the selected 40 kb+)? What was the variability between those runs, especially cumulative output over time? What hardware was needed to run the basecalling, and what was the runtime? What is the Q-value distribution of the reads? Why is the consensus accuracy of the assembly low (Q36.4)? Can it be improved? Typically, reference-quality genomes should have Q40+. Which regions of the genome display lower consensus accuracy (is it random or sequence-specific)?

      Minor comments:

L. 900: PRJNAxxxxxx looks like a placeholder; please insert the true number.

2. Abstract: Background: The agamid dragon lizard Pogona vitticeps is one of the most popular domesticated reptiles kept as pets worldwide. The capacity to breed in captivity also makes it an emerging model species for a range of scientific research, especially for studies of sex chromosome origin and sex determination mechanisms. Results: By leveraging the CycloneSEQ and DNBSEQ sequencing technologies, we conducted whole genome and long-range sequencing of a captive-bred ZZ male to construct a chromosome-scale reference genome for P. vitticeps. The new reference genome is ∼1.8 Gb in length, with a contig N50 of 202.5 Mb and all contigs anchored onto 16 chromosomes. Genome annotation assisted by long-read RNA sequencing greatly expanded the P. vitticeps lncRNA catalog. With the chromosome-scale genome, we were able to characterize the whole Z sex chromosome for the first time. We found that over 80% of the Z chromosome remains a pseudo-autosomal region (PAR) where recombination is not suppressed. The sexually differentiated region (SDR) is small and occupied mostly by transposons, yet it aggregates genes involved in male development, such as AMH, AMHR2 and BMPR1A. Finally, by tracking the evolutionary origin and developmental expression of the SDR genes, we propose a model for the origin of the P. vitticeps sex chromosomes which considers the Z-linked AMH as the master sex-determining gene. Conclusions: Our study provides novel insights into the sex chromosome origin and sex determination of this model lizard. The near-complete P. vitticeps reference genome will also benefit future studies of amniote evolution and may facilitate genome-assisted breeding. Competing Interest Statement: The authors have declared no competing interests.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf079), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Nazila Koochekian

Impressive work, but it needs major revision to be accepted. The authors compressed everything into the results section and did not put enough effort into the other sections. The introduction and discussion need major changes and more detail regarding many aspects of the study that appear in the results. The methods need rearrangement; it is common to keep the order of methods as first DNA extraction, then sequencing, and so on. The data availability needs to be completed: BioSamples for each sequenced tissue, all the reads, and even intermediate assemblies need to be submitted to the database and reported in the manuscript. More specific comments are on the copy of the manuscript attached for the authors.

1. ABSTRACT: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community. Practitioner Points: SeuratExtend streamlines scRNA-seq workflows by integrating R and Python tools, multiple databases (e.g., GO, Reactome), and comprehensive functional analysis capabilities within the Seurat framework, enabling efficient, multi-faceted analysis in a single environment. Advanced visualization features, including optimized plotting functions and professional color schemes, enhance the clarity and impact of scRNA-seq data presentation. A novel clustering approach using pathway enrichment score-cell matrices offers new insights into cellular heterogeneity and functional characteristics, complementing traditional gene expression-based analyses.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf076), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Daniel A. Skelly

      Overall, this is a very nice writeup of a useful package that extends the Seurat package to expand possibilities for single cell analysts in R. I liked the visualization options, the ability to try certain python-based tools easily in R which was not previously easy, and some of the authors' new innovations like their use of pathway enrichment scores in broad ways. Kudos to the authors for releasing a package with really excellent documentation and tutorials!

      I think this paper could be made better if the authors stressed with a little more clarity how specifically their work is innovative. The text in the present manuscript is fine but reads like a bit of a grab bag of functionality. For example, from the abstract: "SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package integrates multiple databases, … and incorporates popular Python tools … [We] showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend enhances data visualization …"

How could they be more clear or specific? One example could be by categorizing what SeuratExtend can do that other packages can't. For example, I see innovations in perhaps three general areas: (1) making single-cell analyses easier/faster/prettier (i.e. visualizations, pathway enrichment); (2) making previously published single-cell tools more broadly accessible (e.g. the first option to bring certain Python tools to R); (3) new innovations (e.g. dimensionality reduction and clustering based on pathway enrichment scores; this may not be completely new, but I don't recall seeing it elsewhere). If this were added, I feel the paper would more clearly communicate to readers the information necessary for them to choose whether they want to try the package.

I have the following additional significant comments:

* Integration of multiple databases for GSEA: these methods are good, but what about in a few years when those databases have been updated? Do the authors intend to continue updating? Could they provide a function for users to use their own database (e.g. .gaf and .obo files, for example for another model organism)? A similar comment applies to gene identifier conversion, which may need to be updated every few years.
* "While the Python ecosystem has benefited greatly from the comprehensive scverse project [7], which utilizes the universal AnnData format to connect various tools and algorithms, a comparable integrated solution has been lacking in the R community. SeuratExtend addresses this gap by providing a unified framework centered around the Seurat object, effectively becoming the R counterpart to scverse." Some might argue that SeuratWrappers is this solution. The authors should more clearly and explicitly comment on what SeuratExtend does differently/better than SeuratWrappers.
* I'm not particularly convinced by the authors' example studies that used SeuratExtend. For example, they describe Hua-Vella et al. (2022) and Hua et al. (2023). These are very nice studies, and I have no doubt they made use of SeuratExtend in their analyses. But I don't see anything those authors are described as doing as being uniquely possible with SeuratExtend. Perhaps SeuratExtend made their analyses easier, or faster, but it would be better if we had some further concrete details, for example something communicating a message like one of the following: (1) the authors only tested method X on a whim because it was so easy to run in SeuratExtend, and found that it revealed unexpected biology Y; or (2) the authors were able to bring together method X, which runs in R, and method Y, which runs in Python, and the joint inference, not possible in other packages, revealed key result Z. If the authors of this manuscript can't point to those sorts of examples, then I'm not sure this discussion adds much to the present paper.
* I really liked the section "Novel Applications of SeuratExtend in Pathway-Level Analysis and Cluster Annotation", especially "Exploring and Analyzing Single-Cell Data at the Pathway Level". I thought these applications could perhaps be stressed a bit more strongly or made more prominent earlier in the paper.
* Figures 2 and 3 show example plots from which we don't actually need to infer any important biology. I thought these figures could be combined and each individual plot type shown only once. (This is for clarity; I don't see anything incorrect about the authors' current plots.)
* There may be some issues with dependencies for some users. For example, I was prompted to install viridis and loomR as I went through the Quickstart. I encountered an error ("there is no package called 'loomR'") and had to install it manually with remotes::install_github(repo = "mojaveazure/loomR"). Maybe provide an explicit list of dependencies/recommended packages to install?
* I had an error the first time I called Palantir.RunDM(), because I hadn't created a seuratextend environment. I found that I could do this manually using create_condaenv_seuratextend(), but that this wasn't supported on Apple Silicon chips. I would suggest that the authors try to find a way to get this working on newer Apple chips, because Mac machines are very common among bioinformaticians in my experience.
* While the writing is largely quite clear, I found it a bit voluminous. If the authors are able to cut down on the text length, that may help emphasize the key points that make their package valuable to users.

I had these minor comments:

* "Moreover, mainstream scRNA-seq analysis tools are primarily developed for either the R or Python platforms, with additional options like Nextflow and Snakemake": I suggest revising this sentence. The tools are developed in the R or Python languages, which I would not call platforms. I would reword to say that Nextflow and Snakemake are workflow management systems that provide additional options for pipeline automation.
* "the R ecosystem surrounding Seurat appears relatively limited": I'm not sure I would agree with this. I counted wrappers for 17 methods currently. Yes, it is true that there are more packages in scverse; however, I suggest moderating your claims about Seurat being limited.
* I suggest removing Snakemake from Table 1; it is really different from the other tools listed there.

2. ABSTRACT: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community. Practitioner Points: SeuratExtend streamlines scRNA-seq workflows by integrating R and Python tools, multiple databases (e.g., GO, Reactome), and comprehensive functional analysis capabilities within the Seurat framework, enabling efficient, multi-faceted analysis in a single environment. Advanced visualization features, including optimized plotting functions and professional color schemes, enhance the clarity and impact of scRNA-seq data presentation. A novel clustering approach using pathway enrichment score-cell matrices offers new insights into cellular heterogeneity and functional characteristics, complementing traditional gene expression-based analyses.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf076), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Yu H. Sun

This manuscript introduces an extended version of the widely used Seurat package, named SeuratExtend. Specifically, Hua et al. developed an integrated and intuitive framework to streamline scRNA-seq data analysis, such as trajectory analysis, GRN construction, and functional enrichment analysis. The package also features direct integration with other popular tools, including Seurat, scVelo, etc. Notably, the software has been demonstrated through training programs and has over 100 stars on GitHub, which is impressive. I have tested the package, including installation and some basic functions. Moreover, the GitHub webpage is well documented, featuring multiple use cases tailored for beginners. The overall user experience exceeded my expectations, though I have a few minor comments for improvement:

1, The DimPlot2 function is very useful, and it is easy to customize the colors. However, the default color scheme seems too dark. A more distinguishable and visually appealing color palette might be a solution.

2, How can the angle of the cell type labels be controlled when using VlnPlot2? The 'Split visualization' plots have all labels horizontal, leading to overlap in some cases, while the 'Subset Analysis' plots have labels at 45 degrees, which is much easier to read. However, I didn't see a parameter to control this. Does VlnPlot2 handle this automatically?

3, It's a very nice feature to have the 'Statistical Analysis' function label significant groups. However, in single-cell analysis, p-values are easily inflated due to the large number of cells. While the example pbmc dataset is relatively small, larger datasets might yield significant p-values without obvious differences in the violin plots. It would be beneficial to mention this in the documentation and provide some guidance so the results won't be misleading.

      4, The ClusterDistrBar is another valuable function. Based on my experience with similar analyses, I suggest incorporating features to identify robust changes in cell type composition. For instance, tools like sccomp can help determine changes in cell population composition.

      5, I wonder if the gene label directions can be changed easily for WaterfallPlot?

6, Regarding the volcano plot, does LogFC mean log2 or log(e)? I noticed that this may not be consistent across tools; for example, Seurat's FindMarkers uses log2, while NEBULA uses log(e). Clear labeling on the x-axis and tutorial guidance would help ensure consistency (see the small conversion sketch after this list).

      7, Very nice introduction about the color palettes at the end of the Enhanced Visualization tutorial.

8, The incorporation of Python tools into R is innovative, including scVelo and Palantir. There may be a need to continue incorporating new tools, such as Dynamo, a newer tool I have started to use recently. While this is not required for the current revision, it could be a valuable direction for future development.
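On point 6 above, converting between the two log bases is a one-line calculation; a minimal sketch with an arbitrary value, not taken from any real dataset:

```python
import math

# A natural-log fold change (e.g., as reported by NEBULA) converts to log2
# (as used by Seurat's FindMarkers) by dividing by ln(2).
ln_fc = 0.9  # arbitrary illustrative value
log2_fc = ln_fc / math.log(2)
print(f"ln FC {ln_fc} corresponds to log2 FC {log2_fc:.3f}")  # ~1.298
```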

      Overall, this tool represents a comprehensive extension of Seurat, combining enhanced visualization, pathway enrichment, and trajectory analysis into a single package. I look forward to seeing a revised version of this manuscript.

1. Abstract: Uncovering the epigenomic regulation of immune responses is essential for a comprehensive understanding of host defence mechanisms, though it remains poorly investigated in farmed fish. We report the first annotation of the innate immune regulatory response in the turbot (Scophthalmus maximus) genome, integrating RNA-Seq with ATAC-Seq and ChIP-Seq (H3K4me3, H3K27ac and H3K27me3) data from head kidney (in vivo) and primary leukocyte cultures (in vitro) 24 hours post-stimulation with viral (poly I:C) and bacterial (inactivated Vibrio anguillarum) mimics. Among the 8,797 differentially expressed genes (DEGs), we observed enrichment of transcriptional activation pathways in response to Vibrio and of immune pathways, including interferon-stimulated genes, for poly I:C. We identified notable differences in chromatin accessibility (20,617 in vitro, 59,892 in vivo) and H3K4me3-bound regions (11,454 in vitro, 10,275 in vivo) between stimulations and controls. Overlap of DEGs with promoters showing differential accessibility or histone mark binding revealed significant coupling of the transcriptome and chromatin state. DEGs with activation marks in their promoters were enriched for similar functions to the global DEG set, but not always, suggesting that key regulatory genes are in a poised state. Active promoters and putative enhancers were enriched in specific transcription factor binding motifs, many common to viral and bacterial responses. Finally, an in-depth analysis of immune response changes in chromatin state surrounding key DEGs encoding transcription factors was performed. This multi-omics investigation provides an improved understanding of the epigenomic basis for the turbot immune responses and provides novel functional genomic information, which can be leveraged for disease resistance selective breeding.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Aijun Ma

      In the manuscript "Multiomics uncovers the epigenomic and transcriptomic response to viral and bacterial stimulation in turbot", many investigations were applied to uncover the immune regulatory response in turbot. This multi-omics investigation provides an improved understanding of the epigenomic basis of the turbot immune response and offers novel functional genomic information. However, some aspects need to be considered in order to improve the manuscript, as indicated below.

      1. Line 16: In this sentence, the authors used "the innate immune regulatory response" to describe the response to these two stimuli in a tissue and a cell culture. Innate immunity is a very strict term, and it is not appropriate to use it here.

      2. Lines 34-36: Poly I:C and inactivated Vibrio anguillarum act like PAMPs; the response to these two stimulations cannot represent the process of disease defence. The sentence "which can be leveraged for disease resistance selective breeding" was listed in the conclusions, which is not accurate. I suggest moving this sentence to the outlook section.

      3. Lines 80-87: The head kidney is a key lymphoid organ in most marine fishes and plays a central role in fish immunity; it is inappropriate to discuss only its innate immune function. Vibrio is a common bacterium in seawater, while Vibrio anguillarum is an opportunistic pathogen. Strictly speaking, experimental fish will inevitably encounter Vibrio during rearing before the experiment. I suggest reorganizing the sentences of this paragraph.

    2. AbstractUncovering the epigenomic regulation of immune responses is essential for a comprehensive understanding of host defence mechanisms, though it remains poorly investigated in farmed fish. We report the first annotation of the innate immune regulatory response in the turbot genome (Scophthalmus maximus), integrating RNA-Seq with ATAC-Seq and ChIP-Seq (H3K4me3, H3K27ac and H3K27me3) data from head kidney (in vivo) and primary leukocyte cultures (in vitro) 24 hours post-stimulation with viral (poly I:C) and bacterial (inactive Vibrio anguillarum) mimics. Among the 8,797 differentially expressed genes (DEGs), we observed enrichment of transcriptional activation pathways in response to Vibrio and immune pathways - including interferon stimulated genes - for poly I:C. We identified notable differences in chromatin accessibility (20,617 in vitro, 59,892 in vivo) and H3K4me3-bound regions (11,454 in vitro, 10,275 in vivo) between stimulations and controls. Overlap of DEGs with promoters showing differential accessibility or histone mark binding revealed significant coupling of the transcriptome and chromatin state. DEGs with activation marks in their promoters were enriched for similar functions to the global DEG set, but not always, suggesting that some key regulatory genes are in a poised state. Active promoters and putative enhancers were enriched in specific transcription factor binding motifs, many common to viral and bacterial responses. Finally, an in-depth analysis of immune response changes in chromatin state surrounding key DEGs encoding transcription factors was performed. This multi-omics investigation provides an improved understanding of the epigenomic basis for the turbot immune responses and novel functional genomic information, leverageable for disease resistance selective breeding.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Elisabeth Busch-Nentwich

      This is a careful analysis of a large and high-quality dataset that will be a very useful resource for researchers across disciplines. I commend the authors on their extensive metadata, and comprehensive and well annotated data tables, which make this a truly accessible resource. I don't have any major criticism. A few minor points:

      1. Typo in Figure 1 (it's "immature", not "inmature").
      2. In Fig 3, the UpSet plots could be a bit easier to parse.
      3. Fig 5 doesn't have a legend for the blue gradient (but it's pretty self-explanatory).

    3. AbstractUncovering the epigenomic regulation of immune responses is essential for a comprehensive understanding of host defence mechanisms, though it remains poorly investigated in farmed fish. We report the first annotation of the innate immune regulatory response in the turbot genome (Scophthalmus maximus), integrating RNA-Seq with ATAC-Seq and ChIP-Seq (H3K4me3, H3K27ac and H3K27me3) data from head kidney (in vivo) and primary leukocyte cultures (in vitro) 24 hours post-stimulation with viral (poly I:C) and bacterial (inactive Vibrio anguillarum) mimics. Among the 8,797 differentially expressed genes (DEGs), we observed enrichment of transcriptional activation pathways in response to Vibrio and immune pathways - including interferon stimulated genes - for poly I:C. We identified notable differences in chromatin accessibility (20,617 in vitro, 59,892 in vivo) and H3K4me3-bound regions (11,454 in vitro, 10,275 in vivo) between stimulations and controls. Overlap of DEGs with promoters showing differential accessibility or histone mark binding revealed significant coupling of the transcriptome and chromatin state. DEGs with activation marks in their promoters were enriched for similar functions to the global DEG set, but not always, suggesting that some key regulatory genes are in a poised state. Active promoters and putative enhancers were enriched in specific transcription factor binding motifs, many common to viral and bacterial responses. Finally, an in-depth analysis of immune response changes in chromatin state surrounding key DEGs encoding transcription factors was performed. This multi-omics investigation provides an improved understanding of the epigenomic basis for the turbot immune responses and novel functional genomic information, leverageable for disease resistance selective breeding.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Laura Caquelin

      1. Summary of the Study This study provides the first multi-omics investigation of the innate immune response in turbot (Scophthalmus maximus). By integrating RNA-Seq, ATAC-Seq, and ChIP-Seq data, researchers identified changes in gene expression, chromatin accessibility, and histone modifications after viral and bacterial stimulation. The findings reveal a significant coupling between the transcriptome and chromatin state, offering insights for the selection of disease resistance in aquaculture.

      2. Scope of reproducibility

      According to our assessment, the primary objective is: Association of ATAC-Seq and ChIP-Seq data with RNA-Seq data.

      ● Outcome: Overlap of promoter DARs and DHMRs with DEG promoters
      ● Analysis method for outcome: Hypergeometric test
      ● Main result: "DARs and DHMRs were much more overrepresented at the promoter regions of upregulated rather than downregulated DEGs" (Table 4, Supplementary Table 11; Lines 403-405, Page 9)

      3. Availability of Materials

      a. Data
      ● Data availability: Raw data are available, but generated data from the study are shared with the journal and not yet publicly available
      ● Data completeness: Complete
      ● Access method: Manuscript's supplementary files / private journal dropbox
      ● Repository: -
      ● Data quality: Structured, but lacks variable definitions in supplementary files, making it difficult to interpret and use.

      b. Code
      ● Code availability: Not available for the primary result
      ● Programming language(s): Excel
      ● Repository link: -
      ● License: -
      ● Repository status: -
      ● Documentation: README lacks information on the hypergeometric test.

      4. Computational environment of reproduction analysis

      ● Operating system for reproduction: macOS 14.7.4
      ● Programming language(s): Excel
      ● Code implementation approach: Excel formulas based on the methodology description provided by the authors
      ● Version environment for reproduction: Excel version 16.94

      5. Results

      5.1 Original study results
      ● Results 1: Table 4 and Supplementary Table 11

      5.3 Steps for reproduction

      ● Reproduce Supplementary Table 11 to perform the hypergeometric test
      * Issue 1: No code or instructions for constructing Table 4 in the manuscript and README text.
      ▪ Resolved: Authors shared the methodology upon request. Authors' clarification: "The hypergeometric test wasn't carried out with any particular script but with the following public online tool, which can be replicated in Excel: https://systems.crump.ucla.edu/hypergeometric/ The tool basically runs the following Excel formulas:
      Cumulative distribution function (CDF) of the hypergeometric distribution in Excel:
      =IF(k>=expected,1-HYPGEOM.DIST(k-1,s,M,N,TRUE),HYPGEOM.DIST(k,s,M,N,TRUE))
      i.e., with expected = (s*M)/N:
      =IF(k>=((s*M)/N),1-HYPGEOM.DIST(k-1,s,M,N,TRUE),HYPGEOM.DIST(k,s,M,N,TRUE))
      direction: =IF(k=expected,"match",IF(k<expected,"de-enriched","enriched"))
      fold change: =IF(k<expected,expected/k,k/expected)

      where k is the number of successes (intersection of DAR/DHMR in promoters + DEG), s the sample size (DEG), M the number of successes in the population (DAR/DHMR in promoters) and N the population size (28,602 genes). For each condition, the count of downregulated and upregulated DEGs (s) was taken from Supplementary Table 4. Similarly, the count of downregulated and upregulated DARs/DHMRs (M) was taken from Supplementary Table 10, considering only differential peaks annotated as "promoter-TSS" in the annotation column (column M). The population size (N) was the total list of genes that were DEG, DAR or DHMR (combining the data in Supplementary Tables 4 and 11 and eliminating duplicates). Finally, the intersection of DARs and DEGs (k) for each condition was retrieved with the following Venn diagram online tool: https://bioinformatics.psb.ugent.be/webtools/Venn/"
      * Issue 2: Discrepancies in DEG counts from Supplementary Table 11.
      ▪ Resolved: Investigated variable definitions (the wrong variable, strand, had been used); confirmed that log2FoldChange determines up/down-regulation.
      * Issue 3: Filling in DAR/DHMR values.
      ▪ Unresolved: Unclear correspondence between the "promoters" rows and the Excel file sheets. Does H3K27me3 correspond to the promoters?
      * Issue 4: Using the Venn diagram tool to find intersections.
      ▪ Unresolved: Worked for one condition (ATAC vivo poly, down) but failed for ATAC vitro-vibrio and ATAC vivo-vibrio; the tool returns a "Request Entity Too Large" error.
      * Issue 5: Defining the population size.
      ▪ Unresolved: The instructions for defining the population size are not clear. In Supplementary Table 4, it seems that the variable "Gene ID (ENSEMBL)" should be used, but in Supplementary Table 10, should the variable "Nearest PromoterID" or "Gene symbol" be used?
      ● Using Supplementary Table 11 values to perform the hypergeometric test: having failed to obtain the values required to reproduce Supplementary Table 11, the data already provided were used to obtain the "enrichment" and "p-value" values using the Excel function provided.
      * Issue 1: Comparison of p-values.
      ▪ Resolved: For the Up condition, extremely small p-values are not displayed correctly due to Excel's limitations in scientific notation; Excel may either display them as zero or in an incomplete scientific format (e.g., 0.00E+00). The online tool was used to verify these values.
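      As a reproducibility aside, the whole Excel/web-tool recipe above can be replicated in a few lines of R, which also sidesteps the p-value underflow noted under the last issue. This is a sketch following the reviewers' k/s/M/N definitions; the example counts are hypothetical:

      ```r
      # Hypergeometric enrichment/depletion test mirroring the Excel formulas:
      # k = successes in the sample (DAR/DHMR-promoter DEGs), s = sample size
      # (DEGs), M = successes in the population (DAR/DHMR promoters),
      # N = population size.
      hyper_test <- function(k, s, M, N) {
        expected <- s * M / N
        p <- if (k >= expected) {
          phyper(k - 1, M, N - M, s, lower.tail = FALSE)   # P(X >= k), enrichment
        } else {
          phyper(k, M, N - M, s)                           # P(X <= k), depletion
        }
        direction <- if (k == expected) "match" else
                     if (k <  expected) "de-enriched" else "enriched"
        fold <- if (k < expected) expected / k else k / expected
        data.frame(expected, p.value = p, direction, fold.change = fold)
      }

      hyper_test(k = 120, s = 800, M = 900, N = 28602)     # hypothetical counts
      ```

      Because phyper works in double precision, upper-tail probabilities down to roughly 1e-308 are reported directly rather than being truncated to 0.00E+00 as in Excel.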

      5.4 Statistical comparison: original vs reproduced results
      ● Results: Based on the available data in Supplementary Table 11, the "enrichment" and "p-value" values have been successfully reproduced in most cases.
      ● Comments: The full table could not be reproduced, particularly the data corresponding to DAR/DHMR, DAR/DHMR+DEG and population size values, due to missing information or unclear definitions in the supplementary files.
      ● Errors detected: The enrichment value for the Up condition of promoters-vitro-vibrio was incorrectly reported in the manuscript/table. Based on the Excel formula and the online tool used, the correct value appears to be 2.28 instead of 2.82.
      ● Statistical consistency: All the values that could be reproduced from the available data matched the original results, except for the detected error.

      6. Conclusion

      Summary of the computational reproducibility review: The study's results were partially reproduced. Key values such as enrichment and p-values were successfully replicated, but some dataset elements (DAR/DHMR, DAR/DHMR+DEG, and population size) could not be verified due to insufficient methodological detail in the manuscript. An error in the enrichment value for the Up condition of promoters-vitro-vibrio was identified (2.28 instead of 2.82). The p-values used for statistical inference were, however, successfully reproduced.

      7. Recommendations for authors
      o Improve data documentation: Define variables in the supplementary files.
      o Provide all code and scripts: Share the Excel formulas used for Table 4/Supplementary Table 11.
      o Clarify statistical methodology: Include a detailed description of the hypergeometric test methods.
      o Enhance the reproducibility workflow: Provide a structured README with all necessary steps.

  4. Jul 2025
    1. Editors Assessment:

      This paper presents Chevreul, a new open-source R Bioconductor (meta-)package for processing and integration of scRNA-seq data from cDNA end-counting, full-length short-read, or long-read protocols, together with an R Shiny app for easy visualization, formatting, and exploratory analysis of scRNA-seq data processed in the SingleCellExperiment Bioconductor or Seurat formats. The name of the tool is inspired by the colour theorist Michel-Eugène Chevreul and the optical illusion of the same name. To demonstrate the use of Chevreul, the authors provide a sample analysis, which shows how users can visualize a wide range of parameters, enabling transparent and reproducible scRNA-seq analyses. Peer review also pushed the authors to provide extensive guidance materials to assist with use. Being implemented in R, the package and its integrated Shiny application are freely available under an open-source MIT license in Bioconductor and on their GitHub page: https://github.com/cobriniklab/chevreul

      This evaluation refers to version 1 of the preprint

    2. AbstractChevreul is an open-source R Bioconductor package and interactive R Shiny app for processing and visualization of single cell RNA sequencing (scRNA-seq) data. It differs from other scRNA-seq analysis packages in its ease of use, its capacity to analyze full-length RNA sequencing data for exon coverage and transcript isoform inference, and its support for batch correction. Chevreul enables exploratory analysis of scRNA-seq data using Bioconductor SingleCellExperiment or Seurat objects. Simple processing functions with sensible default settings enable batch integration, quality control filtering, read count normalization and transformation, dimensionality reduction, clustering at a range of resolutions, and cluster marker gene identification. Processed data can be visualized in an interactive R Shiny app with dynamically linked plots. Expression of gene or transcript features can be displayed on PCA, tSNE, and UMAP embeddings, heatmaps, or violin plots while differential expression can be evaluated with several statistical tests without extensive programming. Existing analysis tools do not provide specialized tools for isoform-level analysis or alternative splicing detection. By enabling isoform-level expression analysis for differential expression, dimensionality reduction and batch integration, Chevreul empowers researchers without prior programming experience to analyze full-length scRNA-seq data. Data availability: A test dataset formatted as a SingleCellExperiment object can be found at https://github.com/cobriniklab/chevreuldata.

      Reviewer 1. Dr. Luyi Tian and Dr. Hongke Peng

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. The statement of need is well defined, addressing both the problem (the complexity of scRNA-seq data analysis without programming skills) and the intended audience (non-programming researchers in the field).

      Additional Comments: This study provides Chevreul, a Bioconductor package for the analysis and visualization of single-cell sequencing data. The package contains a Shiny app, and it also provides functions, implemented with a set of Bioconductor packages, for standard scRNA-seq analysis to generate the necessary input for the Shiny app. I believe this app can provide an additional option for researchers who work with single-cell data. However, there are a few comments that need addressing.

      While the title emphasizes "exploratory analysis of full-length single-cell sequencing," the authors do not explicitly describe the analysis of full-length data (e.g., isoform detection or quantification). For instance, the "sce_process(...)" pipeline figure lacks specific steps addressing full-length sequencing workflows. To strengthen this claim, the authors might need to mention/summarize the methods for isoform detection and quantification, for both annotated and novel isoforms. It would be better to specify recommended tools for transcript-level analysis (e.g., transcript assembly or differential isoform usage) that integrate with Chevreul's visualization features. Meanwhile, the manuscript focuses on Smart-seq as the representative full-length method. It might also be helpful to discuss other full-length methods, such as ONT nanopore sequencing or PacBio, with respect to data processing, transcript assembly, novel isoform usage, and potential challenges in adapting Chevreul to these platforms.

      There is another minor suggestion. Functions mentioned in the text and Figure 1 (e.g., “sce_process”, “sce_integrate”) should include parentheses (e.g., “sce_process()”) to align with R syntax conventions and clarify their roles as package functions.

      Re-review: I am happy with the revision, and the authors have fully addressed my concerns.

      Reviewer 2. Dr. Tianhang Lv

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. Chevreul provides tools for exploratory analysis of single-cell data and offers essential tools for the analysis and visualization of single-cell full-length transcriptomes. In several sections of the article, the authors discuss the key computational challenges addressed by this software. However, in the abstract, they need to emphasize the advantages of Chevreul in single-cell full-length transcript analysis (the current version lacks sufficient description). In the "Statement of Need" section, the authors could also highlight the limitations of existing single-cell full-length transcript analysis tools and introduce the advantages of Chevreul in this regard.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Yes. Although the authors have provided installation documentation, the current documentation on GitHub is not user-friendly. For example, the page at https://github.com/cobriniklab/chevreul does not include code for importing seuratTools, yet it runs the built-in function clustering_workflow from seuratTools. Additionally, the current documentation is overly simplistic and not accessible to those without programming experience.

      Is the documentation provided clear and user friendly?

      No. The authors have separated the example workflows for SingleCellExperiment objects and Seurat objects into two different GitHub projects, which is not conducive for users to understand the structure of Chevreul or to facilitate learning. Additionally, the batch integration mentioned in the article lacks specific implementation examples. The authors should at least provide implementation examples for the results mentioned in the manuscript. Furthermore, the current documentation needs further refinement to truly enable individuals without programming expertise to easily analyze single-cell data.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      No. The authors have developed an excellent Shiny app for single-cell visualization, enabling users without programming expertise to freely export visualization results from single-cell analysis. The installation commands provided by the authors on https://github.com/cobriniklab/chevreul do indeed allow for the installation of Chevreul. However, Chevreul involves nearly 300 dependency packages, including sub-libraries developed by the authors (seuratTools, chevreulPlot, chevreuldata, chevreulProcess, chevreulShiny) as dependencies. Relying solely on the installation commands provided by the authors to install all dependency packages may result in some packages (especially large ones) failing to install due to network bandwidth issues, which is not user-friendly for those without programming experience. Additionally, could the numerous dependency packages of Chevreul potentially cause dependency conflicts with existing R environments? Should the authors recommend that users deploy Chevreul in a new R environment? It is recommended that the authors provide a step-by-step installation guide, explaining potential issues and solutions during the installation process based on the dependencies of Chevreul and its sub-libraries. By installing dependency packages step by step, users can gradually complete the installation of Chevreul. The current installation documentation is clearly not user-friendly for non-programmers and does not align with the authors' statement in the manuscript: "It differs from other scRNAseq analysis packages in its ease of installation and use." At present, the installation documentation provided by the authors may not meet the original design intent of Chevreul. Additionally, the authors should specify that Chevreul supports Seurat version V5.
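      One low-effort mitigation such an installation guide could document (a sketch, not the authors' current instructions) is installing into a dedicated library path, so that Chevreul's large dependency tree cannot clash with an existing R setup. This assumes the package installs from Bioconductor, as the editors' assessment states; the library location is hypothetical:

      ```r
      # Use a fresh library directory so Chevreul's dependencies stay isolated
      # from the user's existing R environment.
      lib <- "~/R/chevreul-lib"
      dir.create(lib, recursive = TRUE, showWarnings = FALSE)
      .libPaths(c(lib, .libPaths()))

      if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")

      # Installs chevreul and its dependency tree into the new library.
      BiocManager::install("chevreul", lib = lib)
      ```

      Because already-installed dependencies are skipped, re-running the same command after a network failure effectively resumes the installation step by step, which addresses the bandwidth concern raised above.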

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      No. The authors could provide specifications for the minimum hardware requirements needed to run Chevreul, such as the number of CPU cores and the amount of memory. Additionally, the authors could offer data on the runtime of Chevreul as the volume of data increases.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No.

      Additional Comment. The authors have developed an R Shiny app for single-cell exploratory data analysis, which will significantly expand the application scenarios of single-cell data analysis and bring great benefits to a wide range of biology practitioners. The large size of Chevreul's installation package indicates the considerable difficulty in its development, reflecting the immense wisdom and effort the authors have invested in creating this package. Chevreul's advantages in visualization and analysis are evident, and if further developed and refined, it is certain to attract even more users in the future. To ensure that such an excellent package as Chevreul can be easily and quickly adopted by users, several suggestions for improving the documentation and enhancing user-friendliness are provided. We hope the authors can refine the package based on the reviewers' feedback and recommendations.

      Re-review: I have carefully reviewed the revised manuscript and am satisfied that all my comments have been adequately addressed. The authors have resolved the software errors reported in the original submission by updating the relevant shiny app modules. They have also enhanced the package documentation to assist users without programming experience in installing and using Chevreul. In the manuscript itself, the authors have provided detailed responses and explanations to each of my points.

      Overall, they have addressed all of my comments thoroughly. That said, a few minor issues remain in the manuscript (revised version with tracked changes) that should be corrected to ensure consistency with academic publishing standards and to help readers better learn how to use Chevreul: 1. On line 52, the placeholder “(doi reference for Shayler et al. data to be provided)” appears—did the authors forget to insert the citation or data link? 2. On line 96, would it be more appropriate to replace “SingleCellExperiments” with “SingleCellExperiment objects”? 3. On line 119, please add a space so that “databases[19–21]used” reads “databases [19–21] used.” 4. For consistency, should the second occurrence of “batchelor” on line 132 be italicized? 5. The Chevreul link is already cited in the “Availability & Implementation” section and need not be repeated in the Figure 1 legend. 6. On line 184, the gene symbol “NRL” should be set in italic Latin script. 7. On the GitHub page (https://github.com/cobriniklab/chevreul), the phrase “A demo with a developing human retina scRNA-seq dataset from Shayler et al. is available here” points to an inaccessible web demo. Restoring this demo in a future update would greatly facilitate experimental biologists in learning and using Chevreul.

    1. In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning-based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries; however, they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei, including 4’,6-Diamidino-2-Phenylindole (DAPI), Single-stranded DNA (ssDNA), Hematoxylin and Eosin (H&E), and Multiplex Immunofluorescence (mIF) staining, covering over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked seven state-of-the-art and widely used cell segmentation technologies/methods, and further analyzed the effects of four cell morphology indicators and image gradient on the segmentation results.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf069), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Shan Raza

      The paper presents a multimodal dataset for cell segmentation and benchmarking. The major strength of the dataset is its multimodal nature and its inclusion of both mouse and human tissue. The paper analyses existing datasets and the performance of state-of-the-art methods. However, the authors missed one of the biggest datasets for cell segmentation and classification, which includes more than 500,000 annotated nuclei in H&E: https://www.sciencedirect.com/science/article/pii/S1361841523003079.

      The CoNIC challenge paper also analyses state-of-the-art nuclei segmentation and classification methods. The authors should add one of the best-performing models to their analysis. I would also suggest the authors include PQ and FROC in the evaluation metrics, as these are commonly used in this domain for comparison. I would also suggest comparing the results with HoVer-Net or HoVer-NeXt (https://github.com/digitalpathologybern/hover_next_train), which are state-of-the-art algorithms for nuclei instance segmentation. The code for these algorithms is publicly available.
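      For context (a standard definition from the panoptic segmentation literature, not given in the review itself), panoptic quality combines detection and segmentation quality in a single score:

      $$\mathrm{PQ} = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}$$

      where predicted and ground-truth instances are matched as true positives at IoU > 0.5; the numerator rewards the segmentation quality of matched pairs, while the half-weighted FP and FN terms penalize spurious and missed instances.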

    2. In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning-based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries; however, they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei, including 4’,6-Diamidino-2-Phenylindole (DAPI), Single-stranded DNA (ssDNA), Hematoxylin and Eosin (H&E), and Multiplex Immunofluorescence (mIF) staining, covering over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked seven state-of-the-art and widely used cell segmentation technologies/methods, and further analyzed the effects of four cell morphology indicators and image gradient on the segmentation results.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf069), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Jeff Rhoades

      General comments:

      1. Dataset Innovation: CellBinDB offers a significant improvement over existing datasets with its diversity of staining types (DAPI, ssDNA, H&E, mIF) and broad tissue coverage, including normal and diseased samples.

      2. Benchmarking of Models: The evaluation of seven state-of-the-art segmentation algorithms provides valuable insights for researchers selecting tools for various imaging modalities.

      3. Analysis of Influencing Factors: The manuscript thoroughly examines biological (e.g., cell morphology) and technical (e.g., image gradient) factors affecting model performance, providing practical recommendations for improving segmentation outcomes.

      4. Preprocessing Impact: Demonstrating the effectiveness of preprocessing (e.g., grayscale conversion for H&E images) is an immediately actionable takeaway for practitioners. However, the authors should apply preprocessing uniformly to all segmentation approaches, not just those that performed poorly initially.

      Major Areas for Improvement:

      1. Preprocessing Uniformity: Apply preprocessing steps uniformly across all segmentation approaches to ensure fair comparisons and avoid bias.
      2. Inclusion of Cellpose3 Training Dataset: The manuscript should include the dataset used for training Cellpose3 in its comparisons. Cellpose3's superior generalist model performance is emphasized, yet the absence of its training dataset from the comparisons raises questions about the robustness of the benchmarking.
      3. Evidence of Dataset Utility: While the dataset's benchmarking is well done, the manuscript does not provide evidence that models trained on CellBinDB outperform those trained on other datasets. Addressing this, though potentially out of scope, would strengthen the manuscript's impact.
      4. Figure Panels: Labeling in figure panels should be clearer to enhance interpretability. For instance, indicate whether instance or semantic masks are being shown, and consider making instance segmentation masks colorful to highlight unique IDs. Semantic masks could be omitted if space is constrained, as they are largely redundant with instance masks. Ensure figures are spaced more evenly throughout the text, ideally located near their first references, to improve readability.
      5. Abstract Clarity: The abstract should better reflect the intellectual contributions of the analysis of segmentation performance factors (i.e., cell morphology and image gradients).
      6. Normalization Methods: Provide details on how cell morphology indicators are normalized in the methods section to ensure reproducibility and clarity.
      7. Explanation of Image Gradient: The discussion of gradient magnitude and its calculation using the Sobel operator requires more accessible language. Not all readers will be familiar with this concept, so additional context is essential.
      8. Tissue Classification: Group related tissues, such as "brain," "half brain," and "cerebellum," under a common "neural tissue" category for easier interpretation and analysis.

      Additional Suggestions:
      1. Address grammatical errors and improve clarity in some sections, such as the benchmarking pipeline description.
      2. Replace vague terms like "ML-based" when referring to CellProfiler with specific algorithmic descriptions.
      3. Including public datasets, such as Cellpose, to create a unified, all-inclusive CellBinDB dataset might significantly enhance the resource's utility for machine learning practitioners.
    1. BioSample is a comprehensive repository of experimental sample metadata, playing a crucial role in providing a comprehensive archive and enabling experiment searches regardless of type. However, the difficulty in comprehensively defining the rules for describing metadata and limited user awareness of best practices for metadata have resulted in substantial variability depending on the submitter. This inconsistency poses significant challenges to the findability and reusability of the data. Given the vast scale of BioSample, which hosts over 40 million records, manual curation is impractical. Rule-based automatic ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of BioSample metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are expected to be promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data, which manually curates samples. Our results demonstrated that LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended this approach to extraction of information about experimentally manipulated genes from metadata where manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results for database usage, facilitating more precise filtering of data and preventing misinterpretation caused by the inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, significantly reducing user workload and enabling more effective scientific data management.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf070), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Reviewer: Christopher Tabone**

      This manuscript evaluates the use of large language models (LLMs) to improve the consistency and usefulness of BioSample metadata. The authors focus on extracting specific biological terms from free-text sample descriptions: first, identifying cell line names (using a curated gold standard for evaluation), and second, identifying experimentally modulated gene names (in a scenario without prior manual curation). An open-source 70B LLM (Llama 3.1) was used and its performance was compared against a conventional ontology-mapping pipeline (MetaSRA). Overall, the study is well motivated - addressing the challenge of heterogeneous metadata - and the approach is generally sound and well documented. Below, I address specific aspects of the work in detail:

      Methodological Appropriateness and Controls: The methods are appropriate to the study's aims and are described in detail. The two-part evaluation (cell line extraction and gene name extraction without prior curation) aligns well with the goal of demonstrating LLM utility in metadata curation. The authors took care to construct a gold-standard dataset for cell line extraction by leveraging ChIP-Atlas's manually curated sample annotations. This approach avoids starting from scratch and ensures the evaluation is grounded in experimental metadata. The sample selection strategy is well justified: using equal numbers of ChIP-seq and ATAC-seq samples to control for the presence/absence of protein names (a potential confounder for detecting cell lines), avoiding duplicate projects and identical terms, and restricting to human samples to leverage the Cellosaurus ontology. These controls strengthen the evaluation by preventing bias (e.g. one project dominating results or trivial cases duplicating answers). The LLM pipeline is clearly outlined (Figure 2) - the model is prompted with BioSample attributes to extract a representative cell line term. Importantly, the authors compare this LLM-assisted pipeline against an existing rule-based method (the MetaSRA ontology mapping pipeline). This serves as an essential control/baseline to quantify the improvement gained by using an LLM. For the second task (extracting modulated gene names), where no curated baseline exists, the authors sample thousands of BioSample entries and perform manual evaluation of the LLM's outputs. While manual checking is necessary here, the manuscript could clarify the evaluation procedure (e.g. how many evaluators or what criteria were used) to assure readers of consistency. Overall, the experimental design is solid. The necessary details (model used, prompt design, parameter settings like temperature=0 for reproducibility) are all provided, and the authors have made their code publicly available, which aids reproducibility. The methodology is transparent and should allow others to replicate or build upon the work.

      Support for Conclusions by Data: The conclusions are, for the most part, well supported by the data presented. In the cell line extraction task, the LLM-based method clearly outperforms the traditional MetaSRA pipeline in both accuracy and coverage (Table 4). For example, the LLM pipeline achieved substantially higher coverage (93.0% vs 72.1% for MetaSRA) without sacrificing accuracy (~92.3% vs 90.3%), and it also showed improved precision in identifying non-cell line samples. These results validate the authors' claim that LLMs can more flexibly and comprehensively interpret metadata, mapping many more actual cell line samples to ontology terms while maintaining low false-positive rates. The data support the conclusion that the LLM approach enhances metadata findability (since far more samples get correctly annotated) and does so with high reliability. The authors appropriately note that the conventional method's conservative strategy yields high precision at the cost of leaving many samples unmapped, whereas the LLM can confidently map a greater portion of samples. This finding is well substantiated by the numbers and the error analysis in Table 5 (which categorizes the few failure cases of the LLM, such as confusion with derivative cell lines or missing a cell line when certain keywords were absent). In the gene name extraction task, the authors report that the LLM identified at least one gene in 600 out of 3,723 tested samples, with an overall accuracy of ~80.3% for those outputs (about 91.6% accuracy on gene names themselves, and 84.7% on the associated modulation method). This demonstrates that the LLM can successfully parse complex descriptions to find gene perturbations in a majority of cases. While there is no baseline for direct comparison here, these results are consistent with the idea that LLMs can extend curation to new information types not yet curated (in this case, finding manipulated genes where an ontology or curated list didn't exist). The authors' conclusions about the utility of this - for example, that it could allow users to filter out experiments with gene knockouts/knockdowns to avoid confounding effects - are reasonable extrapolations from the data. The discussion correctly notes that coverage for this gene task wasn't evaluated (since no gold standard exists) and acknowledges that some fraction of relevant cases might be missed. All major conclusions (LLM outperforms rule-based methods; LLM extraction of new metadata is feasible and useful) are backed by the evidence provided. The authors also contextualize their findings by noting limitations and practical considerations (e.g. the processing throughput of ~400 samples/hour and the challenge of scaling to 40 million records). This adds credibility to their interpretation that LLM-based curation will need further resources or model improvements to handle the entire database. In summary, the data presented are analyzed in depth (with relevant tables, figures, and a breakdown of error types), and they support the paper's conclusions well. I have no concerns that the authors are overstating their results.

      Language Clarity and Quality: The manuscript is written in generally clear and professional English. The authors note that they translated the draft from Japanese with assistance from ChatGPT, and the result is readable and scientifically appropriate. The overall clarity is good - important terms are defined, and the narrative flows logically from the motivation to methods, results, and discussion. I did not encounter ambiguities that impede understanding of the science. There are only a few minor issues in language usage and grammar that require attention. For example, there is a small typo in the description of gene overexpression ("achieved by trasfection of a plasmid…" on page 19) - "trasfection" should be "transfection" (unless this typo was carried over from the original prompt). Another example is the sentence "the outcomes of this study can handle these errors to rescue the affected published data for further use," which is a bit awkward in phrasing - perhaps reword to clarify that the methods developed can help correct metadata errors from submitted data. These are relatively minor edits; the manuscript does not require heavy language revision, just light editing for a few misspellings and stylistic "smoothing". The structure of the paper is appropriate, with a clear Introduction and well-labeled sections (Methods, Results/Discussion, Limitations, etc.). Data presentation is also clear: figures and tables are easy to interpret, and captions are explanatory. For example, the flowchart in Figure 2 and the definitions in Figure 3 clearly help in the understanding of the pipeline and metrics. In summary, with minor editorial changes, the quality of language and presentation will be suitable for publication.

      Statistical Analysis and Data Presentation: I am able to assess all the statistics and quantitative analyses in the manuscript, and they appear appropriate. The study primarily uses descriptive performance metrics (accuracy, coverage, precision, recall) to evaluate the extraction tasks - these are standard and well defined (the text and Figure 3 provide clear definitions of each metric in the context of the task). The comparisons between the LLM pipeline and the MetaSRA pipeline are straightforward to interpret. The authors did not perform complex statistical tests (e.g., no p-values are reported), which can be justified given that the magnitude and consistency of the improvements are evident and the evaluation emphasizes practical performance metrics rather than hypothesis testing. However, the manuscript states in Supplementary Table 1 that "no significant differences were observed" between ChIP-seq and ATAC-seq subsets. If the authors intend "significant" to indicate statistical significance, it would be necessary to include the specific statistical test used along with associated test statistics and p-values to substantiate this claim. If no formal statistical testing was conducted, it would be more accurate and clearer to rephrase this as a qualitative observation rather than implying formal statistical support. All underlying data needed to interpret the results are provided either in the main figures/tables or supplementary material. The presentation of results is clear and transparent: Table 4 quantitatively summarizes the performance of each pipeline, and Table 5 qualitatively categorizes the errors made by the LLM. I have no other concerns about the appropriateness of statistical methods used - the evaluation metrics are suitable for information extraction tasks, and the sample sizes (600 samples for the cell line task, and thousands scanned for the gene task) are adequate to support the conclusions. In terms of data transparency, the manuscript indicates that outputs and code are available (with a GitHub repository provided), which will allow others to reproduce the analysis.

      Additional comments and suggestions: Beyond the points above, I have a few minor suggestions to further strengthen the manuscript. First, it would be helpful if the authors could clarify in the Methods how the manual evaluation of gene name extraction was performed—for example, whether multiple curators independently reviewed the outputs or if any consensus procedure was employed to resolve ambiguous cases. Providing this detail would add transparency to the accuracy figures reported, although the existing explanation about handling ambiguous cases (e.g., fusion genes) is already helpful. Second, given the manuscript's emphasis on a zero-shot LLM approach, it would be beneficial for the authors to briefly discuss whether alternative strategies, such as fine-tuning smaller language models, were considered. This would more clearly position the study within the broader landscape of metadata curation techniques. Third, the authors describe the use of the locally deployed Llama 3.1 model and emphasize its advantages regarding data privacy and scalability. Since these benefits are significant for practical adoption, it would further strengthen the manuscript if the authors explicitly highlight practical considerations, such as specific hardware requirements (in addition to the graphics card usage already included) and runtime performance benchmarks. Finally, as mentioned earlier, the authors mention in Supplementary Table 1 that "no significant differences were observed" between ChIP-seq and ATAC-seq samples. If the term "significant" here is meant to indicate statistical significance, please include details of the specific statistical test and associated values (e.g., test statistics and p-values) that substantiate this conclusion. If no formal statistical testing was performed, it would be more appropriate to rephrase this statement to indicate a qualitative observation rather than imply statistical testing. These points are relatively minor and do not indicate fundamental issues with the manuscript.

      Recommendation: In summary, this is a strong manuscript that addresses a pertinent problem in biological data management using modern LLM tools. The methods are sound and well controlled, the results are convincing, and the authors have been appropriately cautious and thorough in their analysis. I recommend minor revisions for this manuscript. The revisions needed are primarily editorial (minor language fixes and clarifications), with one note about statistics, and do not require additional experiments. With those addressed, the work should be suitable for publication in GigaScience.

    2. BioSample is a comprehensive repository of experimental sample metadata, playing a crucial role in providing a comprehensive archive and enabling experiment searches regardless of type. However, the difficulty in comprehensively defining the rules for describing metadata and limited user awareness of best practices for metadata have resulted in substantial variability depending on the submitter. This inconsistency poses significant challenges to the findability and reusability of the data. Given the vast scale of BioSample, which hosts over 40 million records, manual curation is impractical. Rule-based automatic ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of BioSample metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are expected to be promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data, which manually curates samples. Our results demonstrated that LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended this approach to extraction of information about experimentally manipulated genes from metadata where manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results for database usage, facilitating more precise filtering of data and preventing misinterpretation caused by the inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, significantly reducing user workload and enabling more effective scientific data management.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf070), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Sajib Acharjee Dip

      1. The gold-standard dataset constructed for evaluation, though carefully validated by experts, was limited to 600 samples (300 ChIP-seq and 300 ATAC-seq). Such a limited scope may introduce selection bias or fail to capture the full variability present across the entire BioSample database (>40 million records). It is unclear how representative these samples are of real-world metadata submissions. Clearly demonstrate the representativeness of the sample selection or increase the sample size to better represent BioSample's diversity.

      2. The manuscript predominantly compares the proposed LLM-based approach to the MetaSRA pipeline. While MetaSRA is a relevant baseline, the omission of comparisons with other contemporary methods like ChIP-GPT and Bioformer is a notable oversight. These tools represent significant advancements in the field and have demonstrated efficacy in tasks closely related to the study's objectives. A comprehensive evaluation against these methods, or at least a comparative discussion, would provide a clearer understanding of the proposed approach's relative performance and contributions. https://academic.oup.com/bib/article/25/2/bbad535/7600389 https://pmc.ncbi.nlm.nih.gov/articles/PMC10029052/

      2. "LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage." While the study reports improved performance over MetaSRA, the absence of comparisons with other SOTA methods renders this assertion less robust. Without such comparative analyses, it's challenging to attribute the observed improvements solely to the proposed approach.​ Rephrasing claims to accurately reflect the scope of the comparisons made would strengthen clarity.

      4. Despite high accuracy, complex cases (fusion proteins, inhibitors mentioned indirectly, ambiguous terminology) were recognized as difficult, yet were excluded from primary accuracy evaluations. By excluding these ambiguous cases from performance metrics, the accuracy results might be artificially improved. Provide additional metrics that include these complex or ambiguous cases, clearly quantifying performance drops. This would offer more realistic insights into real-world applicability.

      5. The error categorization provided (derivation issues, overlooked terms, selection failures, etc.) is helpful, but somewhat superficial. The deeper root causes—such as the LLM's lack of biological context knowledge, tokenization errors, or prompt ambiguity—were not thoroughly explored or explained. Discuss or perform deeper qualitative analysis on specific error instances, highlighting precisely why the LLM made incorrect decisions (e.g., lack of biological understanding, misinterpretation of abbreviations, limitations of prompt wording).

      6. Temperature settings were fixed at zero for deterministic outputs. While deterministic settings are valuable for reproducibility, exploring or reporting the effect of temperature variations on accuracy and robustness would have strengthened this methodological choice significantly.

      7. The authors have not sufficiently explored or justified their prompt engineering choices, which are critical for reproducibility and optimization. I recommend providing additional experiments or discussions on alternative prompting strategies tested, including prompt variants that failed and reasons why particular prompts were selected.

    1. Despite the surge in data acquisition, there is a limited availability of tools capable of effectively analyzing microbiome data that identify correlations between taxonomic compositions and continuous environmental factors. Furthermore, existing tools also do not predict the environmental factors in new samples, underscoring the pressing need for innovative solutions to enhance our understanding of microbiome dynamics and fulfill the prediction gap. Here, we introduce CODARFE, a novel tool for sparse compositional microbiome-predictors selection and prediction of continuous environmental factors. We tested CODARFE against four state-of-the-art tools in two experiments. First, CODARFE outperformed the other tools at predictor selection in 21 out of 24 databases in terms of correlation. Second, among all the tools, CODARFE achieved the highest number of previously identified bacteria linked to environmental factors for human data—that is, at least 7% more. We also tested CODARFE in a cross-study, using the same biome but under different external effects (e.g., ginseng field and cattle for arable soil, and HIV and Crohn's disease for human gut), using a model trained on one dataset to predict environmental factors on another dataset, achieving a mean absolute percentage error of 11%. Finally, CODARFE is available in five formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify - https://github.com/alerpaschoal/CODARFE.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf055), which carries out open, named peer-review. The following review is published under a CC-BY 4.0 license:

      Reviewer: Jaak Truu

      This manuscript addresses key aspects of microbiome data analysis, particularly in relating continuous variables to microbiome data and utilizing microbiome data to predict variables of interest. The data analysis approach is well-articulated; however, there is a notable omission regarding the derivation of the microbiome datasets. While the sources of these datasets are mentioned, it remains unclear whether the authors processed the initial data to produce the count tables used as input or if these tables were directly adopted from the original publications. Given that the data in the main text are derived from studies based on 16S rDNA sequencing, variations in data processing pipelines between publications could introduce significant variability. Although the manuscript discusses the importance of the sequenced 16S rDNA region and the similarity of the environments from which the samples were obtained, it does not address the impact of the initial data processing pipeline (including taxonomy assignment).

      Additionally, the number of samples in each dataset is not provided in the tables.

The manuscript includes a comparison of the proposed method with other tools; however, it omits MaAsLin (Microbiome Multivariable Association with Linear Models), which has been applied far more extensively in microbiome data analysis than the tools included in the current manuscript. Incorporating a comparison with MaAsLin would enhance the comprehensiveness of the evaluation.

1. Background: Understanding genotype-environment interactions of plants is crucial for crop improvement, yet limited by the scarcity of quality phenotyping data. This data note presents the Field Phenotyping Platform 1.0 data set, a comprehensive resource for winter wheat research that combines imaging, trait, environmental, and genetic data.

Findings: We provide time series data for more than 4,000 wheat plots, including aligned high-resolution image sequences totaling more than 153,000 aligned images across six years. Measurement data for eight key wheat traits is included, namely canopy cover values, plant heights, wheat head counts, senescence ratings, heading date, final plant height, grain yield, and protein content. Genetic marker information and environmental data complement the time series. Data quality is demonstrated through heritability analyses and genomic prediction models, achieving accuracies aligned with previous research.

Conclusions: This extensive data set offers opportunities for advancing crop modeling and phenotyping techniques, enabling researchers to develop novel approaches for understanding genotype-environment interactions, analyzing growth dynamics, and predicting crop performance. By making this resource publicly available, we aim to accelerate research in climate-adaptive agriculture and foster collaboration between plant science and machine learning communities.
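The Findings paragraph leans on heritability analyses as the headline data-quality check. For readers outside plant breeding, one common plot-level (broad-sense) form for multi-environment trials is the following; this is a general textbook expression, not necessarily the exact estimator used in the paper:

```latex
H^2 = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_{G \times E}/n_E + \sigma^2_\varepsilon/(n_E \, n_R)}
```

where \(\sigma^2_G\) is the genotypic variance, \(\sigma^2_{G \times E}\) the genotype-by-environment variance, \(\sigma^2_\varepsilon\) the residual variance, \(n_E\) the number of environments (here, year-site combinations), and \(n_R\) the number of replicates per environment.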

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Wanneng Yang

      The manuscript presents a comprehensive dataset spanning six years, encompassing data from eight key growth stages of wheat, along with corresponding phenotypic data. The construction of such a comprehensive dataset is highly valuable. However, from the perspective of dataset construction itself, quality control and consistency checks require further refinement. Specific issues are as follows:

1. How is the consistency check of parameters such as canopy cover and plant height at the eight key growth stages ensured? Especially for parameters like phenological stages and senescence assessment, which are determined through visual evaluation and are thus susceptible to subjective influences, quality control and consistency checks become particularly crucial. It is recommended to supplement the text with a detailed explanation.

2. Alignment and within-field detection succeeded for 151,150 of 158,891 images (a success rate above 95%). Does this mean that the final RGB sequence image dataset consists of 151,150 images?

      3. Regarding plant height measurement, the text mentions that "TLS (2016, 2017) or UAV (2018 to 2022) was used to measure plant height." Given the potential differences in height measurements obtained from these two methods, how were these differences addressed in the manuscript?

      4. Does this dataset cater to different tasks and include annotated data? If so, it is recommended to specify the concrete annotation methods and data.

      5. If possible, it is recommended to provide a summary table that specifies the different types of data contained in the dataset along with their respective quantities, facilitating readers' comprehensive understanding of the dataset.

      6. What are the potential limitations of this dataset? It is recommended to point them out.

2. Background: Understanding genotype-environment interactions of plants is crucial for crop improvement, yet limited by the scarcity of quality phenotyping data. This data note presents the Field Phenotyping Platform 1.0 data set, a comprehensive resource for winter wheat research that combines imaging, trait, environmental, and genetic data.

Findings: We provide time series data for more than 4,000 wheat plots, including aligned high-resolution image sequences totaling more than 153,000 aligned images across six years. Measurement data for eight key wheat traits is included, namely canopy cover values, plant heights, wheat head counts, senescence ratings, heading date, final plant height, grain yield, and protein content. Genetic marker information and environmental data complement the time series. Data quality is demonstrated through heritability analyses and genomic prediction models, achieving accuracies aligned with previous research.

Conclusions: This extensive data set offers opportunities for advancing crop modeling and phenotyping techniques, enabling researchers to develop novel approaches for understanding genotype-environment interactions, analyzing growth dynamics, and predicting crop performance. By making this resource publicly available, we aim to accelerate research in climate-adaptive agriculture and foster collaboration between plant science and machine learning communities.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Abhishek Gogna

Thank you for the submission. The dataset surely holds value for the plant breeding community, but my major concerns are (1) the availability of genetic data, and (2) non-conformity to MIAPPE standards (https://www.miappe.org/). These restrict the value of the otherwise excellent publication. I would welcome a submission addressing these major points. In addition, I have some minor points for specific sections. Please use the strings in quotation marks ("") to locate the specific sections.

1. Context
   - Change of Equipment: Please indicate how the change of equipment from TLS to drone affects data interoperability.
   - "Figure 2, gray bars": Kindly update Figure 2 to clarify the representation of the gray bars.
   - "Heads were annotated": Does this mean that not all relevant images were annotated? If so, please modify the title to avoid confusion.

2. Description of FAIR: Please revise this section. Both links listed under "Findable" and "Accessible" are eligible for these tags. Please modify "Interoperability" with reference to the publication listed in the "Re-use Potential."

3. Reference measurements
   - "Senescence was": Was this measurement done for all relevant images? Please include this information.
   - "Adjusted genotype means with year calculation": Please add variance decomposition data for the traits.

4. Compilation as Data set
   - "pure GABI-WHEAT set for the extended set": Please revise this sentence for clarity.

5. Heritabilities of intermediate and target traits
   - "y of the public marker": Please revise the sentence for clarity.

6. Genomic prediction ability of unseen multi-environment trial
   - Is the CDC data part of the data publication? Please add this information.

7. Examples 1 to 6
   - Please revise all code for consistency and updated results. Also, include the necessary packages required to run the code.

8. Availability of Source code and Requirements
   - Please create connectivity between the repositories and add descriptive README files outlining their usage. Additionally, please provide instructions on how the individual repositories may be used.

I appreciate your attention to these points and believe that addressing them will strengthen your manuscript.

1. Background: Characterising genetic and epigenetic diversity is crucial for assessing the adaptive potential of populations and species. Slow-reproducing and already threatened species, including endangered sea turtles, are particularly at risk. Those species with temperature-dependent sex determination (TSD) have heightened climate vulnerability, with sea turtle populations facing feminisation and extinction under future climate change. High-quality genomic and epigenomic resources will therefore support conservation efforts for these flagship species with such plastic traits.

Findings: We generated a chromosome-level genome assembly for the loggerhead sea turtle (Caretta caretta) from the globally important Cabo Verde rookery. Using Oxford Nanopore Technology (ONT) and Illumina reads followed by homology-guided scaffolding, we achieved a contiguous (N50: 129.7 Mbp) and complete (BUSCO: 97.1%) assembly, with 98.9% of the genome scaffolded into 28 chromosomes and 29,883 annotated genes. We then extracted the ONT-derived methylome and validated it via whole genome bisulfite sequencing of ten loggerheads from the same population. Applying our novel resources, we reconstructed population size fluctuations and matched them with major climatic events and niche availability. We identified microchromosomes as key regions for monitoring genetic diversity and epigenetic flexibility. Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date.

Conclusions: We present a high-quality loggerhead sea turtle genome and methylome from the globally significant East Atlantic population. By leveraging ONT sequencing to create genomic and epigenomic resources simultaneously, we showcase this dual strategy for driving conservation insights into endangered sea turtles.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: F Gözde Çilingir

In this study, the authors generated a high-quality chromosome-level genome assembly and methylome for the loggerhead sea turtle (Caretta caretta) using a combination of Oxford Nanopore Technology (ONT) and Illumina sequencing. They also examined population size fluctuations, identified microchromosomes as key areas for monitoring genetic diversity and epigenetic flexibility, and focused on genes linked to temperature-dependent sex determination (TSD), with additional datasets from 10 individuals using whole-genome bisulfite sequencing (WGBS).

The study consists of three key parts: 1) genome sequencing and assembly, 2) benchmarking ONT methylation calls with WGBS, and 3) epigenetic patterning of TSD-linked genes, which was contextualized for future studies. The first part certainly includes relatively novel genomic resources that will provide valuable tools for conservation and population genomics. It's encouraging to see the use of DNA modification detection via ONT, with a comprehensive analysis of 5mC and 5hmC methylomes alongside genomes—especially for chelonians, a group that is underrepresented among available vertebrate genomes. Benchmarking ONT methylation calls with WGBS is also relevant for the field (though some clarifications on the experimental design are necessary). However, I have several concerns regarding the biological rationale of certain study design choices and the conclusions drawn by the authors regarding the TSD-linked genes' methylation patterns. Overall, this study provides valuable genomic resources for loggerhead sea turtles. However, some of the biological assumptions and study design choices regarding the methylation patterning require further clarification and a more robust discussion to ensure that the conclusions drawn can be supported by the data produced.

Detailed comments to the authors

ABSTRACT

The abstract states: "Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date." Throughout the manuscript, this number changes. Please double-check and ensure consistency in the number of TSD-linked genes reported.

BACKGROUND

I suggest using the phrase "a skew toward female-biased sex ratios" instead of "feminisation" throughout the text for a clearer and more neutral description of the biological phenomenon. For example, the third sentence of the second paragraph could be revised as: "As multiple theoretical studies have predicted a significant skew toward female-biased sex ratios and subsequent population collapse by 2100 in response to future climate scenarios."

METHODS

Page 5, DNA extraction, sequencing, and quality control, first paragraph: ONT kit chemistry numbers and flow cell types can be confusing for readers. Could you also clarify that the SQK-LSK109 kit used is associated with R9.4.1 flow cells, indicating the sequencing error profile of the technology?

Regarding the Phred score >Q8 cutoff: Q8 corresponds to a sequencing error rate of ~15-16%. Could you clarify the reasoning behind choosing this cutoff? Citing similar studies that have used this threshold would add support to your decision.

Page 8: I couldn't find the de novo assembled transcriptomes in the ENA or GigaDB repositories. Are these data publicly available? If so, it would be beneficial to provide the location.

Page 9, ONT methylation call and validation with WGBS: There's a discrepancy between the retained CpGs: you mention "26,449,075 CpGs" in one place and later report different numbers in the results section. Please clarify these numbers and ensure consistency. It would also be helpful to include a table summarizing key metrics of the ONT methylation call, such as mean/median CpG site coverage, similar to Table S3.

Page 9, second paragraph: You mention "Ten nesting loggerheads." Please specify that these are ten adult loggerhead females for clarity. Additionally, correct the table references: Table S3 should be Table S2, Table S4 should be Table S3, etc.

RESULTS AND DISCUSSION

Genome Assembly: Figure 1B: While Table 1 effectively illustrates the differences in contiguity levels, Figure 1B doesn't add much due to the difficulty in distinguishing closely aligned lines. If you retain the figure, I suggest using more contrasting colors to improve readability.

Genome Annotation: I agree that the lack of a pre-determined training parameter set for chelonians within the BRAKER pipeline leads to relatively incomplete gene model predictions. However, lifting over gene models from other sea turtle genomes and combining them with predictions (again using TSEBRA) would likely improve the overall completeness of the annotations.

Methylation Call and Validation: You state, "To verify our ONT methylation call, we compared calls with ten loggerhead methylomes re-sequenced via WGBS." Does this mean you generated an ONT methylome from a single individual and compared it to the average methylation levels from ten different individuals obtained with WGBS? If so, this may not be an ideal benchmarking strategy. Generating both ONT and WGBS data for all individuals would provide a more robust comparison. Clarifying this design would help the reader understand the validation process better. Additionally, consider citing relevant benchmarking studies.

In the last paragraph of this section, you highlight ONT as a robust alternative to WGBS but then use WGBS for the TSD-linked gene analysis. This appears somewhat contradictory. It might be useful to explain why WGBS was favoured in this part of the analysis.

Genome Properties: Figures 3C-F were difficult for me to read (low resolution), and they don't seem directly related to Figures 3A and 3B. I suggest separating these figure groups for better clarity. Additionally, it would be helpful to report or visualize the repeat content of both micro- and macrochromosomes. Long-read sequencing assemblies are particularly effective at resolving repeat-rich regions, and microchromosomes are often repeat-rich. Highlighting this aspect would demonstrate the added value of long-read sequencing for assembling reference genomes of organisms like sea turtles.

TSD-linked genes: methylation patterns: Testing methylation differences between TSD-linked and non-TSD-linked genes focusing on specific regulatory regions is potentially informative, but the biological rationale for expecting consistent differences between these two groups is unclear. TSD-linked genes are involved in dynamic, environmentally responsive processes, whereas non-TSD-linked single-copy orthologues (as used in the study) typically represent essential, evolutionarily conserved functions with more stable methylation patterns. The use of single-copy orthologues as a control set is problematic because these genes could serve fundamentally different roles. A more relevant comparison would be between TSD-linked genes and other genes involved in similarly dynamic, environmentally responsive pathways.

Additionally, all methylation data come from adult female blood (N=10, all from the same beach), which may not be the most appropriate approach for studying TSD, a process that primarily occurs during embryonic development, when temperature cues influence sex determination. Methylation patterns in adults may no longer reflect the active regulatory processes that control TSD during embryogenesis. In other words, adult methylation patterns could be influenced by factors such as reproductive status or aging, and may not reflect the regulation of TSD-linked genes during key developmental stages. These limitations/points should be addressed.

CONCLUSIONS

The manuscript would benefit from a discussion of how biological context (such as developmental stage) affects the interpretation of methylation patterns in this study. It is also worth mentioning that both ONT and WGBS require substantial amounts of input DNA, and blood samples from reptiles are ideal because of their nucleated red blood cells; this could be acknowledged as a practical advantage somewhere in the text.

SUPPLEMENTARY INFO

Could you explain what "DMS" refers to in Text S3? This term isn't defined in the manuscript. There are two figures labelled Figure S7; please change the second one to Figure S8.

SUPPORTING DATA

The FTP server data look good, but I couldn't find the de novo transcriptomes. Some files have long, confusing names; adding a README file in each directory would help clarify the contents.

Important note: It would be helpful to include line numbers in the manuscript to facilitate direct and effective feedback.
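As a quick check on the reviewer's Q8 comment above, the error rate follows directly from the Phred definition:

```latex
P_{\text{error}} = 10^{-Q/10}, \qquad Q = 8 \;\Rightarrow\; P_{\text{error}} = 10^{-0.8} \approx 0.158
```

so a >Q8 cutoff indeed retains reads with per-base error rates of up to roughly 15-16%.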

2. Background: Characterising genetic and epigenetic diversity is crucial for assessing the adaptive potential of populations and species. Slow-reproducing and already threatened species, including endangered sea turtles, are particularly at risk. Those species with temperature-dependent sex determination (TSD) have heightened climate vulnerability, with sea turtle populations facing feminisation and extinction under future climate change. High-quality genomic and epigenomic resources will therefore support conservation efforts for these flagship species with such plastic traits.

Findings: We generated a chromosome-level genome assembly for the loggerhead sea turtle (Caretta caretta) from the globally important Cabo Verde rookery. Using Oxford Nanopore Technology (ONT) and Illumina reads followed by homology-guided scaffolding, we achieved a contiguous (N50: 129.7 Mbp) and complete (BUSCO: 97.1%) assembly, with 98.9% of the genome scaffolded into 28 chromosomes and 29,883 annotated genes. We then extracted the ONT-derived methylome and validated it via whole genome bisulfite sequencing of ten loggerheads from the same population. Applying our novel resources, we reconstructed population size fluctuations and matched them with major climatic events and niche availability. We identified microchromosomes as key regions for monitoring genetic diversity and epigenetic flexibility. Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date.

Conclusions: We present a high-quality loggerhead sea turtle genome and methylome from the globally significant East Atlantic population. By leveraging ONT sequencing to create genomic and epigenomic resources simultaneously, we showcase this dual strategy for driving conservation insights into endangered sea turtles.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Victor Quesada

This work offers an improved version of the reference genome for the loggerhead sea turtle. The authors have also analyzed the methylation patterns of blood obtained from different individuals and with two methods. The resulting data set includes gene annotations, methylation levels and the specific analysis of methylation levels of genes involved in temperature-dependent sex determination (TSD). While the improvements offered by this work seem modest, I think that the data sets may provide important resources for future works.

- In my opinion, the use of a previous version of the same genome in the assembly process should be noted in the abstract. It would be enough to write "... followed by homology-guided scaffolding to GSC_CCare_1.0...".
- If possible, the authors should clarify the taxonomic relationship between the reference individual in this work and the reference individual for the previous version of the genome (ref. 26). Is it the same NCBI taxid?
- There is a mention of "lateral terminal repeats" in the "Genome annotation" section (page 7). I think it is a typo and it should read "long terminal repeats".
- In the same section, at page 9, reference 73 refers to StringTie, not gffread. In addition, it is not clear how "in-frame stop codons were removed". A simple way to unambiguously explain this would be to provide the options that were used, as with other programs.
- I would revise the use of "coverage" versus "depth". For instance, the expression "...a coverage of 9.2(...)X" would be more precise as "...a sequencing depth of 9.2(...)X". Coverage should be a fraction or a percentage. However, this is only a piece of advice, as there is no strong consensus at the moment.
- The interpretation of methylation patterns is always difficult. In my opinion, the manuscript should discuss several limitations of the results. First, using blood as the starting tissue is convenient but not ideal, as many methylation patterns are tissue-specific. The authors may want to add a reference to preliminary evidence that some methylation changes in blood cells are related to TSD (Bock et al., Mol Ecol. 2022; 31:5487-5505). Second, the work examines broad patterns of methylation (all promoters, all coding sequences, ...). While this may be interesting for descriptive purposes, it may also drown significant signals. The manuscript should mention this limitation.
- Figure 2B shows methylation per gene. If the aim is to compare both kinds of sequencing, there should be at least one comparison of methylation per CpG, which might even be categorical or downsampled.
- The origin of the duplication of EP300 seems outside the scope of the manuscript. Nevertheless, given that the question is posed, the authors may want to perform a simple phylogenetic analysis of the sequences. Even a basic analysis of the annotated copies plus an outgroup is likely to give a robust answer to this question.
- For the benefit of non-specialists, the manuscript might include a brief mention of how microchromosomes allow a larger number of combinations of variants without chromosome recombination.
- Some expressions may be edited for clarity and precision. Examples are "which should be verified whether they are true" (page 17) and "microchromosomes have greater methylation potential and realised levels...".

3. Background: Characterising genetic and epigenetic diversity is crucial for assessing the adaptive potential of populations and species. Slow-reproducing and already threatened species, including endangered sea turtles, are particularly at risk. Those species with temperature-dependent sex determination (TSD) have heightened climate vulnerability, with sea turtle populations facing feminisation and extinction under future climate change. High-quality genomic and epigenomic resources will therefore support conservation efforts for these flagship species with such plastic traits.

Findings: We generated a chromosome-level genome assembly for the loggerhead sea turtle (Caretta caretta) from the globally important Cabo Verde rookery. Using Oxford Nanopore Technology (ONT) and Illumina reads followed by homology-guided scaffolding, we achieved a contiguous (N50: 129.7 Mbp) and complete (BUSCO: 97.1%) assembly, with 98.9% of the genome scaffolded into 28 chromosomes and 29,883 annotated genes. We then extracted the ONT-derived methylome and validated it via whole genome bisulfite sequencing of ten loggerheads from the same population. Applying our novel resources, we reconstructed population size fluctuations and matched them with major climatic events and niche availability. We identified microchromosomes as key regions for monitoring genetic diversity and epigenetic flexibility. Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date.

Conclusions: We present a high-quality loggerhead sea turtle genome and methylome from the globally significant East Atlantic population. By leveraging ONT sequencing to create genomic and epigenomic resources simultaneously, we showcase this dual strategy for driving conservation insights into endangered sea turtles.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer: Zhongduo Wang

The study presents high-quality genomic and methylomic data for loggerhead sea turtles, serving as a significant resource for further genomic and epigenomic research on this species. Notably, this is the first methylome derived from a sea turtle using ONT technology, offering a new, reliable method for studying the epigenetic characteristics of non-model organisms. Moreover, by integrating genomic and methylomic data, the authors analyze the functionality and methylation patterns of TSD-related genes, contributing fresh perspectives to the molecular mechanisms underlying TSD. While the study offers valuable data, there are several areas that could be enhanced.

1) Lack of Reference to the Hawksbill Turtle Genome: The manuscript does not discuss any information regarding the hawksbill turtle genome. Given that the published hawksbill turtle genome included a comparative analysis with the loggerhead's genomic data, I recommend that the authors include relevant information or clarify why the hawksbill data was not considered.

2) Further Optimization of Genome Annotation: The authors acknowledge that the completeness of the genome annotation requires enhancement and mention future improvements such as species-specific parameter adjustments and manual curation. While it is understandable that time and resource constraints may have limited these optimizations prior to submission, it would be beneficial for the authors to clarify the reasons for this and outline a timeline for future enhancements.

3) Information on Individual Variability in WGBS Results: The manuscript lacks specific information on inter-individual variability among the ten individuals in the WGBS data. I suggest that the authors consider adding this analysis or provide justification for its absence. If significant variability exists among individuals, averaging the methylomic data could obscure important biological information.

4) Clarification on Statistical Tests and Data Processing: The manuscript employs several statistical tests, such as t-tests, F-tests, and chi-squared tests. However, the methods section lacks detailed information on how the data was processed for these analyses. I recommend that the authors provide a more thorough explanation of the data preparation steps, the assumptions checked, and the justification for the choice of tests.

In summary, this manuscript makes a significant contribution to the study of loggerhead turtle genomics and methylomics. Addressing the aforementioned points could further enhance the quality and impact of the work.

    1. Venoms have traditionally been studied from a proteomic and/or transcriptomic perspective, often overlooking the true genetic complexity underlying venom production. The recent surge in genome-based venom research (sometimes called “venomics”) has proven to be instrumental in deepening our molecular understanding of venom evolution, particularly through the identification and mapping of toxin-coding loci across the broader chromosomal architecture. Although venomous snakes are a model system in venom research, the number of high-quality reference genomes in the group remains limited. In this study, we present a chromosome-resolution reference genome for the Arabian horned viper (Cerastes gasperettii), a venomous snake native to the Arabian Peninsula. Our highly-contiguous genome allowed us to explore macrochromosomal rearrangements within the Viperidae family, as well as across squamates. We identified the main highly-expressed toxin genes compousing the venom’s core, in line with our proteomic results. We also compared microsyntenic changes in the main toxin gene clusters with those of other venomous snake species, highlighting the pivotal role of gene duplication and loss in the emergence and diversification of Snake Venom Metalloproteinases (SVMPs) and Snake Venom Serine Proteases (SVSPs) for Cerastes gasperettii. Using Illumina short-read sequencing data, we reconstructed the demographic history and genome-wide diversity of the species, revealing how historical aridity likely drove population expansions. Finally, this study highlights the importance of using long-read sequencing as well as chromosome-level reference genomes to disentangle the origin and diversification of toxin gene families in venomous species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer: Hardip Patel

Dear Authors, thank you for compiling this resource and the manuscript. I apologise for the delay in my review. I have read the manuscript with great interest. I have some major concerns that need to be addressed and a number of minor concerns. Without line numbers, it was difficult to provide comments; I have chosen to quote the part of the sentence that each comment refers to for you to consider for improvements.

      Major concerns:

- The abstract can include quantitative values for some key results, such as the genome size, contiguity (e.g. N50, L90) and quality metrics (e.g. BUSCO) of the genome assembly, among other result claims listed in the abstract.
- Venom, as the keyword, can perhaps be described/defined. The authors interchangeably use "venom", "toxin", "venom toxin", and "genes coding venom proteins". I strongly suggest the use of consistent terminologies that are well defined in the manuscript.
- The Methods need elaborate descriptions of reagents and procedures, including library preparations, sequencing machines, library kits and versions, etc. These are relevant for the downstream analyses.
- For all software, list the parameters used; even if default, explicitly state that "default parameters were used". For all software, list the version numbers used for the analyses.
- The authors are urged to change the terms "macrosynteny" and "microsynteny" to chromosome-level and local synteny analyses. This is to avoid confusion related to macro/microchromosomes.
- The "genomic diversity" analyses use cross-species alignments and variant calling with software and methods developed for same-species data. This can introduce significant bias in the downstream interpretation and use of the variant data (perhaps the heterozygosity measure). I suggest removing this section because of the lack of accuracy.
- Discussion of new discoveries is largely lacking. I would appreciate it if the authors contextualized their results with other discoveries in the field. Section headings in the Results and Discussion can be changed to reflect main findings instead of "transcriptomics" or "genomic diversity".
- One of the main findings concerns the SVMP gene family expansion. However, due to the lack of evidence about assembly accuracy in the region, accurate annotation of copies, and the effect of studying the "primary assembly" instead of the "haplotype assembly" in this region, I am not convinced of the claims made in the paper. Appropriate justification is required for this section.
- The nomenclature of the SVMP genes is confusing. For example, in Figure 4A they are all labelled as SVMPs with different colours, but they are labelled as MDCs and MADs in Figure 4B and Supp Figure 6. Please label each gene in each species with consistent names that reflect orthologous relationships. This is hard to discern, especially without appropriate species labels in Supp Figure 6.
- Provide the MSA files and trees used to infer the evolutionary history. In the absence of the sequence alignments and raw tree files, I am unable to evaluate this section of the manuscript. Please provide all required details for reviewers and readers.
- "??" marks below indicate that it is not clear what the authors mean by the word, term, or phrase. Please correct them to convey accurate meaning using established and accepted scientific terminologies and English conventions.

Minor concerns:

Abstract:

- "compousing" ??
- "highly expressed toxin genes": in what tissues?
- "genome-wide diversity" ??
- "toxin gene families in venomous species" -> "toxin gene families in venomous snake species"

Background:

- "Such advances in sequencing technologies": remove "Such".
- "depending on their type, interactions, and the organism": interactions with what?
- "proteomic (and transcriptomic) approaches": remove the parentheses.
- "to new therapies for human illnesses including but not": since the title contains "medically important", it would be great to include some specific examples here from the literature.
- "However, venomous snakes are one": remove "However".
- "therefore, the fundamental model system": change "fundamental" to "useful".
- "of medical importance by the World Health Organization (WHO) due to their": provide a citation.
- "Within venomous snakes, the most medically": restructure the sentence for brevity and clarity.
- "cytotoxic effects (among others)": remove "(among others)".
- "conducted using a proteomic approach": clarify what "proteomic approach" means here.
- "Hirst et al., (in review);": remove this citation.
- "within the Viperidae family posses an available reference": change the word "posses" to something meaningful.
- "Moreover, employing several -omics techniques": be specific about the techniques.
- "We deciphered numerous genomic attributes": be specific.

Methods:

- Describe how blood was extracted from the animals, with all details including animal handling techniques, body part, etc.
- "was stored in RNAlater until RNA extraction": give the source for RNAlater.
- "We extracted gDNA from the blood of a female individual": provide additional details such as the quantity of blood used, the thawing process, quantities of reagents (especially elution buffer), etc. Manufacturer protocols may be best suited to mammalian blood (humans, mice), which lacks nucleated RBCs, unlike snakes.
- "Then, we sequenced a total of two 8M SMRT HiFi cells, aiming for a ∼30x of coverage, at the University of Leiden": provide details of the library preparation, sequencing machine, etc.
- "(including venom glands, tongue, liver and pancreas, among others": either list all tissues or refer to the table.
- "RNA libraries were prepared with the VAHTS": were the library and sequencing strand-specific? Provide complete details on these processes.
- "8M SMRT HiFi cell containing two Iso-seq HiFi libraries": use the correct names for these and also include sequencing machine details.
- "Quality control on HiFi and Illumina reads was assessed using FastQC": correct the phrasing of this sentence.
- "To make an initial exploration of the genome, ….. we generated a k-mer profile with Meryl": explicitly state the purpose of this analysis.
- "Manual curation was performed with Pretext": cite Pretext properly. Explain the decisions of this manual curation, i.e. what evidence was used to join or break contigs.
- "Then, we ran three iterative rounds of RepeatMasker to annotate the known and unknown elements identified by RepeatModeler and soft-masked the genome for simple repeats": break this sentence into two and explain the reasons for running RepeatMasker three times.
- "We used GeMoMa v.1.9": include all details about the annotations; this sentence is not sufficient for reproducibility. Were the RNA-seq data assembled or provided as raw files to GeMoMa? How were they mapped to the genome assembly?
- "published: Anolis carolinensis from Alföldi": remove the word "from" here, as the citation is sufficient. Provide details of the assembly versions, annotation versions, database of annotations, etc.
- "Crotalus ruber from Hirst et al., (in review)": remove this citation or list it as a personal communication.
- "We previously quality checked and removed the adapters of the RNA-seq data": remove "previously" and provide details on how adapters were removed from the RNA-seq data.
- "also removed the adapters for the Iso-seq data": explain how this was performed.
- "We blast our ..": change all occurrences of "blast" to "BLAST" and specify the parameters, and whether it was BLASTN or BLASTP or something else. This is not clear at all.
- "we performed additional annotation steps for venom genes.": the details are not complete enough for reproducibility. State explicitly what decisions were made and how gene structure was determined. This is the main part of the paper and does require accurate details.
- "Whole-genome synteny was explored between": synteny by definition refers to being on the same string/chromosome, so "whole-genome synteny" as a term doesn't make sense given that the genome is divided into chromosomes. Revise it to say "chromosomal synteny".
- "chromosomes assembled in the reverse complement, which were corrected using SAMtools faidx": samtools faidx cannot do this. Explain how this was done.
- "After adapter trimming and quality control, we mapped our RNA-seq reads": how were adapter trimming and QC implemented?
- "Gene counts per gene": change gene counts to read counts.
- "Differential expression analyses were carried out": this requires additional details, such as the filters applied to the counts, the groups compared, the statistical model, and the multiple-testing correction method.
- "characterize the venom arsenal of Cerastes gasperettii": change the word "arsenal".
- "Fragmentation spectra were matched against a customized database including the bony vertebrates taxonomy dataset of the NCBI non-redundant database": revise for accuracy.
- "Unmatched MS/MS spectra were de novo sequenced": how were spectra sequenced?
- "we used blast, incorporating both toxin and non-toxin paralogs": change blast to BLAST and provide additional details about the tool used.
- "Then, we aligned those regions using Mafft (Katoh": provide the coordinates of these regions in each assembly for future research.
- "history for the main groups of toxins (i.e.,": the parenthesis is not closed. Close it or remove it.
- "we also included other non-toxin paralogous genes from nontoxic species (for details about this see Supplementary Information": where do I look in the supplementary information? Be very clear. Provide the coordinates of the regions that were compared.
- "When needed, we translated CDS": when was this needed? Explain.
- "built a phylogeny for each of the toxin groups using Phyml": I presume that this was done with the translated CDS sequences in the toxin genomic regions. Please clarify.
- "Heterozygous positions were obtained from bam files with Samtools v1.9": provide details as to how this was done. Samtools doesn't have features to operate at a site level, and therefore I am confused.
- "Filtered reads were mapped against the new reference genome of Cerastes gasperettii using the bwa mem algorithm": bwa mem is designed for same-species comparisons; here you have used it across species. Provide justification, and perhaps the biases it may have introduced for distantly related species.
- "SNP calling was carried out …": this is not appropriate, as the models assume same-species data. You have used cross-species alignments, which can be highly biased.

Results and Discussion:

- "PacBio HiFi (~40x), Hi-C (~60x) and Illumina data (~78x)": change to the number of base pairs. 40x for a genome of 2 GB is 80 GB of data, and for a genome of 1 GB it is 40 GB of data. Before sequencing and assembly, the genome size cannot be known.
- "After manual curation, we enhanced the scaffolding parameters of our genome": what was done as manual curation? Please specify.
- "∼228 times more contiguous than the Anolis sagrei genome": how is "228 times more" measured? How is this useful as a metric without a known ground truth? Assemblies can and do have errors.
- "27,158 different protein-coding genes within our assembly": this seems large compared to other species. Can you elaborate or compare these numbers with other species?
- "Toxin genes usually found in venomous snakes (see proteome results below) were mainly found on macrochromosomes, although major toxin groups were found on microchromosomes (SVMPs, SVSPs and PLA2; Fig. 1).": please revise this statement; the two parts of the sentence are saying opposite things. Second, provide the coordinates of these genes as a GFF/BED supplementary file with their exon structure annotations so that others can reuse this information.
- "showed a great level of similarity between Cerastes gasperettii and Crotalus adamanteus": provide quantitative metrics for a "great" level of similarity.
- "we found several fission events in the A. sagrei genome,": since the A. sagrei genome is not contiguous and chromosome-scale, you cannot infer fissions, as they may be artefacts of a non-contiguous assembly. If that is not the case, provide evidence of this.
- "The last four…": belongs in the Methods.
- "Macrosyntenic differences between lizards and snakes": this is a very superficial discussion point. Please remove it or strengthen it with evidence.
- "Heatmap analyses with the most 2,000": revise this statement; it doesn't make sense. For example, a heatmap is a visualisation technique, not an analysis method.
- "We studied venom evolution within the most abundant toxin groups": rewrite the sentence for clarity and brevity.
- "After a thorough manual curation": explain clearly what this manual curation process was and its purpose.
- "contiguous tandem repeat SVMPs for": change "repeat" to "array", because "tandem repeat" has a different meaning in the genomics research context.
- "flanked by the NEFL and NEFM": it is unclear whether these are both 5' or 3' of the toxin genes. Clarify.
- "Microsyntenic analyses showed": change to local synteny.
- "gene copy number variation between": since these are duplicate copies, clearly state how the gene copies were identified. Include details of open reading frames, exon structures, pseudogene status, etc.
- "we can see an expansion in": describe the number of new copies, their status as intact or not, and the sequence similarity between the copies. Provide evidence that there is no false duplication due to heterozygous allele collapse in the assembly.
- "More genomic data will indicate if SVMP12": did you mean SVMP13?
- "This difference may be expected, as PLA2 only represents around 5% of the proteome for Cerastes gasperettii": this is not true. The proteome does not equal the genome in some cases, and superficial inferences such as this are not warranted.
- For the PSMC analyses, please discuss the effect of the mutation rate and generation time.

Figures:

- Figure 1: add y-axis scales to the circos plot. The Figure 1B legend says it is a linkage map, but it looks more like a Hi-C contact map; please edit. The Figure 1B legend also says "including the sex chromosomes", which is not consistent with the circos plot.
- Figures 3A and 3B: Figure 3A refers to the transcriptome and 3B to the proteome. Please make this very clear.
- Figures 4A, C and E: label the genes consistently with the phylogenetic trees in the supplementary figures so that readers can relate their genomic arrangements.
- Figure S4: discuss why the CG1 sample separates from the rest of the samples. It seems like a batch effect.
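The base-pair arithmetic behind the reviewer's depth comment is simply raw yield = depth × genome size:

```latex
\text{raw yield (bp)} = \text{depth} \times \text{genome size}, \qquad 40 \times 2\,\text{Gbp} = 80\,\text{Gbp}
```

which is why the reviewer asks for absolute base counts: the multiplier alone is ambiguous until the genome size is known.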

    2. Venoms have traditionally been studied from a proteomic and/or transcriptomic perspective, often overlooking the true genetic complexity underlying venom production. The recent surge in genome-based venom research (sometimes called “venomics”) has proven to be instrumental in deepening our molecular understanding of venom evolution, particularly through the identification and mapping of toxin-coding loci across the broader chromosomal architecture. Although venomous snakes are a model system in venom research, the number of high-quality reference genomes in the group remains limited. In this study, we present a chromosome-resolution reference genome for the Arabian horned viper (Cerastes gasperettii), a venomous snake native to the Arabian Peninsula. Our highly-contiguous genome allowed us to explore macrochromosomal rearrangements within the Viperidae family, as well as across squamates. We identified the main highly-expressed toxin genes compousing the venom’s core, in line with our proteomic results. We also compared microsyntenic changes in the main toxin gene clusters with those of other venomous snake species, highlighting the pivotal role of gene duplication and loss in the emergence and diversification of Snake Venom Metalloproteinases (SVMPs) and Snake Venom Serine Proteases (SVSPs) for Cerastes gasperettii. Using Illumina short-read sequencing data, we reconstructed the demographic history and genome-wide diversity of the species, revealing how historical aridity likely drove population expansions. Finally, this study highlights the importance of using long-read sequencing as well as chromosome-level reference genomes to disentangle the origin and diversification of toxin gene families in venomous species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer: Blair Perry

Mochales-Riano et al. present a high-quality genome assembly for the Arabian horned viper and provide a suite of genomic analyses related to synteny, toxin gene evolution and expression, genomic diversity, and demographic history of this and related species. This species is a valuable addition to existing snake genome resources given its medical significance and the current underrepresentation of genomes for Viperidae. I also appreciate that the authors sequenced the heterogametic sex and successfully assembled both sex chromosomes. I do have a few questions and concerns about the manuscript in its current form that I highlight below. Most notably, I find the arguments throughout the manuscript about toxin gene copy number correlating with proteomic abundance to be poorly supported and generally problematic given the data and analyses that the authors present. I suggest that the authors reevaluate these claims, and either provide additional analyses in an effort to support these claims or otherwise remove them from the manuscript, as I don't think they are ultimately crucial to the value of this genome report.

      Introduction:

I find the argument being made in the sentence beginning "Previous works have shown that changes in gene regulation" a bit confusing. Rather than arguing that studying the expression of venom genes is "insufficient," I think this instead argues that transcriptomic and proteomic data are critical for studying venom in conjunction with annotated genome sequence. You could, for example, have a species with 20 copies in a particular tandem array, but only two of them are ever expressed at biologically meaningful levels and thus contribute proteins to the excreted venom. Knowing both the total number of copies in the genome and the number that are actually contributing to the venom proteome is valuable and necessary for understanding the evolution of that gene family, its role and significance in venom phenotypes, etc.

I'm also not sure I follow the logic of the next sentence. Why exactly would the identification of specifically "unexpressed" toxin genes be particularly notable for antivenom, drug discovery, therapeutics, etc.?

"We deciphered numerous genomic attributes of this species including its genetic diversity and failed to find evidence of inbreeding": lack of inbreeding is never discussed in the context of the heterozygosity results, but is pitched here as a major result of the paper. Did the authors have a priori expectations regarding inbreeding in this species?

Methods:

"Gene counts per gene…": should this be "Gene expression counts per gene…"?

Venom gland RNA-seq data was generated from three animals, but proteomic data was generated from a pool of two other animals. This is not ideal for linking gene expression to venom proteome composition, where you would really want venom collected from the same animals from which you are getting venom gland RNA. This is especially true if there is intraspecific variation in venom phenotypes within this species. The latitude and longitude are not provided for the two proteome samples. Were these collected from the same latitude and longitude as the RNA-seq animals?

For the analyses of heterozygosity, the authors map WGS data from diverse species against the Cerastes reference and call variants. Why was this approach chosen over mapping the data for each species to either that species' own reference (i.e., C. viridis and N. naja) or a more closely related species for those without a reference? Presumably that would reduce the potential influence of reference bias on these estimates of heterozygosity?

Results:

"Toxin genes usually found in venomous snakes (see proteome results below) were mainly found on macrochromosomes, although major toxin groups were found on microchromosomes (SVMPs, SVSPs and PLA2; Fig. 1)": this feels a bit contradictory. Maybe just state that toxin genes were found on both macro- and microchromosomes?

"Finally, we also found a battery of 3FTxs and myotoxin-like genes, but they were not represented in our RNA-seq dataset (see below)." The authors do not further discuss this result as implied by "(see below)," unless that was simply referring to subsequent discussion of RNA-seq data. From what I can tell, these are also not present in the proteomic data, correct?

"The venom gland transcriptome contained a total of 7,237 genes expressed (TPM > 500), including a total of 65 putative toxin genes. Differential gene expression analyses revealed a total of 161 genes (33 putative toxin genes) that were differentially upregulated (FC > 2 and 1% FDR) in venom glands compared to other tissues (Fig. 3A)." Figure 3A only shows 10 toxin genes with "unique" expression in the venom gland, not the 161 upregulated genes as implied here. The authors should add a heatmap with these 161 genes to the supplement, if not to Figure 3 (guessing it might not fit).

Fig 3: The authors do not discuss the lack of unique/upregulated expression evidence for PLA2s and Disintegrins in Fig 3A, despite their contribution to protein composition in Fig 3B. For disintegrins in particular, they represent a higher proportion of the venom proteome than CTLs and CRISPs, yet there is no evidence presented for high expression of these genes. What do the authors think is going on here? Could this be a technical issue related to the processing of the RNA-seq data, perhaps related to the small size of these genes? Alternatively, could this be indicative of a mismatch between the venom phenotypes of the animals used to generate transcriptomic versus proteomic data? In the text, the authors state "These genes, together with other SVMPs, SVSPs, Disintegrins (DISI) and Ctype lectins (CTL), were highly expressed in the venom gland and form the core toxic effector components of the venom", but again there is no presented evidence for DISI expression in particular. Are these genes included in the 161 genes upregulated in the venom gland?

The authors only present proteomic data in the form of a pie chart of overall composition grouped by toxin family (Fig 3B). Does the proteomic data generated here provide individual gene-level proteomic abundance estimates? If so, this would be valuable to include, especially in support of the authors' claims about gene copy number being correlated with protein abundance. For example, in Figure 3, SVMP9 and SVMP10, and to a lesser extent SVMP13, are highly expressed and therefore possibly/likely the major contributors to SVMPs in the proteome. Is the SVMP section of the pie chart in Fig 3B dominated by proteins from these 3 genes?

"We studied venom evolution within the most abundant toxin groups (i.e., SVMPs and SVSPs, as well as PLA2)." PLA2s are a relatively low proportion of the venom proteome in Fig 3B, and are not present in the expression heatmap in Fig 3A. Why were these chosen for further investigation over CTL, CRISP, DISI, etc.?

"The amplification of SVMP copy numbers is consistent with proteomic results, as SVMPs were the second most abundant component…". Related to my comment above, are all/many of these copies expressed in proteomic, or at least transcriptomic, data? As the data is currently presented, it appears that a small number of SVMPs are highly expressed and thus likely contributing to the proteome. This does not support, and might in fact contradict, the authors' claim that there is an association between increased copy number and contribution to the proteome.

Related to this, and more generally, the authors do not present a convincing argument for the relationship between gene copy number and the resulting percentage of a given toxin gene family in the proteome. If copy number is directly related to the resulting amount of a toxin in the proteome, the authors would need to show that many/all of those copies are expressed in the transcriptomic data, and that proteins produced from those genes are present and contributing to the venom proteome (beyond just the total percentage for the family). Further, making any links between copy number and percent overall composition in the proteome is problematic, because it is inherently impacted by the copy number variation and expression of all the other toxin genes. You could, in theory, have copy number expansion in a species where all the genes are expressed and contribute to the proteome, but no overall change in the percent of that toxin family in the proteome if other toxin families have also expanded and/or are expressed more highly. Related to this, there is currently no obvious baseline to compare against in order to claim that expansion has resulted in higher venom proteome composition (i.e., a situation where we have fewer SVMP gene copies and a corresponding lower percentage of SVMP proteins in the venom proteome). This would potentially require comparison across species and/or populations with differing copy number, etc.

My concerns above also apply to the interpretation of the SVSP results: "The high number of SVSP genes found (although lower than in Crotalus adamanteus) were in line with the proteomic results, as SVSPs are the most abundant toxin in the proteome (Fig. 3B)." Further, C. adamanteus has a larger number of SVSP genes than C. gasperettii, yet a lower percent composition of SVSPs in the proteome (Margres et al. 2014), emphasizing my concerns about associating copy number and percent composition.

Could the two large Group 2 SVSPs in Fig 4E be misannotations of multiple genes? Looking at the adamanteus genes above these, there are genes starting and ending at roughly the same positions as the start and end of these large SVSPs, making me wonder if multiple Cerastes genes were annotated as one. In my own experience, I have seen similar situations where FGENESH+ was fed a large region containing multiple genes and annotated multiple genes together as one, so it might just be worth double-checking that that hasn't happened here. Alternatively, could these be gene fusions? If that's the case, that would presumably complicate the gene tree analyses, correct? I.e., these genes would probably need to be excluded from those analyses.
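For readers tracking the thresholds under discussion (TPM > 500 for "expressed"; FC > 2 at 1% FDR for "upregulated"), a minimal sketch of such a filter; the table, values, and column names are hypothetical, not the paper's pipeline output:

```python
import pandas as pd

# Hypothetical per-gene summary table; column names are assumptions.
df = pd.DataFrame({
    "gene": ["SVMP9", "SVMP10", "CTL1"],
    "tpm_venom_gland": [5200.0, 3100.0, 420.0],
    "log2_fc": [3.2, 2.8, 0.9],  # venom gland vs. other tissues; FC > 2 means log2FC > 1
    "fdr": [0.001, 0.004, 0.20],
})

expressed = df[df["tpm_venom_gland"] > 500]  # TPM > 500
upregulated = expressed[(expressed["log2_fc"] > 1) & (expressed["fdr"] < 0.01)]
print(upregulated["gene"].tolist())  # ['SVMP9', 'SVMP10']
```

The reviewer's point is that per-copy membership in this upregulated set, not family-level totals, is what would tie copy number to proteome abundance.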

    3. Venoms have traditionally been studied from a proteomic and/or transcriptomic perspective, often overlooking the true genetic complexity underlying venom production. The recent surge in genome-based venom research (sometimes called “venomics”) has proven to be instrumental in deepening our molecular understanding of venom evolution, particularly through the identification and mapping of toxin-coding loci across the broader chromosomal architecture. Although venomous snakes are a model system in venom research, the number of high-quality reference genomes in the group remains limited. In this study, we present a chromosome-resolution reference genome for the Arabian horned viper (Cerastes gasperettii), a venomous snake native to the Arabian Peninsula. Our highly-contiguous genome allowed us to explore macrochromosomal rearrangements within the Viperidae family, as well as across squamates. We identified the main highly-expressed toxin genes compousing the venom’s core, in line with our proteomic results. We also compared microsyntenic changes in the main toxin gene clusters with those of other venomous snake species, highlighting the pivotal role of gene duplication and loss in the emergence and diversification of Snake Venom Metalloproteinases (SVMPs) and Snake Venom Serine Proteases (SVSPs) for Cerastes gasperettii. Using Illumina short-read sequencing data, we reconstructed the demographic history and genome-wide diversity of the species, revealing how historical aridity likely drove population expansions. Finally, this study highlights the importance of using long-read sequencing as well as chromosome-level reference genomes to disentangle the origin and diversification of toxin gene families in venomous species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Jiatang Li

In the manuscript entitled 'Chromosome-level reference genome for the medically important Arabian horned viper (Cerastes gasperettii)', the authors assembled a high-quality chromosome-level reference genome for the Arabian horned viper (Cerastes gasperettii), a special Viperid species, which is an important data resource. Combined with multi-omics data, the authors characterized the genome, conducted an analysis of the toxin gene families, and identified a novel SVMP gene. The research is of great significance for revealing the origin and diversification of snake venom. Overall, I think the science and findings of the study are meaningful and merit publication, but in its current form there are some issues that should be noted: 1. It should be noted that Fig. 1 and Fig. 2 both have unidentified border lines.

2. In all phylogenetic trees presented in the manuscript, it would be better for the authors to indicate all species information.

3. I'm curious whether the authors considered timing differences in sampling, for example differences in venom glands after venom harvest versus in the resting state, which could affect the analyses, especially the transcriptome.

4. In the transcriptomics section, the authors stated that the batch effect of CG1 was due to the low mapping of that sample to the reference genome. This reads as a misinterpretation to me, as CG1 itself is the genome sequencing sample. The authors should explain this further.

5. The authors need to ensure that all data generated by the manuscript are accessible; information about the data is not currently available.

6. Please check the references to ensure that the formatting meets the publisher's requirements, e.g., some Latin species names require italics.

    1. We outline the development of the Health Data Nexus, a data platform which enables data storage and access management with a cloud-based computational environment. We describe the importance of this secure platform in an evolving public sector research landscape that utilizes significant quantities of data, particularly clinical data acquired from health systems, as well as the importance of providing meaningful benefits for three targeted user groups: data providers, researchers, and educators. We then describe the implementation of governance practices, technical standards, and data security and privacy protections needed to build this platform, as well as example use-cases highlighting the strengths of the platform in facilitating dataset acquisition, novel research, and hosting educational courses, workshops, and datathons. Finally, we discuss the key principles that informed the platform’s development, highlighting the importance of flexible uses, collaborative development, and open-source science.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf050 ), which carries out open, named peer-review. The following review is published under a CC-BY 4.0 license:

      Reviewer: Hollis Lai

The purpose of the paper is to demonstrate the adoption of PhysioNet as a medical data sharing platform. The authors outlined the process, workflow, and approval chain required to facilitate this. The manuscript also provided initial data use and adoption figures to demonstrate the feasibility of such a platform. This is a difficult subject to publish: the authors do demonstrate the use of the platform, but it is difficult to present this subject on a scientific basis.

1. The authors describe the data lake required for sharing medical data and do a good job of describing the administrative processes required for such a data lake. However, how does this differ from the literature on other platforms? Why was this platform adopted and not other approaches? What does this adoption provide that other approaches did not consider or would need to know? I think there is an established literature on health data sharing platforms that the authors should acknowledge, highlighting how this approach is needed to address these issues.

2. The authors highlight adoption data, but no evaluation data was solicited or provided. Such information would be helpful for evaluating how this creation could be replicated. I think there are many great use cases for this outcome, but very little is discussed on how it could be applied in the field. For example, is this a methods paper promoting adoption of the platform, or a paper demonstrating how others can develop similar platforms?

3. There was actually no relation to AI other than the use of the data holdings for AI training. The data holdings make sense for UToronto, as the process and approvals are built on local institutional requirements. I tried to access the system as an external user and found it intuitive. But beyond building this platform for UToronto to hold data for UToronto researchers, are there any plans or processes for adopting holdings from other institutions? How should other users perceive this information? Could other holdings, such as administrative data, be used?

I think the presentation of the article has merit, but more needs to be done to capture what has already been done in the field and why this solution also needs to be presented (contribution to the field).

    1. Background Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.Results Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.Conclusions Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf049), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Zexuan Zhu

The paper presents an encoding of VCF data using Zarr to enable fast retrieval of subsets of the data. A vcf2zarr conversion tool is provided and validated on both simulated and real-world data sets. The topic of this work is interesting and of good value; however, the experimental studies and contributions should be considerably improved.

1. The proposed method is simply a conversion from VCF to the Zarr format. Since both are existing formats, the contributions and originality of this work are not impressive.

2. The compression and query performance is the main concern of this work. The method should be compared with other state-of-the-art queriable VCF compressors such as GTC, GBC, and GSC:
Danek A, Deorowicz S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics, 2018;34(11):1834-1840.
Zhang L, Yuan Y, Peng W, Tang B, Li MJ, Gui H, et al. GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species. Genome Biology, 2023;24(1):1-22.
Luo X, Chen Y, Liu L, Ding L, Li Y, Li S, Zhang Y, Zhu Z. GSC: efficient lossless compression of VCF files with fast query. GigaScience, 2024;13:giae046.

3. The method should be evaluated on more real VCF data sets.
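As context for the query-performance discussion, here is a minimal sketch of the column-oriented access pattern VCF Zarr is designed for, assuming a store converted with vcf2zarr and the draft spec's array names (call_genotype with dimensions variants x samples x ploidy, variant_position); the store path is hypothetical.

import numpy as np
import zarr

# Open a hypothetical vcf2zarr-converted store; only the fields we touch
# are read from disk, chunk by chunk.
root = zarr.open("example.vcz", mode="r")
gt = root["call_genotype"]          # chunked integer array, -1 = missing call
pos = root["variant_position"][:]   # one small per-variant column

# Alternate allele frequency per variant, computed one chunk of variants
# at a time without decoding any other VCF field.
af = np.empty(gt.shape[0])
step = gt.chunks[0]
for start in range(0, gt.shape[0], step):
    g = gt[start:start + step]                      # (variants, samples, ploidy)
    called = (g >= 0).sum(axis=(1, 2))
    af[start:start + step] = (g > 0).sum(axis=(1, 2)) / np.maximum(called, 1)
print(pos[:5], af[:5])

This kind of field-wise scan is exactly what a row-encoded VCF cannot do without parsing every record in full, which is the efficiency argument the paper makes.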

    2. Background Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.Results Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.Conclusions Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf049), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Nezar Abdennur

The authors present VCF Zarr, a specification that translates the variant call format (VCF) data model into an array-based representation for the Zarr storage format. They also present the vcf2zarr utility to convert large VCFs to Zarr. They provide data compression and analysis benchmarks comparing VCF Zarr to existing variant storage technologies using simulated genotype data. They also present a case study on real-world Genomics England aggV2 data.

The authors' benchmarks overall show that VCF Zarr has superior compression and computational analysis performance at scale relative to data stored as row-oriented VCF, and that VCF Zarr is competitive with specialised storage solutions that require similarly specialised tools and access libraries for querying. An attractive feature is that VCF Zarr allows for variant annotation workflows that do not require full dataset copy and conversion. Another key point is that Zarr is a high-level spec and data model for the chunked storage of n-d arrays, rather than a byte-level encoding designed specifically around the genomic variant data type. I personally have used Zarr productively for several applications unrelated to statistical genetics. While VCF Zarr mildly underperforms some of the specialized formats (Savvy in compute, Genozip in compression) in a few instances, I believe the accessibility, interoperability, and reusability gains of Zarr make the small tradeoff well worthwhile. Because Zarr has seen heavy adoption in other scientific communities like the geospatial and Earth sciences, and is well integrated in the scientific Python stack, I think it holds potential for greater reusability across the ecosystem. As such, I think the VCF Zarr spec is a highly valuable if not overdue contribution to an entrenched field that has recently been confronted by a scalability wall.

Overall, the paper is clear, comprehensive, and well written. Some high-level comments:

* The benefits for large scientific datasets to be analysis-ready cloud-optimized (ARCO) have been well articulated by Abernathey et al., 2021. However, I do think that the "local"/HPC single-file use case is still important and won't disappear any time soon, and for some file system use cases, expansive and deep hierarchies can be performance limiting (this was hinted at in one of the benchmarks). In this scenario would a large VCF Zarr perform reasonably well (or even better on some file systems) via a single local zip store?
* The description of the intermediate columnar format (ICF) used by vcf2zarr is missing some detail. At first I got the impression it might be based on something like Parquet, but running the provided code showed that it consists of a similar file-based chunk layout to Zarr. This should be clarified in the manuscript.
* The authors discuss the possibility of storing an index mapping genomic coordinates to chunk indexes. Have Zarr-based formats in other fields like geospatial introduced their own indexing approaches to take inspiration from?
* Since VCF Zarr is still a draft proposal, it could be useful to indicate where community discussions are happening and how potential new contributors can get involved, if possible. This doesn't need to be in the paper per se, but perhaps documented in the spec repo.

Minor comments:

* In the background: "For the representation to be FAIR, it must also be accessible" -- A is for "accessible", so "also" doesn't make sense.
* "There is currently no efficient, FAIR representation..." -- Just a nit and feel free to ignore, but the solution you present is technically "current".
* In Figure 2, the zarr line is occluded by the sav line and hard to see.
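On the reviewer's zip-store question: zarr-python (v2) can serve an entire hierarchy from a single ordinary zip file, which sidesteps deep directory trees on shared file systems. A minimal sketch, assuming the hypothetical store from above has been zipped (reads are supported; ZipStore writes are append-only):

import zarr

# One file on disk instead of thousands of chunk files; opening is cheap
# and reads can proceed concurrently.
store = zarr.ZipStore("example.vcz.zip", mode="r")
root = zarr.open(store, mode="r")
print(root["variant_position"][:10])
store.close()

Whether this matches extracted-directory performance depends on the file system and access pattern, which is presumably what the reviewer is asking the authors to benchmark.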

  5. Jun 2025
1. Pigs are crucial sources of meat and protein, valuable animal models, and potential donors for xenotransplantation. However, the existing reference genome for pigs is incomplete, with thousands of segments and missing centromeres and telomeres, which limits our understanding of the important traits in these genomic regions. To address this issue, we present a near-complete genome assembly for the Jinhua pig (JH-T2T), constructed using PacBio HiFi and ONT long reads. This assembly includes all 18 autosomes and the X and Y sex chromosomes, with only six gaps. It features annotations of 46.90% repetitive sequences, 35 telomeres, 17 centromeres, and 23,924 high-confidence genes. Compared to Sscrofa11.1, JH-T2T closes nearly all gaps, extends sequences by 177 Mb, predicts more intact telomeres and centromeres, and gains 799 genes while losing 114. Moreover, it enhances the mapping rate for both Western and Chinese local pigs, outperforming Sscrofa11.1 as a reference genome. Additionally, this comprehensive genome assembly will facilitate large-scale variant detection and enable the exploration of genes associated with pig domestication, such as GPAM, CYP2C18, LY9, ITLN2, and CHIA. Our findings represent a significant advancement in pig genomics, providing a robust resource that enhances genetic research, breeding programs, and biomedical applications.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf048), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Revision 2 version

      Reviewer 2: Benjamin D Rosen

      The first near-complete genome assembly of pig: enabling more accurate genetic research.

      General comments:

      The authors have clarified how their HiC manual curation efforts were able to remove gaps from the assembly. This was my only remaining major issue. I only have a few minor comments remaining.

      Minor comments:

      Line 1 - Title: "A Near Telomere-to-Telomere Genome Assembly of the Jinhua Pig"

      Line 369 - replace "only 6 gaps left in our final JH assembly" with "only 6 gaps remain in our final JH assembly"

      Line 370 - Figure S5 needs a more detailed legend

Line 405 - I just noticed this, but are the authors proposing that chr9 has 2 centromeres? Given the known pig karyotype (metacentric chr9), it seems more likely that they have identified some other form of tandem repeat at the beginning of chr9.

2. Pigs are crucial sources of meat and protein, valuable animal models, and potential donors for xenotransplantation. However, the existing reference genome for pigs is incomplete, with thousands of segments and missing centromeres and telomeres, which limits our understanding of the important traits in these genomic regions. To address this issue, we present a near-complete genome assembly for the Jinhua pig (JH-T2T), constructed using PacBio HiFi and ONT long reads. This assembly includes all 18 autosomes and the X and Y sex chromosomes, with only six gaps. It features annotations of 46.90% repetitive sequences, 35 telomeres, 17 centromeres, and 23,924 high-confidence genes. Compared to Sscrofa11.1, JH-T2T closes nearly all gaps, extends sequences by 177 Mb, predicts more intact telomeres and centromeres, and gains 799 genes while losing 114. Moreover, it enhances the mapping rate for both Western and Chinese local pigs, outperforming Sscrofa11.1 as a reference genome. Additionally, this comprehensive genome assembly will facilitate large-scale variant detection and enable the exploration of genes associated with pig domestication, such as GPAM, CYP2C18, LY9, ITLN2, and CHIA. Our findings represent a significant advancement in pig genomics, providing a robust resource that enhances genetic research, breeding programs, and biomedical applications.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf048), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Revision 1 version

      Reviewer 1: Martien Groenen

      In their revised version of the manuscript, the authors have addressed all my major concerns raised in my earlier review and have made the many editorial edits as suggested. I only have a few (mostly editorial) comments for the revised version. The most important one is the title of the manuscript. I realize I did not mention this in my earlier review, but I think the title is not very appropriate and could be more informative. I suggest something like "A telomere-to-telomere genome assembly of the Jinhua pig"

Minor editorial comments: Line 40: Replace "provides" by "provide", "genome" by "genomes", and "JH" by "Jinhua". Lines 50-51: "This study produced a gapless and near-gapless assembly of the pig genome, and provides a set of diploid JH reference genome." should be changed to something like "This study produced a near-gapless assembly of the pig genome and provides a set of haploid Jinhua reference genomes." Line 177: Change "with with" to "with". Line 194: Replace "population" by "populations". Lines 232-233: Referring to human as a "closely related species" is rather awkward and not correct. I suggest replacing this with "eleven other mammals". Lines 299, 301 and 303: Insert "of" after "consisting". Line 317: Insert "and" before "2.33 Gb". Line 319: Insert "and" before "2.17 Gb". Lines 320-321: Change to "The more continuous contigs of the two assemblies were selected to construct the final haploid assemblies". Line 323: Replace "assembly" by "assembler". Line 354: Delete "ranging". Lines 358-359: Change "The average properly mapped rate" to "The average rate of properly mapped reads". Line 379: Insert "respectively" after "60.07". Line 380: "suggested" (remove space). Line 385: Change "indicate a gapless and near-gapless" to "indicate a near-gapless". Line 455: Change "were overlapped with" to "were overlapping with". Lines 557-559: The sentence "The insertion found in the SLA-DOB gene, which serves to enhance the immune system's response and is relevant to transplant rejection" seems incomplete and sounds awkward. Perhaps you mean something like "The insertion found in SLA-DOB, a gene involved in enhancing the immune system's response to infection, might be relevant in relation to transplant rejection".
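As an aside on the "rate of properly mapped reads" metric discussed in these line edits, the fraction of primary reads flagged as mapped in a proper pair is conventionally computed as below; a minimal pysam sketch mirroring the "properly paired" percentage from samtools flagstat (the BAM path is hypothetical, not from the paper):

import pysam

# Count primary mapped reads and those the aligner flagged as properly paired.
bam = pysam.AlignmentFile("reads_vs_assembly.bam", "rb")  # hypothetical path
total = proper = 0
for read in bam.fetch(until_eof=True):
    if read.is_secondary or read.is_supplementary or read.is_unmapped:
        continue
    total += 1
    if read.is_proper_pair:
        proper += 1
print(f"properly paired: {proper}/{total} = {100 * proper / total:.2f}%")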

      Reviewer 2: Benjamin D Rosen

      The first near-complete genome assembly of pig: enabling more accurate genetic research

General comments: I thank the authors for addressing most of my points and providing more details on the parameters they have used. Unfortunately, I still have some unanswered questions regarding the methodology. My current understanding from the authors' responses to my previous comments leads me to believe that the assembly has been scaffolded incorrectly. If the authors did indeed use HiC data to place 8 contigs into gaps and then joined those contigs without placing gaps at the joins or doing any further gap filling, that calls into question the validity of the assembly. Finally, the language needs further improvement for readability.

Specific comments: Line 85 - *will contribute to. Lines 187-191 - HiC interaction maps do not provide information for gap filling. Either this has been explained insufficiently, or it has been done incorrectly. Placing assembled sequences in the correct order does not mean that it is okay to join them without a gap. It is necessary to return to the gap filling procedure now that the contigs are in the correct order and attempt to fill them as done previously. Line 191 - Figure S3 - These HiC contact maps are not very informative; they need to be labeled and have a scale bar. Additionally, contact maps can have a lack of signal due to a gap in the sequence or due to multimapping reads in repetitive regions being filtered, so it's not clear what they are trying to show in A-C. The authors' reply to my previous concern regarding the labeling of this figure does not help; furthermore, the figure legend in the supplemental materials is still insufficient. I think I understand that panels D and E are chr3 before and after misassembly correction; it would be helpful if the two panels were at the same scale. I still don't know why panel F is shown, how is it related to panel C? And I don't see any red ellipses indicated by the legend. Line 275 - "ensemble from Duroc pigs" is incorrect. It is an "assembly of a Duroc pig". Lines 299, 301, 303 - "containing" not "consisting". Lines 306-308 - Again, HiC data orders and orients contigs, but it does not fill gaps. Please clarify how the assembly was reduced from 14 gaps to 6 gaps with HiC data. Was an additional round of gap filling performed? Lines 313-314 - How is the contig N50 larger than the scaffold N50 above? Lines 335-336 - Does this refer to the Merqury analysis? I don't think "using mapped K-mers" is correct here, please reword. Lines 367-368 - what does it mean that "8 out of 63 gaps were corrected"? Is this from the HiC ordering of contigs? Line 369 - what does the mapping between Sscrofa11.1 and JH-T2T shown in figure S6 have to do with the JH-T2T gap filling being described here? Line 369 - I previously asked about this supplemental table only containing 55 entries. The authors' response "The other filled 8 gaps were resolved through adjustments made to the Hi-C map to correct misassembles. As a result, these gaps cannot be precisely located within the existing order of the assembly." indicates that contigs must have been incorrectly joined solely based on the HiC signal between contigs. The authors must know what contigs were added or joined to form the final assembly. It would be trivial to align the two assembly versions and identify the positions of the old contigs in the new assembly. I believe that these incorrectly joined contigs should be broken and put through the same gap filling procedure as performed earlier. Lines 375-378 - Dramatic coverage changes in read mappings as found in these figures are usually indicative of assembly errors. I do not agree that "These findings confirmed the accuracy and reliability" of the assembly. I suggest replacing the last sentence with something more measured such as "Although supported by some read data, the inconsistency of coverage across these gap filled regions suggests that caution should be used when interpreting findings in these regions; cross-referencing results with the gap positions (Supplementary Table S9) is advised."
Line 375 - "evidenced by fully coverage": remove "fully", it isn't proper usage of the word, and I wouldn't interpret the low coverage in many of these regions as "full coverage". Line 385 - should read "Overall, our assembly quality metrics indicate a near-gapless assembly of the pig genome". Line 390 - should read "a gapless T2T sequence for 16 out of 20". Line 396 - Supplemental Table S10, not S9. Lines 398-399 - according to supplemental table S4 and figure 3A, chromosome 2 also has a single telomere. Line 402 - the centromeres are not marked in Figure 3A. Line 402 - Figure S8 - please rename chr19 and chr20 to chrX and chrY. Line 406 - "at early research": unclear what is meant by this, please reword. Line 423 - as indicated on line 397, 33 telomeres were identified, not 35. Line 426 - "The JH-T2T assembly IDENTIFIED 17 centromeres". Line 450 - "are located in". Line 453 - "these SVs are located in". Line 455 - "Moreover, 12,129 genes overlap these SVs". Line 502 - "which contained 544 gaps". Line 841 - Figure 2 legend description is still incorrect. Only A is mapping rates; B and C are PM rates and base error rates.

3. Pigs are crucial sources of meat and protein, valuable animal models, and potential donors for xenotransplantation. However, the existing reference genome for pigs is incomplete, with thousands of segments and missing centromeres and telomeres, which limits our understanding of the important traits in these genomic regions. To address this issue, we present a near-complete genome assembly for the Jinhua pig (JH-T2T), constructed using PacBio HiFi and ONT long reads. This assembly includes all 18 autosomes and the X and Y sex chromosomes, with only six gaps. It features annotations of 46.90% repetitive sequences, 35 telomeres, 17 centromeres, and 23,924 high-confidence genes. Compared to Sscrofa11.1, JH-T2T closes nearly all gaps, extends sequences by 177 Mb, predicts more intact telomeres and centromeres, and gains 799 genes while losing 114. Moreover, it enhances the mapping rate for both Western and Chinese local pigs, outperforming Sscrofa11.1 as a reference genome. Additionally, this comprehensive genome assembly will facilitate large-scale variant detection and enable the exploration of genes associated with pig domestication, such as GPAM, CYP2C18, LY9, ITLN2, and CHIA. Our findings represent a significant advancement in pig genomics, providing a robust resource that enhances genetic research, breeding programs, and biomedical applications.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf048), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Original version

      Reviewer 1: Martien Groenen

The manuscript describes the T2T genome assembly for the Chinese pig breed Jinhua, which presents a vast improvement compared to the current reference genome of the Duroc pig TJTabasco (build 11.1). The results and methodology used for the assembly are described clearly, and the authors show the improvement of this assembly through a detailed comparison with the current reference 11.1. While clearly of interest to be published, several aspects of the manuscript should be improved. Most of these changes are minor modifications or inaccuracies in the presentation of the results.

      However, there are two major aspects that need further attention:

1. The T2T assembly presented represents a combination of the two haplotypes of the pig sequenced. I am surprised that the authors did not also develop two haplotype-resolved assemblies of this genome. Haplotype-resolved assemblies will be the assemblies of choice for future development of a reference pan-genome for pigs. The authors describe that they have sequenced the two parents of the sequenced F1 individual, so why did they not use the trio-binning approach to also develop haplotype-resolved assemblies? I think adding these to the manuscript would be a vast improvement for this important resource.

2. The results described for the identification of selective sweep regions are not very convincing. This analysis shows differences in the genomes of two breeds: Duroc and Jinhua. However, these breeds have a very different origin, stemming from domestication of wild boars that diverged 1 million years ago, followed by the development of a wide range of different breeds selected for different traits. Therefore, the comparison made by the authors cannot distinguish between differences in evolution of Chinese and European Wild Boar, more recent selection after breed formation, and even drift. To be able to do so, these analyses would need the inclusion of additional breeds and wild boars from China and Europe. Alternatively, the authors can decide to tone down this part of the manuscript or even delete it altogether, as it does not add to the major message of the manuscript.

Minor comments: Line 34: Change the sentence to: "with thousands of segments and centromeres and telomeres missing". Line 37: Insert "and Hi-C" after "long reads". Line 46: Delete "such as GPAM, CYP2C18, LY9, ITLN2, and CHIA". Line 54: Insert "potential" before "xenotransplantation". Line 82: Delete "in response to the gap of a T2T-level pig genome" as this does not add anything and the use of "gap" in this context is confusing. Line 93: Change "The fresh blood" to "Fresh blood". Line 100: The authors need to provide a reference for the SDS method. Lines 152-153, line 444, and table S6: This is confusing. The authors mention genotypes from 939 individuals, but the table shows that they have used WGS data. You need to describe how the WGS data was used to call the genotypes for these individuals. Furthermore, in line 444 you mention 289 JH pigs and 616 DU pigs, which together is 905. What about the other 34 individuals shown in table S6? Line 244: Replace "were" by "was" and delete "the" before "fastp". Lines 287-292: Here you use several times "length of xx Gb and yy contigs". This is not correct, as the value for the contigs refers to a number and not a length. Rephrase, e.g., as "length of xx Gb and consisting of yy contigs". Line 294: The use of "bone" seems strange. Either use "backbone" or "core". Line 306: Replace "chromosome" by "genome". Lines 308-309: For the comment "Second, 16 of the 20 chromosomes were each represented by a single contig" you refer to figure 1D; however, from this figure it cannot be seen whether the different chromosomes consist of a single or multiple contigs. Line 346: Do you mean build 11.1 with "historical genome version"? If so, please use that instead. Line 349: "post-gap filled". Line 353: The largest gap is 35 kb, not 36 kb. Figures 2F-I should be better explained in the legends and the main text (lines 353-358). Line 378: For the 23,924 genes you refer to supp table S13. However, that table shows a list of SV-enriched QTL, not these genes. Furthermore, I checked all tables, but a table with all the protein-coding genes is missing. Line 380: For the 799 newly anchored genes, refer to table S10. Now you refer to table S17, which shows gene-enriched KEGG pathways. Lines 383-386: For the higher gene density in GC-rich regions, you refer to figure 1D, but it is impossible to see this correlation from figure 1D. For the density of genes and telomeres, you refer to figure 1G. However, that figure does not show gene densities, only repeat densities. Lines 406-407: This should be table S11. Lines 409-412: For this result you refer to table S11. However, that table only shows data for the gained genes, not the lost genes.
Lines 419-420: You refer to table S12 and figure 3B, but the information is only shown in figure 3B and not in table S12. Line 420: Replace "were" by "is". Line 422: Better to use "repeats" instead of "they". Line 425: "Moreover, 12,129 genes located in these SVs". Unclear what "these" refers to; I assume you mean genes that (partially) overlap with SVs? Also, this is an incomplete sentence (verb missing). Likewise, this number is not very meaningful, as many of these SVs are within introns. It is much more informative to mention for how many genes SVs affect the CDS. Line 433 and table S14: This validation is not clear at all. What exactly are these numbers that are shown? You also mention "greater than 1.00" but the table does not contain any number that is greater than 1.00. Line 435: "Table" not "Tables". Line 436: Change to "SVs with a length larger than 500 bp". The term "invalidate" in figure 3D is rather awkward. Better to use "not-validated" and "validated" in this figure. Line 449: This should be Table S16. Line 452: There is no Table S18. Lines 484-486: Change to "Similarly, in human, the use of the T2T-CHM13 genome assembly yields a more comprehensive view of SVs genome-wide, with a greatly improved balance of insertions and deletions [61]." Lines 500-501: Change to "For example, in human, the T2T-CHM13 assembly was shown to improve the analysis of global". Lines 517-528: This paragraph should be deleted, as these genes have already been annotated and described in previous genome builds including 11.1. Why discuss these genes here? Following that line of thinking, almost every gene of the 20,000 could be discussed. Line 532: "%" instead of "%%" and insert "which" after "SVs". Lines 537-542: These sentences should be deleted. It is common knowledge that second-generation sequencing is not very sensitive for identifying SVs. The authors also do not provide any results about dPCR. Line 544: "affect" rather than "harbor". Lines 544-547: This is repetitive and has been stated multiple times, so better to delete. Line 561: "which is serve to immune system's response and relevant to transplant rejection". This is an incorrect sentence and should be rephrased. Lines 562-568: I don't agree with this statement and suggest removing it from the discussion.

      Reviewer 2: Benjamin D Rosen

      The first near-complete genome assembly of pig: enabling more accurate genetic research. The authors describe the telomere-to-telomere assembly of a Jinhua breed pig. They sequenced genomic DNA from whole blood with PacBio HiFi and Oxford Nanopore (ONT) long-read technologies as well as Illumina for short reads. They generated HiC data for scaffolding from blood and extracted RNA from 19 tissues for short read RNAseq for gene annotation. A hifiasm assembly was generated with the HiFi data and scaffolded with HiC to chromosome level with 63 gaps. The scaffolded assembly was gap filled with contigs from a NextDenovo assembly of the ONT data bringing the gaps down to 14. Finally, the assembly was manually curated with juicebox somehow closing a further 8 gaps. This needs to be clarified. Standard assembly assessments were performed as well as genome annotation. The authors compared their assembly to the current reference, Sscrofa11.1, and called SVs between the assemblies. The SVs were validated with additional Jinhua and Duroc animals. They then identified signatures of selection present in some of the largest SVs.

      General comments: The manuscript is mostly easy to read but would benefit from further editing for language throughout. The described assembly appears to be high quality and quite contiguous. Although the authors do mention obtaining parental samples and claim the assembly is fully phased, there is no mention of how this was done. There are many additional places where the methods could be described more fully including the addition of parameters used.

Specific comments: Line 39 - Figure 1 only displays 34 telomeres, not 35. Additionally, I was only able to detect 33 telomeres using seqtk telo. Seqtk only reports telomeres at the beginning and end of sequences; digging further, the telomere on chr2 is ~59 kb from the end of the chromosome, perhaps indicating a misassembly. Lines 79-81 - there are not hundreds of species with gap-free genome assemblies, and reference 19 does not claim that there are. Line 82 - the assembly is not gap-free; replace with "nearly gap-free". Line 95 - were these parental tissue samples ever used? Lines 151-156 - this section would be better located below the assembly methods. Please number supplementary tables in order of their appearance in the text. Line 171 - please provide parameters used here and for all analyses. Lines 187-188 - how did rearranging contigs decrease the gaps? Was the same gap filling procedure used after HiC manual adjustments? Line 188 - Figure S3 - I don't understand the relationship between the panels nor what the authors are attempting to show. If panels A-C display chromosomes 2, 8, and 13, why does D display chr3? Both panels C and E are labeled chr13, but they look nothing alike. Are D-E whole chromosomes or zoomed-in views? Missing description of panel F. Lines 222-224 - why weren't pig proteins used? Ensembl rapid release has annotated protein datasets for 9 pig assemblies. Line 264 - although most will know this, make it clear that Sscrofa11.1 is an assembly of a Duroc pig. Line 292 - how was polishing performed? This is missing from the methods. Line 294 - should this read "selected it for the backbone of the genome assembly."? Lines 298-299 - methods? Line 314 - what is meant by "using mapped K-mers from trio Illumina PCR-free reads data"? Line 331 - accession numbers for assemblies would be useful. Line 333 - what is "properly mapped rate"? Do you mean properly paired mapping rate? Line 346 - what is the historical genome version? Line 349 - Supplemental Table S8 only has 55 entries, including the 6 remaining gaps. Where are the other 8 filled gaps located? Lines 350-358 - read depth displays wouldn't show the presence of clipped reads, which would indicate an improperly closed gap. It would be more convincing to display IGV windows containing these alignments showing that there are no clipped reads. Line 354 - Figure S5 needs a better legend. What is ref and what is own? Line 359 - the assembly is near-gapless. Line 359 - where is the data regarding assembly phasing? How was this determined to be fully phased? Line 363 - 16 of 20 chromosomes are gapless. Line 370 - only 33 telomeres were found at the expected location (end of the chromosome); if you count the telomere on chr2 59 kb from the end, then 34 telomeres were identified. Line 372 - chr13 also only has a single telomere. It does not have a telomere at the beginning. Line 372 - chr19 is chrX, correct? Line 374 - Figure 1G - It would be nice to have the centromeres marked on this plot (or in Figure 3A). Are the long blocks of telomeric repeats internal to the chromosomes expected? Line 423 - Figure 3A - there is no telomeric repeat at the beginning of chr4 or chrX. Line 431 - why were only 5 pigs of each breed used to validate SVs when hundreds of WGS datasets from the two breeds had been aligned? How were these 5 selected? Line 481 - Sscrofa11.1 only has 544 gaps. Line 492 - ONT data was used to fill more than 6 gaps. Gaps in the assembly were reduced from 63 to 14 using ONT contigs.
Lines 588-589 - please make your code publicly available through zenodo, github, figshare, or something similar. Lines 815-824 - Figure 2 - legend description needs to be improved. Only A is mapping rates; B and C are PM rates and base error rates. The color switch from A-C having European pigs in blue to D having JH-T2T in blue might confuse readers.
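Since both reviews count telomeres with seqtk telo, a rough Python equivalent of that check, scanning chromosome ends for the canonical vertebrate repeat, is sketched below. This is an illustration only, not the authors' pipeline; the FASTA path, window size, and density cutoff are all assumptions.

import pysam

MOTIFS = ("TTAGGG", "CCCTAA")   # canonical vertebrate telomere repeat, both strands
WINDOW = 10_000                 # inspect only chromosome ends, as seqtk telo does

fa = pysam.FastaFile("assembly.fa")   # hypothetical assembly FASTA
for chrom in fa.references:
    n = fa.get_reference_length(chrom)
    for label, start, end in (("start", 0, min(WINDOW, n)),
                              ("end", max(0, n - WINDOW), n)):
        seq = fa.fetch(chrom, start, end).upper()
        # Fraction of the window covered by exact motif copies (6 bp each).
        density = sum(seq.count(m) for m in MOTIFS) * 6 / max(len(seq), 1)
        if density > 0.4:       # arbitrary cutoff for calling a telomeric run
            print(f"{chrom}\t{label}\trepeat_density={density:.2f}")

A window-based scan like this would also surface the chr2 case the reviewer describes, where the repeat sits ~59 kb inside the chromosome and end-anchored tools miss it.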

    1. Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data pre-processing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs. Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any “sample size budget”, the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation. The proposed design and splitting approach (implemented in the Python package “AdaptiveSplit”) may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies.

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf036), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Revised 1 version

      Reviewer 1: Qingyu Zhao

Thanks to the authors for the thorough response. The only remaining comment is that some new supplementary figures (Figures 8-12) are not cited or explained in the main text (maybe I missed it?). Please make sure to discuss these supplementary figures in the main text, otherwise readers won't know they are there. The response reads "To provide even more insights, we now present the relationship between the internally validated scores at the time of stopping (I_{act}), the corresponding external validation scores and sample sizes, for all 4 datasets in supplementary figures 8-11. The figures show a relatively good correspondence between internally and externally validated performance estimates with all splitting strategies". What insights are given? What is meant by relatively good correspondence between internal and external performance? All I see in those figures are some normally distributed scatter plots, so it needs a better explanation.

      Reviewer 2: Lisa Crossman

I previously reviewed this MS and all the comments I made were answered in full. I would be pleased to recommend publication. I was fully able to replicate the adaptive split results from the GitHub repo. I have only one comment: I received several warnings of "RuntimeWarning: divide by zero encountered in scalar divide", and these can also be seen in the Jupyter notebook example.
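For readers hitting the same warning, a generic way to locate its origin (unrelated to AdaptiveSplit's internals) is to promote the RuntimeWarning to an error so the traceback points at the exact offending division:

import warnings
import numpy as np

# NumPy reports scalar division by zero through the warnings machinery;
# turning it into an error makes the source line easy to find.
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        np.float64(1.0) / np.float64(0.0)
    except RuntimeWarning as err:
        print("caught:", err)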

    2. Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data pre-processing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs. Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any “sample size budget”, the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation. The proposed design and splitting approach (implemented in the Python package “AdaptiveSplit”) may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies.

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf036), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Original version

      Reviewer 1: Qingyu Zhao

      The manuscript discusses an interesting approach that seeks optimal data split for the pre-registration framework. The approach adaptively optimizes the balance between predictive performance of discovery set and sample size of external validation set. The approach is showcased on 4 applications, demonstrating advantage over traditional fixed data split (e.g., 80/20). I generally enjoyed reading the manuscript. I believe pre-registration is one important tool for reproducible ML analysis and the ideology behind the proposed framework (investigating the balance between discovery power and validation power) is urgently needed. My main concerns are all around Fig. 3, which represents the core quantitative analysis but lacks many details.

      1. Fig. 3 is mostly about external validation. What about training? For each n_total, which stopping rule is activated? What is the training accuracy? What does l_act look like? What is \hat{s_total}?
      2. Results section states "the proposed adaptive splitting strategy always provided equally good or better predictive performance than the fixed splitting strategies (as shown by the 95% confidence intervals on Figure 3)". I'm confused by this because the blue curve is often below other methods in accuracy (e.g., comparing with 90/10 split in ABIDE and HCP).
      3. Why does the half split have the lowest accuracy but the highest statistical power?
      4. How was the range of x-axis (n_total) selected? E.g., HCP has 1000 subjects, why was 240-380 chosen for analysis?
      5. The lowest n_total for BCW and IXI is approximately 50. If n_act starts from 10% of n_total, how is it possible to train (nested) cross-validation on 5 samples or so?

      Two other general comments are: 1. How can this be applied to retrospective data or secondary data analysis where the collection is finished? 2. Is there a guidance on the minimum sample size that is required to perform such an auto-split analysis? It is surprising that the authors think the two studies with n=35 and n=38 are good examples of training generalizable ML models. It is generally hard to believe any ML analysis can be done on such low sample sizes with thousands of rs-fMRI features. By the way, I believe n=25 in Kincses 2024 if I read it correctly.
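Several of these questions (which stopping rule is activated, whether the external validation is adequately powered, the minimum workable sample size) hinge on how power is computed for the held-out set. As a hedged illustration of the underlying trade-off, not the AdaptiveSplit package's actual API, here is the standard Fisher z-transform power approximation for detecting a correlation between predictions and outcomes in the external sample:

import numpy as np
from scipy.stats import norm

def validation_power(r, n, alpha=0.05):
    """Approximate one-sided power to detect correlation r with n samples,
    via the Fisher z-transform (normal approximation)."""
    if n <= 3:
        return 0.0
    z = np.arctanh(r) * np.sqrt(n - 3)
    return float(norm.cdf(z - norm.ppf(1 - alpha)))

# At an expected effect of r = 0.3, roughly 70 held-out participants give
# about 80% power; much fewer risks an inconclusive external validation.
for n_val in (30, 50, 70, 150):
    print(n_val, round(validation_power(0.3, n_val), 2))

Any adaptive stopping rule of this kind has to weigh the gain in model performance from more training data against the loss of power implied by a smaller held-out set, which is the tension the reviewer's questions probe.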

      Reviewer 2: Lisa Crossman

External validation of machine learning models - registered models and adaptive sample splitting, Gallitto et al. The manuscript describes a methodology and algorithm aimed at better choosing a train-test validation split of data for scikit-learn models. A Python package, adaptivesplit, was built as part of this MS as a tool for others to use. The package is proposed to be used together with a suggested workflow that invokes registered models, as a full design for better prospective modelling studies. Finally, the work is evaluated on four publicly available health research datasets, and comprehensive results are presented. There is a trade-off in the split between the amount of sample data used for training and the amount used for validation: ideally each subset must be balanced so that the trained model is representative and, equally, so that the validation set is representative. This manuscript is therefore very timely, given the large increase in the use of AI models, and it provides important information and methodology.

      This reviewer does not have the specific expertise to provide detailed comments on the statistical rule methods.

Main Suggested Revision: 1. The Python implementation of the "adaptivesplit" package is described as available on GitHub (Gallitto et al., n.d.). One of the major points of the paper is to provide the python package "adaptivesplit"; however, this package does not have a clear hyperlink, is not found by simple Google searches, and appears not to be available yet. It is therefore not possible to evaluate it at present. After further Google searches, there is a website with a preprint of this MS (https://pnilab.github.io/adaptivesplit/); however, adaptivesplit is shown there as an interactive jupyter-type notebook example and not as python library code. Therefore, it is not clear how available the package is for others' use. Can the authors comment on the code availability?

Minor comments: 1. Apart from the 80:20 Pareto split of train-test data, other splits are commonly used, such as 75:25 (the scikit-learn default split if the ratio is unspecified) and 70:30, as well as the cross-validation strategy with a 60:20:20 train-test-validation split; yet these strategies have not been mentioned or included in figures such as Fig. 3. The splits provided in the figure and discussed are 50:50, 80:20 and 90:10 only. Could the authors discuss alternative split ratios?
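For reference on the default ratio the reviewer mentions, a minimal scikit-learn snippet (synthetic arrays, for illustration only) showing the implicit 75:25 split and the other fixed ratios under discussion:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# With test_size unspecified, scikit-learn defaults to a 75:25 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(len(X_tr), len(X_te))  # 75 25

# The other fixed ratios mentioned in the review:
for test_size in (0.5, 0.3, 0.25, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    print(f"{1 - test_size:.0%}:{test_size:.0%} -> {len(X_tr)}/{len(X_te)}")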

1. To truly understand the cancer biology of heterogeneous tumors in the context of precision medicine, it is crucial to use analytical methodology capable of capturing the complexities of multiple omics levels, as well as the spatial heterogeneity of cancer tissue. Different molecular imaging techniques, such as mass spectrometry imaging (MSI) and spatial transcriptomics (ST) achieve this goal by spatially detecting metabolites and mRNA, respectively. To take full analytical advantage of such multi-omics data, the individual measurements need to be integrated into one dataset. We present MIIT (Multi-Omics Imaging Integration Toolset), a Python framework for integrating spatially resolved multi-omics data. MIIT’s integration workflow consists of performing a grid projection of spatial omics data, registration of stained serial sections, and mapping of MSI-pixels to the spot resolution of Visium 10x ST data. For the registration of serial sections, we designed GreedyFHist, a registration algorithm based on the Greedy registration tool. We validated GreedyFHist on a dataset of 245 pairs of serial sections and reported an improved registration performance compared to a similar registration algorithm. As a proof of concept, we used MIIT to integrate ST and MSI data on cancer-free tissue from 7 prostate cancer patients and assessed the spot-wise correlation of a gene signature activity for citrate-spermine secretion derived from ST with citrate, spermine, and zinc levels obtained by MSI. We confirmed a significant correlation between gene signature activity and all three metabolites. To conclude, we developed a highly accurate, customizable, computational framework for integrating spatial omics technologies and for registration of serial tissue sections.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf035), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Revision 1 version

      Reviewer 1: Hua Zhang

      The quality of this manuscript has significantly improved in this revision. I appreciate the author's effort in thoroughly addressing all concerns and comments.

      Reviewer 2: Santhoshi Krishnan

      All my concerns have been adequately addressed by the authors and I have no further questions.

2. To truly understand the cancer biology of heterogeneous tumors in the context of precision medicine, it is crucial to use analytical methodology capable of capturing the complexities of multiple omics levels, as well as the spatial heterogeneity of cancer tissue. Different molecular imaging techniques, such as mass spectrometry imaging (MSI) and spatial transcriptomics (ST) achieve this goal by spatially detecting metabolites and mRNA, respectively. To take full analytical advantage of such multi-omics data, the individual measurements need to be integrated into one dataset. We present MIIT (Multi-Omics Imaging Integration Toolset), a Python framework for integrating spatially resolved multi-omics data. MIIT’s integration workflow consists of performing a grid projection of spatial omics data, registration of stained serial sections, and mapping of MSI-pixels to the spot resolution of Visium 10x ST data. For the registration of serial sections, we designed GreedyFHist, a registration algorithm based on the Greedy registration tool. We validated GreedyFHist on a dataset of 245 pairs of serial sections and reported an improved registration performance compared to a similar registration algorithm. As a proof of concept, we used MIIT to integrate ST and MSI data on cancer-free tissue from 7 prostate cancer patients and assessed the spot-wise correlation of a gene signature activity for citrate-spermine secretion derived from ST with citrate, spermine, and zinc levels obtained by MSI. We confirmed a significant correlation between gene signature activity and all three metabolites. To conclude, we developed a highly accurate, customizable, computational framework for integrating spatial omics technologies and for registration of serial tissue sections.

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf035), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Original Submission Reviewer 1: Hua Zhang

Wess et al. report a Python framework, MIIT (Multi-Omics Imaging Integration Toolset), for integrating spatially resolved multi-omics data. Multi-omics imaging represents a pivotal approach for systems molecular biology and biomarker discovery, and this method introduces a timely and valuable tool to advance the field. However, in my opinion, this paper still has some issues that need to be addressed before consideration for publication. Cancer tissue exhibits significant heterogeneity; in this study, different molecular information is obtained from different tissue sections, which means from different cells, as the tissue sections are 10 μm thick, almost the diameter of a cell. Please highlight the meaningfulness of co-registered information if it is obtained from different cell layers. In particular, for the spatial transcriptomics and MSI datasets, the experiments were conducted on serial sections with an axial sectioning distance of 40 to 100 μm. This means that the mRNA and metabolites originate from different cells, raising questions about how integrating these two datasets can provide meaningful insights. The multi-omics imaging integration toolset is based on GreedyFHist, a non-rigid registration algorithm; I suggest including more details about this algorithm and highlighting the differences compared to previously reported non-rigid image co-registration algorithms. The authors should demonstrate the accuracy of the background segmentation; there is a concern that certain low-signal sample areas would be removed in the denoising step. What is the criterion for defining the background region and the sample region in the background segmentation?

In the Method section, more details need to be included in the spatial transcriptomics part, such as what spatial resolution of the 10x Genomics platform was used. As the MALDI resolution is 30 µm, how are the pixels of the ST and MSI data aligned if their spatial resolutions differ? In the MALDI-MSI of prostate tissue, on-tissue MS/MS data are missing to confirm the identification of the target analytes citrate, ZnCl3-, and spermine.

**Reviewer 2: Santhoshi Krishnan**

Overview: In this paper, the authors present the Multi-Omics Imaging Integration Toolset, which is a Python framework for integrating multiple spatial omics datatypes. To facilitate this, they also developed a registration method (GreedyFHist) for jointly analyzing sequential tissue layers that have undergone different types of staining/phenotyping regimens. The method validation was done on 244 fresh-frozen prostate tissue sections. The highly detailed methods and results sections are well appreciated and help fully contextualize the significance of the study. The definitions of study-specific terms at the beginning of the paper are also appreciated. Data and Code Availability: Detailed code, tutorials and associated instructions have been made available for use by the public, which is appreciated. All system requirements have also been explicitly laid out for ease of installation and use. The workflow examples provided are quite detailed; however, a more extensive codebase with stepwise explanations within the code would be appreciated. Data has not been made available publicly, except for the raw and processed spatial transcriptomics data; however, detailed and explicit instructions have been provided on data access, keeping in mind local regulations. Revisions: Major Revisions: 1. In recent years, a lot of other platforms, both free and paid, tend to support registration across multiple slides. For example, HALO has a registration feature available as well, along with a host of other open-source tools. In that regard, how is your platform different? 2. It is mentioned that downscaling occurs during the registration process in order to reduce runtime - how are nuances in features selected as registration landmarks preserved in such a case? 3. How is the fixed image determined in this case? The assumption would be that a standard H&E image is selected for this purpose - is that assumption correct? 4. The authors have stated and justified their rationale for using the mentioned evaluation metrics in the paper. However, in the general image registration space, metrics such as the Dice coefficient and Jaccard index are commonly used and accepted (see the sketch after this review). Is there a particular reason why these were not used as well? It would offer a more complete picture for the general user if these metrics were provided as well. 5. The validation of registering distant neighboring sections is quite a valuable contribution, as the authors rightly stated that in many multi-omics experiments, this might be a necessity. However, when looking at tissue sections that are 80-100 microns apart, it is quite likely that the set of cells that one may be looking at on the x-y coordinate system may not be the same at all; in fact, for a highly heterogeneous/flexible piece of tissue, they might be completely different. In such a circumstance, how much value is there in registering these two sections together instead of, say, separately analyzing them and using alternative methods to combine the results downstream? 6. In the proof of concept presented in the paper, the authors mention using ST and MSI data for validating their framework. Have they also investigated ST integration with more commonly available datatypes such as IHC/mIF? 7. The work that the authors have put in to validate the registration and MIIT framework using different approaches (selecting spatially distant slides, integration using augmented/artificial data) is thorough.
However, different tissue types bring in their own challenges, and thus validation of this framework on an external dataset would lend more credence to this much needed framework, especially in the era of increased multiomics analyses.
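Regarding point 4 above, the suggested overlap metrics are straightforward to compute from binary tissue masks; a minimal sketch (not part of the authors' pipeline), assuming two equally shaped boolean masks:

```python
import numpy as np

def dice_jaccard(mask_fixed, mask_moving):
    """Dice coefficient and Jaccard index between two boolean masks,
    e.g. the fixed section's tissue mask vs. the warped moving mask."""
    a, b = np.asarray(mask_fixed, bool), np.asarray(mask_moving, bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum()), inter / union
```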

Minor Revisions: 1. Please ensure all typos/grammatical mistakes are corrected. 2. In the 'preprocessing of stained histology images', can more details be given on the thresholding process? It is also stated that the threshold is manually adjusted for each image if necessary - how is this determination made? 3. The headings/subheadings within sections could be organized in a more orderly manner; in some parts it was challenging to determine the organization of sections/subsections. 4. Can some more details be given on the landmarks that were identified per image? Could some examples be provided of what these landmarks are, and how they remain consistent across tissue layers? 5. Currently, the way the various samples used for validating the GreedyFHist and MIIT frameworks are listed out in the paper is quite confusing. It would be appreciated if the authors could distinctly mention the number of samples out of the set of samples, and the associated stained slides used for each. 6. How were the annotations from the 3 annotators cross-validated?

  6. May 2025
    1. Editors Assessment:

This Data Release paper presents the first genome assembly of the lemon sole (Microstomus kitt), a commercially important flatfish found in European coastal waters. It is also interesting that this work was carried out in a university course setting involving the students. The resulting chromosome-level genome was assembled using long-read PacBio HiFi sequencing and the Hi-C technique. The 628 Mbp reference (which is consistent with other Pleuronectidae fish species) is assembled into 24 chromosome-length scaffolds with high completeness, achieving a scaffold N50 of 27.2 Mbp. Peer review and data curation led the authors to clarify a few points and share all of the data and results in an open and well-curated manner. The annotated genome of the lemon sole, with its high continuity, should therefore provide important reference data for future population genetic analyses and conservation strategies for this organism.

      This evaluation refers to version 1 of the preprint

    2. AbstractBackground The lemon sole (Microstomus kitt) is a culinary fish from the family of righteye flounders (Pleuronectidae) inhabiting sandy and shallow offshore grounds of the North Sea, the western Baltic Sea, the English Channel, the shallow waters of Great Britain and Ireland as well as the Bay of Biscay and the coastal waters of Norway.Findings Here, we present the chromosome-level genome assembly of the lemon sole. We applied PacBio HiFi sequencing on the PacBio Revio system to generate a highly complete and contiguous reference genome. The resulting assembly has a contig N50 of 17.2 Mbp and a scaffold N50 of 27.2 Mbp. The total assembly length is 628 Mbp, of which 616 Mbp were scaffolded into 24 chromosome-length scaffolds. The identification of 99.7% complete BUSCO genes indicates a high assembly completeness.Conclusions The chromosome-level genome assembly of the lemon sole provides a high-quality reference genome for future population genomic analyses of a commercially valuable edible fish.
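As a side note on the statistics quoted in the abstract, N50 is computed directly from the sorted contig or scaffold lengths; a minimal sketch (not the tool used by the authors):

```python
def n50(lengths):
    """Smallest length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# four scaffolds of 25, 20, 10 and 5 Mbp -> N50 = 20
print(n50([25, 20, 10, 5]))  # 20
```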

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.156), and has published the reviews under the same license.

      Reviewer 1. Alejandro Mechaly

      Are all data available and do they match the descriptions in the paper? No. The BioProject number is not included in the submitted manuscript.

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. The BioProject number is not included in the submitted manuscript.

      Comments: The paper presents a valuable contribution to the genomics of Microstomus kitt (lemon sole), a commercially important species. The study introduces a chromosome-level genome assembly using PacBio HiFi sequencing, resulting in a highly contiguous assembly with 99.7% completeness in BUSCO genes. This high-quality genome will serve as a key resource for future population genomics and aquaculture studies. Overall, this assembly offers a solid foundation for advancing research on the biology and management of lemon sole. The main critique of this study is that, while it highlights the sexual dimorphism in lemon sole, where females are larger than males, it does not delve into this aspect in detail. Although the research presents valuable data through a high-quality chromosomal-level genome assembly, it focuses exclusively on male specimens. Comparing the genomes of both sexes would be highly insightful, potentially revealing the genetic mechanisms or pathways underlying this dimorphism through comparative genomics. Recent studies on flatfish (Villarreal et al., 2024. https://doi.org/10.1186/s12864-024-10081-z) have used comparative genomics to examine sex determination genes, and applying this approach to lemon sole would significantly enhance the study’s impact. Furthermore, there are numerous sequenced flatfish genomes that should be analyzed alongside these results to provide a more comprehensive context.

Re-review: Thank you for addressing my comments. While I understand the study's limitations, including its focus as part of a university course and the use of a single specimen, I believe the manuscript lacks sufficient impact without exploring the genetic basis of sexual dimorphism or incorporating comparative analyses with other flatfish genomes. The genome assembly and annotation are well-executed, but the absence of biological context limits the broader relevance of the work. Sexual dimorphism in lemon sole, a commercially important species, is a key topic that could inform aquaculture and fisheries management. Without addressing this, the manuscript misses an opportunity to answer important scientific questions. For these reasons, I cannot recommend the manuscript for publication in its current form. While the technical work is solid, additional analyses or a broader scope are needed to enhance its contribution to the field.

      Reviewer 2. Yongshuang Xiao

This MS presents the chromosome-level genome assembly of Microstomus kitt, a species belonging to the Pleuronectidae family and mainly distributed in the northern European seas. The study utilized PacBio HiFi sequencing technology combined with Hi-C data for chromosome-level assembly, resulting in a high-quality reference genome of approximately 633 Mb, including 23 chromosome-length scaffolds and 99.7% complete BUSCO genes, demonstrating high assembly completeness and gene annotation quality. Further analysis revealed abundant repetitive sequences and gene features in the lemon sole genome, providing important resources for future genetic studies of this species and its close relatives. The paper presents several issues, as follows: 1. From the evaluation of the genome, the estimated size is around 542 Mb, while the manually curated Hi-C results yielded a genome size of 633 Mb. The authors are requested to explain why there is a difference of nearly 100 Mb between the second-generation sequencing evaluation and the third-generation results. 2. Utilizing PacBio HiFi sequencing technology, which generates long reads, and its associated assembly software, the authors were able to assemble the genome at the chromosome level. The authors explicitly state that the size of the genome assembled into 23 chromosome-level scaffolds using the YaHS and Chromap software is around 500 Mb, which is consistent with the genome survey results. How do the authors know that the assembled genome is erroneous? 3. Based on the authors' description, it is not clear what the size of the assembled genome from a single chain using PacBio sequencing is. The authors need to provide this data in the results. 4. The authors performed quality assessments of the assembled genome using various methods such as Merqury. However, the description of the evaluation results is lacking. The authors are requested to include the QV evaluation values and additional results of SNP alignment for the second-generation sequencing data. 5. For gene annotation, the authors used the genomes of five species of Pleuronectidae as references. We are eager to see the results of the alignment analysis between the genome obtained using PacBio Revio and the aforementioned five fish genomes. Although these results do not need to be included in the main text, they should be provided as part of the response to the reviewers, including the alignment results and alignment rates for both sets of assembled genomes (500 Mb and 633 Mb). 6. The authors are requested to include the length information of each chromosome in the supplementary files. From the assembly results, it appears that the PacBio Revio results are not as impressive as anticipated, particularly with a scaffold N50 of 29.4 Mbp. Is this due to limitations in the length of the chromosomes themselves, affecting the quality metrics of this genome? 7. The data should be uploaded to NCBI to obtain the corresponding accession code.

Re-review: This study aims to perform chromosome-level genome assembly of the lemon sole (Microstomus kitt) and conduct a comprehensive analysis of its genome using high-throughput sequencing technology. The researchers utilized PacBio HiFi sequencing technology to carry out whole-genome sequencing of this species, resulting in a high-quality and complete genome sequence. The genome sequence has a length of 633 Mbp, with 23 chromosome-level sequences successfully assembled. Additionally, BUSCO analysis indicated that this genome sequence possesses a high level of completeness. These results suggest that the lemon sole genome sequence can serve as an important reference for future population genetic studies of commercially valuable edible fish species. However, there are certain issues with the paper that need to be addressed: The authors emphasize that female lemon soles grow larger than males, yet they chose to sequence the male genome instead of focusing on the more unique female. The authors should clarify this choice. The Hi-C-assisted assembly results show that male lemon soles have 23 chromosome pairs. Are there any heteromorphic chromosomes? The authors need to elucidate the karyotype of the lemon sole, as this information is significant for both the genome assembly and subsequent research. The survey results indicate a high level of heterozygosity in lemon sole. How did the authors account for this high heterozygosity to obtain a relatively complete genome? Could this affect the accuracy of the genome? Although the authors achieved high-quality genome results through PacBio sequencing, they used BUSCO for genome quality assessment. To further highlight the completeness and accuracy of the assembled genome, it is recommended that the authors utilize QV for additional evaluation. To ensure high levels of data sharing and reproducibility, the authors are requested to provide the chromosome-level genome FASTA file and GFF annotation file. In summary, the authors are encouraged to provide additional information and make necessary revisions.
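For context on the QV evaluation both reviewers request: Merqury-style consensus quality converts the fraction of assembly k-mers supported by the read set into a Phred-scaled, per-base accuracy. A sketch of that conversion, assuming `k_shared` and `k_total` counts taken from a k-mer database, is shown below.

```python
import math

def merqury_style_qv(k_shared, k_total, k=21):
    """QV = -10*log10(E), where the per-base error E is derived from the
    fraction of assembly k-mers found in the reads:
    E = 1 - (k_shared / k_total) ** (1 / k)."""
    error = 1.0 - (k_shared / k_total) ** (1.0 / k)
    return -10.0 * math.log10(error)

# e.g. 999,999,000 of 1,000,000,000 assembly k-mers found in the reads
print(merqury_style_qv(999_999_000, 1_000_000_000))  # ~73, i.e. ~1 error per ~20 Mbp
```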

1. **Editors Assessment:** Sinocyclocheilus is a genus of freshwater cavefish endemic to the karst regions of Southwest China. They have diverse traits in morphology, behavior, and physiology typical of cavefish, which make them interesting models for studying cave adaptation and phylogenetic evolution. The manuscript assembled chromosome-level genomes of five Sinocyclocheilus species and conducted allotetraploid origin analysis on these species. S. grahami (the golden-line barbel) was assembled using PacBio and Hi-C sequencing technologies, yielding a final chromosome-level genome assembly of 1.6 Gb in size with a contig N50 of 738.5 kb and a scaffold N50 of 30.7 Mb, with 93.1% of the assembled genome sequences and 93.8% of the predicted genes anchored onto 48 chromosomes. Subsequently, the authors conducted a homologous comparison to obtain chromosome-level genome assemblies for four other Sinocyclocheilus species: S. maitianheensis, S. rhinocerous, S. anshuiensis, and S. anophthalmus, with over 82% of the genome sequences anchored on these constructed chromosomes. Peer review prompted clarification of the assembly strategy and additional benchmarking. These data have the potential to contribute to species conservation and to the exploitation of the potential economic and ecological values of diverse Sinocyclocheilus members.

      This evaluation refers to version 1 of the preprint

2. ABSTRACTSinocyclocheilus, a genus of tetraploid fishes, is endemic to the karst regions of Southwest China. All species within this genus are classified as second-class national protected species due to their unique and fragile habitat. However, the absence of high-quality genomic resources has hindered various research efforts to elucidate their phylogenetic relationships and the origin of polyploidy. To address these academic challenges, we first constructed a high-quality genome assembly for the most abundant representative, the golden-line barbel (Sinocyclocheilus grahami), by integrating PacBio long-read and Hi-C sequencing technologies. The final scaffold-level genome assembly of S. grahami is 1.6 Gb in length, with a scaffold N50 of up to 30.7 Mb. A total of 42,205 protein-coding genes were annotated. Subsequently, 93.1% of the assembled genome sequences (about 1.5 Gb) and 93.8% of the total predicted genes were successfully anchored onto 48 chromosomes. Furthermore, we obtained chromosome-level genome assemblies for four other Sinocyclocheilus species (including S. anophthalmus, S. maitianheensis, S. anshuiensis, and S. rhinocerous) based on homologous comparison. The genomic data we present in this study provide valuable genetic resources for in-depth investigation of cave adaptation and for improving the economic value and conservation of diverse Sinocyclocheilus fishes.

      Reviewer 1. Jun Wang

The manuscript assembled chromosome-level genomes of five Sinocyclocheilus species and conducted allotetraploid origin analysis on these species. The manuscript is meaningful and provides valuable genome resources for the Sinocyclocheilus genus, which will further help with the evolutionary and functional genomics of these species. The analysis was accurate, and the results were solid. My comments are as follows:

1. Please detail the method by which you assembled the four other species based on homologous comparison. Did you just map the assembled scaffolds to the reference genome?
2. In the manuscript, the authors only provide the sequencing information for S. grahami and not for the other four species. What is the sequencing information for the other four species, e.g., how many reads were sequenced with Illumina?
3. There is no results description for Figure 2, and why are there only repeat annotation results for S. grahami and not the other four species?

      Reviewer 2. Fei Li and Shili Li

This paper entitled “Chromosome-level genome assemblies of five Sinocyclocheilus species” reported a chromosome-level golden-line barbel genome using a combination of PacBio and Hi-C data. Using this chromosome-level assembly as a reference, the authors also constructed four other pseudo-chromosome-level assemblies of S. anophthalmus, S. maitianheensis, S. anshuiensis, and S. rhinocerous. These data are a really important resource for the conservation of these endangered species. However, some important results have not been shown: 1. The protein BUSCO result has not been shown. 2. Raw reads were not uploaded to NCBI. 3. What are the detailed numbers for the functional annotation?

      Some minor suggestions: Add “,” before “and conservation”. What’s the meaning of “R & D”? Line 58, “a good model” should be “good models”. Line 64, remove “at first”. Line 84, change “a” to “the”. Line 90, change ‘muscle’ to “muscle tissue”. Line 105, remove ‘which was’. Line 112, remove ‘this study’. Line 122, change “Repeat annotation, gene prediction, and function prediction” to “Annotation of repeat, gene and function”. Line 137, ‘with’ should be ‘by using’. Line 127, remove ‘(TEs)’. Line 134, What’s meaning of NCBI GenBank? Remove GenBank. Line 140, ‘was’ should be ‘were’. Line 178, ‘Species’ should be ‘species’.

1. Leveraging multiplex multi-omic networks has uncovered key insights into the genetic and epigenetic mechanisms supporting biofuel production. Here, we introduce RWRtoolkit, a multiplex generation, exploration, and statistical package built for R and command line users. RWRtoolkit enables the efficient exploration of large and highly complex biological networks generated from custom experimental data and/or from publicly available datasets, and is species agnostic. A range of functions can be used to find topological

Reviewer name: Francis Agamah Reviewer Comments: The paper introduces a species-agnostic random walk with restart toolkit built for R and command line users. The tool enables construction of multiplex networks from any set of data layers and enables the discovery of gene-to-gene relationships. The tool offers a collection of functions for network analysis. Overall, the tool is a significant contribution to network analysis. Major Comments: The manuscript's background section should provide a more comprehensive overview of the rationale behind the development of RWRtoolkit. It should clearly outline the existing RWR implementation tools, identify the gaps in these tools, and explain how RWRtoolkit addresses these limitations or offers a new approach. To demonstrate the effectiveness of RWRtoolkit, the authors could evaluate the ranking performance against other established random walk with restart algorithms that can handle heterogeneous multiplex networks. Additionally, a detailed explanation of the scoring approach implemented in RWRtoolkit is necessary to justify its choice and potential advantages. The authors have indicated in the section "network layer and multiplex statistics" that the tau parameter affects the probability of the walker visiting each specific layer. To address potential bias issues in the network exploration, it would be beneficial to provide an exploration of the parameter space and indicate how it informs the stability of the RWR output scores under variations of the various algorithm parameters.
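For readers unfamiliar with the algorithm at the core of RWRtoolkit, a single-layer random walk with restart iterates p ← (1 − r)·P·p + r·p0 until convergence. The sketch below is illustrative only: RWRtoolkit itself operates on multiplex networks, where the tau parameter discussed above additionally weights the walker's jumps between layers, and the restart value of 0.7 here is an arbitrary assumption.

```python
import numpy as np

def rwr(W, seeds, restart=0.7, tol=1e-8, max_iter=1000):
    """Random walk with restart on a weighted adjacency matrix W (n x n).
    Returns the stationary visit probability of every node."""
    col_sums = W.sum(axis=0)
    P = W / np.where(col_sums == 0, 1, col_sums)   # column-stochastic transitions
    p0 = np.zeros(W.shape[0])
    p0[seeds] = 1.0 / len(seeds)                   # restart distribution over seed nodes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * P @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:         # L1 convergence check
            return p_next
        p = p_next
    return p
```

Nodes are then ranked by their visit probabilities to prioritise genes relative to the seed set.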

    1. model, a de novo prediction model, a combination of library search and de novo prediction, and MS2 clustering—to estimate the number of unique structures. Our methods suggest that the number of unique compounds in the metabolomics dataset alone may already surpass existing estimates of plant chemical diversity. Finally, we project these findings across the entire plant kingdom, conservatively estimating that the total plant chemical space likely spans millions, if not more, with the vast majority still unexplored.

Reviewer name: Kohulan Rajan Reviewer Comments: Review: Defining the limits of plant chemical space: challenges and estimations. This work presents an important contribution to understanding the chemical diversity of plants through a systematic analysis combining metabolomics data and literature mining. The authors address a question in the field and employ multiple complementary approaches to estimate the size of the plant chemical space. Here are a few suggestions and questions for the authors to clarify: 1. When introducing an abbreviation, one could use capital letters: "Natural Products (NP)". 2. There is no list of abbreviations in the document, so introduce them first and then use them. There may be some readers who are unfamiliar with the terms COCONUT and LOTUS. 3. Is there any prior work using similar combined metabolomics/literature approaches to estimate plant chemical space? If so, these should be cited. If not, please state this explicitly to highlight the novelty of your method. 4. Cite SMILES. 5. While the paper describes the use of 'literature datasets,' it appears that only existing databases (COCONUT and LOTUS) are being utilized. It would be helpful if the authors could clarify whether any direct literature mining was conducted. If not, consider revising terminology to more accurately reflect the use of curated databases rather than primary literature sources. 6. Great to see the data and code openly shared on both Zenodo and GitHub. I also find the GitHub repository very useful with regard to all the provided notebooks. To maximize reusability, please consider adding a detailed "How to Use" section to the README that guides others in replicating or building upon this work. 7. The different clustering thresholds (0.7 vs 0.8) lead to notably different estimates. Could you discuss which threshold might be more appropriate for this specific application to plant metabolomics data?

    2. The plant kingdom, encompassing nearly 400,000 known species, produces an immense diversity of metabolites, including primary compounds essential for survival and secondary metabolites specialized for ecological interactions. These metabolites constitute a vast and complex phytochemical space with significant potential applications in medicine, agriculture, and biotechnology. However, much of this chemical diversity remains unexplored, as only a fraction of plant species have been studied comprehensively. In this work, we estimate the size of the plant chemical space by leveraging large-scale metabolomics and literature datasets. We begin by examining the known chemical space, which, while containing at most several hundred thousand unique compounds, remains sparsely covered. Using data from over 1,000 plant species, we apply various mass spectrometry-based approaches—a formula prediction

Reviewer name: Carlos Rodríguez-López Reviewer Comments: In the reviewed manuscript, Chloe Engler Hart et al. utilize different approaches to estimate the size of plant chemical space through analysis of publicly available datasets of mass spectrometry-based metabolomics. The authors tackle this issue by using data from ca. 2,000 LC-MS runs and different formula predictors and structure annotation algorithms, and extrapolate to the estimated number of plant species. While the approach is useful for estimating structural variation, and the collected data and here-published source code can certainly be of use to the plant metabolomics community, I consider that the manuscript requires modifications before it can be recommended for publication. Particularly, the language of the article should more accurately reflect the nature of this estimate; for example, mentions of the approach being "the most accurate estimate possible" (p.8, section 3.2) are not supported, and throughout the article, mentions of the calculation as a "conservative estimate" are not consistent with the approaches used, beyond formula prediction. E.g. it is mentioned that the MS2 curve being lower than formula prediction suggests that the curves may be conservative, without further clarification on why this might be the case and not, e.g., a product of estimate dispersion. The authors mention that since they identify most limitations (in Table 2, p. 13) as underestimations (again, with limited or no explanation), their estimate is conservative. Since no effect size can be calculated for these limitations, this statement is not true; e.g. if the approach is missing half of molecules due to extraction, and another half due to tissue coverage (total, ¼), but overestimating the plateau of plant chemical diversity by 100-fold, even if more factors underestimate the chemical space, the effect size of the latter would be dominant by far. I recommend the authors change mentions of this estimate being a conservative approach, and instead clearly mention that this is a fragmentation-based estimate, or a similar term that better reflects the nature of the figure. Similarly, assumptions of the models should be explicitly stated, along with their limitations. The authors, for example, rely on CID-induced fragmentation, and they mention that the estimate "[relies] on the predominant adduct ([M+H]+)" (p.15) and thus "this likely underestimates the true chemical diversity, as other adduct forms" (p.15). It should be stated that this is an assumption: the authors do not have evidence for the adducts being [M+H]+, which is nigh impossible with the available data; they are assuming all features are [M+H]+ adducts. This carries the implicit assumption that fragmentation mechanisms will be the same for all MS2 spectra and thus structural diversity can be estimated through MS2 clusters. It is unclear how this would yield an underestimation, as the authors claim; rather, it yields an overestimation, as fragmentation of [M+H]+ and e.g. [M+Na]+ adducts of the same molecule would yield different fragmentation patterns, given the former favors charge-migration-dependent mechanisms compared to the latter. Thus, since the authors consider all features to be [M+H]+, two adducts of the same molecule might be considered as different moieties, given that fragmentation patterns will differ, even if no difference exists.
In the same vein, since similarity thresholds of the MS2Mol algorithm are essential for the estimation of diversity, the authors should clearly state how they are calculated in the text, not by reference, along with potential limitations. Finally, I believe the work would greatly benefit from including data on the phylogenetics of the samples, adding diversity estimates to their sample and extrapolation data. If, for example, most of the 400,000 plant species are phylogenetically distant from the sampled species, then the reader can reasonably assume that this might be an underestimation of chemical diversity when presented with the evidence. If, on the other hand, the original sample has more diversity than the total number of plant species, this might not be the case. In any case, all of the relevant assumptions should be clearly stated. Minor note: One of the main arguments for extrapolating the diversity estimate to the rest of the plants comes from Figure 3D, where the number of MS1 adducts increases with the number of samples; it would greatly help explain the difference seen between species if the authors clarified the tissues sampled per species. E.g. if the species that only doubles the number of features contains only aerial and vegetative tissue, compared to the species that increases 6-fold, which might include root or reproductive tissue, etc. This might also help the authors in justifying the extrapolation of the estimate.
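On the reviewer's point about clustering thresholds (0.7 vs 0.8), the sensitivity of the "unique structure" count to the cut-off can be illustrated with a toy leader-clustering sketch; this is not the authors' MS2 pipeline, and `sim` stands in for any precomputed pairwise similarity matrix.

```python
def count_clusters(sim, threshold):
    """Greedy leader clustering: scan items in order and open a new
    cluster whenever an item is not similar enough (>= threshold)
    to any existing cluster representative; return the cluster count."""
    reps = []
    for i in range(len(sim)):
        if not any(sim[i][r] >= threshold for r in reps):
            reps.append(i)
    return len(reps)

# a higher threshold merges fewer items, so 0.8 yields more apparent
# unique structures than 0.7 on the same similarity matrix
```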

  7. Apr 2025
    1. Editors Assessment:

With the recent official launch of BGI’s new CycloneSEQ sequencing platform that delivers long reads using novel nanopores, this paper presents benchmarking data and validation studies comparing short- and long-read data from other platforms and hybrid assemblies. This study tests the performance of the new platform in sequencing diverse microbial genomes, presenting raw and processed data to enable others to scrutinise and verify the work. Being openly peer-reviewed, and having scripts and protocols shared for the first time, helps provide transparency in this benchmarking process and increases trust in this new technology. On top of benchmarking typed strains, the technology was also tested with complex microbial communities, yielding complete metagenome-assembled genomes (MAGs) that were not achieved by short- or long-read assemblies alone. By directly reading DNA molecules without fragmentation, the study demonstrates that CycloneSEQ delivers long-read data with impressive length and accuracy, unlocking gaps that short-read technologies alone cannot bridge. Future work is expanding to real samples and fine-tuning the balance between short-read and long-read data for even faster, higher-quality assemblies.

      This evaluation refers to version 1 of the preprint

2. AbstractBackground Current microbial sequencing relies on short-read platforms like Illumina and DNBSEQ, favored for their low cost and high accuracy. However, these methods often produce fragmented draft genomes, hindering comprehensive bacterial function analysis. The sequencing performance and assembly improvements of CycloneSEQ, a novel long-read sequencing platform developed by BGI-Research, have been evaluated.Findings Using CycloneSEQ long-read sequencing, the type strain produced long reads with an average length of 11.6 kbp and an average quality score of 14.4. After hybrid assembly with short-read data, the assembled genome exhibited an error rate of only 0.04 mismatches and 0.08 indels per 100 kbp compared to the reference genome. This method was validated across 9 diverse species, successfully assembling complete circular genomes. Hybrid assembly significantly enhances genome completeness by using long reads to fill gaps and accurately assemble multi-copy rRNA genes, which cannot be achieved with short reads alone. Through data subsampling, we found that over 500 Mbp of short-read data combined with 100 Mbp of long-read data can result in a high-quality circular assembly. Additionally, using CycloneSEQ long reads effectively improves the assembly of circular complete genomes from mixed microbial communities.Conclusions CycloneSEQ’s read length is sufficient for circular bacterial genomes, but its base quality needs improvement. Integrating DNBSEQ short reads improved accuracy, resulting in complete and accurate assemblies. This efficient approach can be widely applied in microbial sequencing.
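As context for the read quality figure quoted above, Phred scores convert to expected per-base error rates via P = 10^(-Q/10); a one-function sketch:

```python
def phred_to_error(q):
    """Expected per-base error probability for a Phred-scaled quality Q."""
    return 10 ** (-q / 10)

print(f"{phred_to_error(14.4):.2%}")  # ~3.63% expected errors per base at Q14.4
```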

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.154), and has published the reviews under the same license.

      Reviewer 1. Ryan Wick

      This manuscript introduces CycloneSEQ data as a means for producing complete bacterial genome assemblies, with a focus on hybrid assemblies made using a combination of CycloneSEQ data and DNBSEQ data. It also publicly provides deep CycloneSEQ+DNBSEQ read sets for a range of bacterial species.

      Major comments

      The reads for the project were made publicly available via CNGBdb (https://db.cngb.org/search/project/CNP0006129), but I found it to be unusably slow (both the HTTP website and the FTP data downloads). To ensure the data is accessible to a wide audience, I request that it also be hosted in another location to make it available to readers. For example, SRA, ENA or GigaDB.

      The paper makes no mention of the other major long-read platforms: Oxford Nanopore Technologies and Pacific Biosciences. Given the widespread use of these platforms (especially ONT) in bacterial genome assembly, some discussion on CycloneSEQ’s relative advantages or limitations would be beneficial.

      Minor comments

      Lines 100-103: this sentence (‘The GC content was sensitively affected…’) is not clear to me. How are the completeness and accuracy of the assembly affecting GC content?

      Figure S2 unnecessarily includes reference-vs-reference difference counts, which are by definition zero.

      Figure S2 could mention the genome (Akkermansia muciniphila ATCC BAA-835) in the caption – I did not immediately understand what 'for type strain' meant.

      I found Figure 5 difficult to read, with its use of colour to indicate accuracy. This data would be better shown using another visualisation (e.g. bar plot) that more clearly shows quantitative values.

      For the mixed microbial community analysis, it should be stated that Unicycler is exclusively designed for bacterial isolates (its documentation explicitly says to not use it on metagenomes).

      Some of the supplementary figures are erroneously labelled 'Supplementary Table'.

      Some stats on the metagenomic reads would be helpful: e.g. total bp for short and long reads, N50 for long reads, etc.

The methods describe using seqtk, but the reference for this (#25) is SeqKit (a different tool), so either the tool in the methods or the reference is wrong.

Re-review: Thank you for the revisions to the manuscript. While many of my minor comments have been addressed, I still have concerns regarding my major comments, which have not been fully resolved.

      First, I appreciate that the data has now been made available on NCBI. However, the long-read datasets are labelled as Oxford Nanopore MinION data, which is misleading (example: SRR31850034). I understand this may be because SRA does not yet provide CycloneSEQ as a platform option, but this can be clarified through additional metadata. Specifically, the ‘design’ field for each SRA entry simply says ‘genome’, but it could have more detail, including that these are CycloneSEQ reads. The BioProject (PRJNA1194773) description could also include a clear statement that the long-read data is generated using CycloneSEQ.

      Second, I had requested a brief discussion of existing long-read platforms (ONT and PacBio) to provide context on where CycloneSEQ fits into the broader sequencing landscape. The authors have chosen not to include this, stating that they do not have direct comparison data. While I understand that such a comparison is not the purpose of this paper, I still believe that some mention of these platforms is necessary in the Background and/or Discussion sections. This paper introduces a new long-read technology for bacterial genome assembly, and readers will naturally want to understand how it relates to widely used alternatives.

      Finally, regarding my comment about supplementary figure labels, I still see the issue in the revised version provided for review. For example, the caption for Supplementary Figure S3 begins with ‘Supplementary Table S3.’ The authors stated that there were no errors, but this mislabelling remains in the PDF I received.

      As these concerns remain unresolved, I do not consider the manuscript acceptable in its current form.

      Reviewer 2. Keith Robison

As Open Source Software, are there guidelines on how to contribute, report issues or seek support on the code?

      N/A - no software presented (relates to other software questions)

      Additional comments: This is a useful presentation of an emerging sequencing platform.

      Given the complex nature of nanopore signals and the difficulty of decoding them, it has been a pattern with the prior nanopore platform that improvements in basecalling software have yielded significant changes in basecalling performance. Therefore, it would be highly advantageous if the manuscript listed which specific versions / revision numbers of the basecalling software were used so that these results are properly contextualized for comparison to future results which may use newer basecalling software.

Ideally, the publication would include a link to a git (or similar) repository with the complete pipeline used to generate the results.

    1. Background Anomaly detection in graphs is critical in various domains, notably in medicine and biology, where anomalies often encapsulate pivotal information. Here, we focused on network analysis of molecular interactions between proteins, which is commonly used to study and infer the impact of proteins on health and disease. In such a network, an anomalous protein might indicate its impact on the organism’s health.Results We propose Weighted Graph Anomalous Node Detection (WGAND), a novel machine learning-based method for detecting anomalies in weighted graphs. WGAND is based on the observation that edge patterns of anomalous nodes tend to deviate significantly from expected patterns. We quantified these deviations to generate features, and utilized the resulting features to model the anomaly of nodes, resulting in node anomaly scores. We created four variants of the WGAND methods and compared them to two previously-published (baseline) methods. We evaluated WGAND on data of protein interactions in 17 human tissues, where anomalous nodes corresponded to proteins with major roles in tissue contexts. In 13 of the tissues, WGAND obtained higher AUC and P@K than baseline methods. We demonstrate that WGAND effectively identified proteins that participate in tissue-specific processes and diseases.Conclusion We present WGAND, a new approach to anomaly detection in weighted graphs. Our results underscore its capability to highlight critical proteins within protein-protein interaction networks. WGAND holds the promise to enhance our understanding of intricate biological processes and might pave the way for novel therapeutic strategies targeting tissue-specific diseases. Its versatility ensures its applicability across diverse weighted graphs, making it a robust tool for detecting anomalous nodes.Competing Interest StatementThe authors have declared no competing interest.

      Reviewer 2. Dan Shao

      This manuscript provides an approach to highlight critical proteins within protein-protein interaction networks by Weighted Graph Anomalous Node Detection (WGAND). I see a lot of serious issues, as follows.

      1. Overall, the author submitted the article to GigaScience, so the problem he needs to solve should be the protein-disease relationship rather than anomaly detection in graphs. However, from the Abstract to the Introduction, the article always introduces the methods and applications of anomaly detection.
      2. Also, the logic of the whole article is confusing. There is a repetition of the specific method design in Methods (2.1 and 2.2). The overall program lacks method diagrams or flowcharts for explanation. In addition, the results should be in Results and not in Methods.
3. The results do not point to significant achievements and cannot fully reflect the superiority of the methods.
4. Conclusion is missing from the text.
5. The use of the English language is very awkward at times.
6. The font in some panels of some Figures (e.g., 6) is way too small.

      Re-review: Comments to the Authors The manuscript " Network-based anomaly detection algorithm reveals proteins with major roles in human tissues" triggered a positive initial impression, regarding abstract, introduction and figures, but going deeper, I see a lot of serious issues, as follows.

Methods and Results are very hard to read at times. In many cases, where tools or parameters are used without further justification, the impression is given that various choices were tried extensively until some setup gave plausible results. In this study, the authors treated an anomaly as a node that behaves differently from most of the nodes in the network. However, the basis for this assumption requires further substantiation. The authors' research is fundamentally rooted in this premise, yet it is not adequately verified in the article. In the evaluation, the authors employed non-standard parameters to validate the effectiveness of the model. For example, they used the value of 24% associated with Mendelian disease among the top 10 proteins calculated by WGAND to compare with results obtained from other models. However, is this method of comparison credible? Results contain a lot of details that I would expect to be part of Methods. Details of the model are missing in Methods. The use of the English language is very awkward at times.

Minor, nice to have:

      The font in some panels of some Figures (e.g., 2) is way too small.

      If a Figure consists of more than one part, e.g. A part, B part, each part should be explained separately.

      In the explanatory part of Figure 5, (a) (b) ... should be replaced by (A) (B) .... to maintain consistency with the figure.

    2. AbstractBackground Anomaly detection in graphs is critical in various domains, notably in medicine and biology, where anomalies often encapsulate pivotal information. Here, we focused on network analysis of molecular interactions between proteins, which is commonly used to study and infer the impact of proteins on health and disease. In such a network, an anomalous protein might indicate its impact on the organism’s health.Results We propose Weighted Graph Anomalous Node Detection (WGAND), a novel machine learning-based method for detecting anomalies in weighted graphs. WGAND is based on the observation that edge patterns of anomalous nodes tend to deviate significantly from expected patterns. We quantified these deviations to generate features, and utilized the resulting features to model the anomaly of nodes, resulting in node anomaly scores. We created four variants of the WGAND methods and compared them to two previously-published (baseline) methods. We evaluated WGAND on data of protein interactions in 17 human tissues, where anomalous nodes corresponded to proteins with major roles in tissue contexts. In 13 of the tissues, WGAND obtained higher AUC and P@K than baseline methods. We demonstrate that WGAND effectively identified proteins that participate in tissue-specific processes and diseases.Conclusion We present WGAND, a new approach to anomaly detection in weighted graphs. Our results underscore its capability to highlight critical proteins within protein-protein interaction networks. WGAND holds the promise to enhance our understanding of intricate biological processes and might pave the way for novel therapeutic strategies targeting tissue-specific diseases. Its versatility ensures its applicability across diverse weighted graphs, making it a robust tool for detecting anomalous nodes.
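For readers unfamiliar with the P@K metric used alongside AUC above, a minimal evaluation sketch (with toy labels and scores, not the authors' data) is given below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_k(labels, scores, k):
    """P@K: fraction of true positives among the K highest-scoring nodes."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.asarray(labels)[top_k].mean()

# 1 = protein with a known major tissue role, 0 = otherwise (toy data)
labels = np.array([1, 0, 1, 1, 0, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
print(precision_at_k(labels, scores, k=3))   # 2 of the top 3 are hits -> 0.667
print(roc_auc_score(labels, scores))         # ranking quality over all nodes
```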

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf034), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Yong Zhang

      This study introduces the WGAND method, an innovative weighted graph anomaly detection algorithm to identify key anomalous proteins in human tissues using machine learning techniques. Given the critical role of abnormal proteins in disease prediction and treatment, this research area is pivotal for understanding complex systems' dynamic behaviors, especially in bioinformatics. In general, this article contributes to weighted graph anomaly detection. While this study provides valuable insights and demonstrates the WGAND method's good performance and practicality, here are some suggestions and potential directions for improvement:

      1. Building on existing research, conducting a detailed performance comparison analysis between the WGAND algorithm and similar cutting-edge methods (such as OddBall, Yagada, etc.) is recommended, explicitly highlighting WGAND's advantages in anomaly detection accuracy. A series of standard metrics should be used, including but not limited to precision, recall, F1 score, and AUC curve, to quantify WGAND's effectiveness and superiority rigorously.

      2. While AUC and P@K are valuable as main evaluation metrics, introducing additional metrics such as recall, precision, and F1 score for anomaly detection tasks can provide a more comprehensive assessment of model performance.

      3. Delve into optimizing the selection of node embedding methods and edge weight estimators based on different application scenarios and explore more systematic model selection and hyperparameter optimization strategies.

      4. Investigate strategies for dynamically setting thresholds to allow the WGAND method to adapt to changes in the data environment and various task demands.

      5. Discuss the applicability of WGAND across different types of weighted graphs (such as undirected and directed graphs) and assess its generality and adaptability.

    1. Editors Assessment:

Acropora pulchra is a species of small-polyped stony coral in the family Acroporidae from the Indo-Pacific. This Data Release is the first study in stony corals to present the DNA methylome in tandem with a high-quality genome assembled utilizing PacBio long-read HiFi sequencing, sequencing an A. pulchra specimen from Mo’orea, French Polynesia. From this single-molecule sequencing data, DNA methylation was also called and quantified, and additional short-read Illumina RNASeq data were used for gene annotation. This produced an assembly of 518 Mbp in 174 scaffolds, with a scaffold N50 of 17 Mbp and 40,518 protein-coding genes called. Peer review requested some improved benchmarking, and it is impressive to see from the results that the genome assembly represents the most complete and contiguous stony coral genome assembly to date. As an important indicator species, this data will hopefully serve as a resource to the coral and wider scientific community. Further quantification of genome-wide methylation is needed to aid the study of epigenetics in non-model organisms, and specifically future analyses of methylation in coral.

*This evaluation refers to version 1 of the preprint*

    2. AbstractReef-building corals are integral ecosystem engineers in tropical coral reefs worldwide but are increasingly threatened by climate change and rising ocean temperatures. Consequently, there is an urgency to identify genetic, epigenetic, and environmental factors, and how they interact, for species acclimatization and adaptation. The availability of genomic resources is essential for understanding the biology of these organisms and informing future research needs for management and and conservation. The highly diverse coral genus Acropora boasts the largest number of high-quality coral genomes, but these remain limited to a few geographic regions and highly studied species. Here we present the assembly and annotation of the genome and DNA methylome of Acropora pulchra from Mo’orea, French Polynesia. The genome assembly was created from a combination of long-read PacBio HiFi data, from which DNA methylation data were also called and quantified, and additional Illumina RNASeq data for ab initio gene predictions. The work presented here resulted in the most complete Acropora genome to date, with a BUSCO completeness of 96.7% metazoan genes. The assembly size is 518 Mbp, with 174 scaffolds, and a scaffold N50 of 17 Mbp. Structural and functional annotation resulted in the prediction of a total of 40,518 protein-coding genes, and 16.74% of the genome in repeats. DNA methylation in the CpG context was 14.6% and predominantly found in flanking and gene body regions (61.7%). This reference assembly of the A. pulchra genome and DNA methylome will provide the capacity for further mechanistic studies of a common coastal coral in French Polynesia of great relevance for restoration and improve our capacity for comparative genomics in Acropora and cnidarians more broadly.
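As an illustration of how a genome-wide CpG methylation percentage such as the 14.6% reported above can be derived from per-site calls, here is a minimal sketch; the 5x coverage floor and the 50% per-site cut-off are illustrative assumptions, not the authors' stated parameters.

```python
def pct_cpg_methylated(site_calls, min_coverage=5):
    """Percent of sufficiently covered CpG sites called methylated,
    from per-site (methylated_reads, total_reads) tuples."""
    covered = [(m, t) for m, t in site_calls if t >= min_coverage]
    if not covered:
        return float("nan")
    methylated = sum(1 for m, t in covered if m / t >= 0.5)
    return 100.0 * methylated / len(covered)

# three covered sites, one of them methylated -> 33.3%
# (the (2, 3) site is skipped for falling below 5x coverage)
print(pct_cpg_methylated([(9, 10), (1, 10), (0, 8), (2, 3)]))
```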

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.153). These reviews (including a protocol review) are as follows.

      Reviewer 1. Yanshuo Liang

The manuscript by Conn et al. details the high-quality genome assembly of Acropora pulchra, an Acropora species of ecological and evolutionary significance, and also analyzes its genome-wide DNA methylation characteristics. These data complement the genetic resources of the Acropora genome. This manuscript is well written and represents a valuable contribution to the field. I have some comments below for the authors to address but look forward to seeing this research published. Q1: In the first sentence of the second paragraph of the Context: This is the first study to utilize PacBio long-read HiFi sequencing to generate a high quality genome with high BUSCO completeness, in tandem with its DNA methylome for scleractinian corals. Language such as "new", "first", "unprecedented", etc., should be avoided because it often leads to unproductive controversy. As far as I know, the genome you assembled is not the first stony coral to be sequenced using PacBio long-read HiFi sequencing. Back in 2024, He et al. assembled Pocillopora verrucosa (Scleractinia) to the chromosome level using PacBio HiFi long-read sequencing and Hi-C technology. Here I would suggest you please rephrase. Reference: He CP, Han TY, Huang WL, et al. Deciphering omics atlases to aid stony corals in response to global change, 11 March 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4037544/v1]. Q2: In this sentence: “On 23 October 2022, sperm samples were collected from the spawning of A.pulchra and preserved in Zymo DNA/RNA shield.” Please change “A.pulchra” to “A. pulchra”. Q3: Please change all “k-mer” into “k-mer” in the manuscript. Q4: Please change “Long-Tandem Repeats” to “Long Terminal Repeats”. Q5: In this sentence: “Funannotate train uses Trinity [18] and PASA [19] for ab initio predictions. Funannotate predict was then run to assign gene models using AUGUSTUS [20], GeneMark [21], and Evidence Modeler [19] to estimate final gene models.” Please give the versions of these software tools. Q6: References after [20] do not correspond well in the manuscript; please check!

Reviewer 2. Jason Selwyn

Is the language of sufficient quality? Yes. There are some minor grammatical issues throughout that warrant a closer reading to correct. E.g. Abstract: "...urgency to identify how genetic, epigenetic, and environmental...", "...management and and conservation...". Context: "...we aim to provide..." etc. Are all data available and do they match the descriptions in the paper? Yes. The link to the OSF repository in the PDF did not work. However, the link to the OSF repository from the GitHub did work. Is the data acquisition clear, complete and methodologically sound? No. It isn't mentioned in the manuscript where the RNAseq data used to annotate the genome is from, nor any quality filtering steps that may have been applied to the RNA data prior to its use for annotation. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. Excluding the above comment about the RNA data. Additional Comments: This is a well assembled and annotated genome that will contribute to the growing database of Acropora genomes. The manuscript could do with a simple pass to identify and correct some relatively minor grammatical issues and inconsistencies (Table 1 includes a thousands comma separator in some instances and not others) and needs to include details about the source of the RNA data used to train the ab initio gene predictors. There also appears to be a problem with the citation numbering after 20.

      **Reviewer 3. Benjamin Young** Are all data available and do they match the descriptions in the paper? Yes. Raw reads, metadata, and genome assembly are publicly available and have an NCBI project number in which they are all linked. Is the data acquisition clear, complete and methodologically sound? Yes. Collection of sperm samples, HMW DNA extraction, and SMRT Bell library prep are written clearly. I have asked for a few clarifications on wording in this section in the attached edited PDF document. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. I think the pipeline used for de-novo genome generation (including raw read cleaning and assembly), repeat masking, and gene prediction and annotation is of high quality and follows best practices. With the inclusion of the GitHub and all analysis scripts, it is possible to reproduce the assembly generated. Is there sufficient data validation and statistical analyses of data quality? Yes. This is not super relevant for a genome assembly paper, so I have no additional comments here. Is the validation suitable for this type of data? Yes. The authors use tools such as GenomeScope2 and BUSCO for validation of their data. It would be nice to see the tool they used to identify N50 and L50 (maybe QUAST) included in the methods (a minimal N50/L50 sketch follows this review). Additionally, I would like to see a Merqury analysis of the hifiasm primary and alternate assemblies to show that duplicate purging was successful. Additional Comments: I would first like to commend the authors for a well assembled genome resource for a coral species that will be greatly beneficial to the wider coral and scientific community. I have provided a PDF with comments throughout for the authors to address. The majority of these are easy fixes, including things such as sentence structure, inconsistent capitalisation of subheadings, additional references for methods, clarification of statements, and other suggestions. I do have a few larger requests for this to be published, and these are the reasons for selecting the major revision option, as there may need to be figure updates and quick additional analyses run. 1. Can you please correct the verbiage around the BUSCO analysis throughout the manuscript. It is often stated "BUSCO completeness of xx%". BUSCO doesn't directly measure completeness, but rather completeness of single-copy orthologs against a specific database. I have left comments throughout on potential rewording for these instances. Please also specify the exact database you used (i.e. odb10_metazoa). Finally, can you please be more specific when stating BUSCO results; specifically, when you use 96.9% this is single-copy and duplicated complete BUSCOs. I have left comments in the PDF again for this. 2. In the results for the Genome Assembly section, can you please include results (i.e. length, N50, L50, number of contigs/scaffolds) for the primary assembly and the scaffolded assembly. 3. I think it would not be much work, and would provide additional information to show successful duplicate purging, to run a Merqury analysis on the primary and alternate assemblies from hifiasm. 4. Can you include some additional information in the "Structural and Functional Annotation" section. Specifically, can you provide information on the results from the Funannotate predict step, and then how Funannotate update improved this (if at all). 5. Please double check the methods section for Funannotate. From reading the Funannotate documentation I think there may be some confusion on what each step (train, predict, update, annotate) is doing. I have provided comments in the PDF to help clarify, and have also linked the Funannotate documentation. 6. On NCBI I see that an additional Acropora pulchra genome has just been made available (29th Jan 2025), assembled to chromosome level (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_965118205.1/). I think it would be prudent to include this assembly's statistics in your Table 1, and also run a BUSCO analysis on this other assembly to compare with yours. While they got to chromosome level, you do have markedly fewer contigs. I do not think this is necessary for this manuscript, but in future work you could look to use their chromosome assembly to get your scaffolded assembly to chromosome level. Again, I want to say this is a wonderful resource for the coral and wider scientific community, and the pipeline for de-novo assembly and annotation follows best practices in my opinion. Annotated additional file: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTk0L2Nvbm5ldGFsMjAyNV9yZXZpZXdjb21tZW50cy5wZGY=
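      For readers less familiar with these contiguity metrics, the following is a minimal, illustrative Python sketch of how N50 and L50 can be computed from a list of contig or scaffold lengths. It is not the authors' pipeline and not QUAST; the toy lengths are invented.

```python
# Illustrative only: N50 is the length of the shortest contig in the
# smallest set of longest contigs that together cover >= 50% of the
# assembly; L50 is the number of contigs in that set.

def n50_l50(lengths):
    """Return (N50, L50) for a list of contig/scaffold lengths."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("no contig lengths supplied")

if __name__ == "__main__":
    toy_contigs = [5000, 4000, 3000, 2000, 1000]
    print(n50_l50(toy_contigs))  # -> (4000, 2): 5000 + 4000 covers half of 15000
```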

      Re-review:

      The authors have addressed all my comments and queries, and included nearly all recommendations. Thank you! A few quick notes to fix before publication -
      

      "The input created Funannotate train uses Trinity v.2.15.2 [22] and PASA v.2.5.3 [23] for transcript assembly prior to ab initio predictions". This sentence reads weird, reword before publishing. I think maybe just remove "created Funannotate train" and then it reads correctly. Or "Funnannotate trains uses .....". - "PFAM v.37.0 [28], CAZyme [29], UniProtKB v[30] and GO [31]." Missing a few version numbers, and UniProt just has a v. - "The mitochondrial genome was successfully assembled and circularized using MitoHifi v3.2.2 The final assembled A. pulchra mitogenome is". Just missing a period i think before "The final assembly". Great job and a very useful resource for the coral community !!

    1. Abstract: Microbiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field are often based on a microbiome’s species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity, revealing substantial restrictions of taxonomy-reliant approaches. In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, here in focus, inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnean phylogenetic classification towards a more holistic consideration of the metabolic functional potential underlying ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI) and the high dimensional principal component analysis (hiPCA) methods, as well as to the standard taxon- and function-based Shannon entropy scoring. After demonstrating better performance on the initially targeted IBD cohorts, in comparison with other methods, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index’s ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth. Overall, we emphasize the potential of this metagenomic approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health, as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Saritha Kodikara

      In this study, the authors present a novel metagenomic health index designed to differentiate between healthy and unhealthy microbiomes. This area of research is crucial for developing a non-invasive, cost-effective method to assess patient health status. However, I have several suggestions that I believe will enhance the study and address some key points.

      Main Comments:

      1.) The study would benefit from additional post-analysis to provide greater depth. Although the authors applied their approach to several diseases, they did not elaborate on the significance of individual microbiome features across different diseases. For instance, the GMHI parameters were identified as least important in IBD—does this observation hold universally across all diseases analysed?

      2.) The index Q2PD performed worse on AGP1 compared to HMP2 and AGP2. Is there a specific reason for this discrepancy? For example, does the index underperform in the heterogeneous functional landscape presented in AGP1 (Figure 2C)? An explanation for the reduced performance in this cohort would provide valuable insight into the method's performance under varying conditions.

      3.) It would be beneficial to make all processed data and relevant scripts available in a GitHub repository to ensure that the results presented in the paper can be replicated by other researchers.

      4.) When attempting to run the script available at https://github.com/Kizielins/q2-predict-dysbiosis, I encountered an error related to the scikit-learn version. The script appears to be compatible with version 1.2.2, whereas I was using version 1.4.2. Please consider updating the script or providing instructions for resolving version compatibility issues.
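      The version mismatch described here is the usual cause of pickle-compatibility failures in scikit-learn (the same mismatch typically surfaces as the "node array from the pickle has an incompatible dtype" error reported by Reviewer 1 below). A minimal defensive sketch, not part of Q2PD, assuming a hypothetical model.pkl and an expected version of 1.2.2:

```python
# Hedged sketch, not part of Q2PD: fail fast when the installed
# scikit-learn does not match the version the bundled model was
# serialized with. "model.pkl" and the 1.2.2 pin are illustrative.
import pickle
import sklearn

EXPECTED_SKLEARN = "1.2.2"  # version used to train/pickle the model

if sklearn.__version__ != EXPECTED_SKLEARN:
    raise RuntimeError(
        f"scikit-learn {sklearn.__version__} installed, but the model was "
        f"serialized with {EXPECTED_SKLEARN}; run "
        f"'pip install scikit-learn=={EXPECTED_SKLEARN}' or retrain."
    )

with open("model.pkl", "rb") as fh:
    model = pickle.load(fh)
```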

      5.) The rationale behind considering only positive correlations when calculating the index is unclear. It would be helpful to clarify why negative correlations were excluded from the index calculations.

      6.) In analysing longitudinal alterations, did the authors account for dependencies of the Q2PD index on previous time points? If not, how do these longitudinal alterations differ from those observed in independent studies?

      7.) For each dataset analysed, additional details would be useful, such as the number of samples, species, functions, core functions, and the number of species remaining after applying the MDFS algorithm.

      8.) On Page 13, the authors state that they chose GMHI as their benchmark because hiPCA and Shannon entropy produced worse results for the HMP2 cohort. However, Supplementary Table 3 indicates that Shannon entropy had a lower p-value than GMHI in the Mann-Whitney U test.

      Minor comments:

      1) Page 11 Original: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 5b)." Suggested: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 4b??)."

      2) Page 12 Original: "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts." Suggested: "Most importantly, Q2PD produced visually the highest median?? scores for all healthy in comparison to unhealthy cohorts."

      3) Page 12 Original: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in HMP2" Suggested: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in AGP2??"

      4) Page 14 Original: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 7)." Suggested: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 8??)."

    2. Abstract: Microbiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field are often based on a microbiome’s species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity, revealing substantial restrictions of taxonomy-reliant approaches. In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, here in focus, inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnean phylogenetic classification towards a more holistic consideration of the metabolic functional potential underlying ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI) and the high dimensional principal component analysis (hiPCA) methods, as well as to the standard taxon- and function-based Shannon entropy scoring. After demonstrating better performance on the initially targeted IBD cohorts, in comparison with other methods, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index’s ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth. Overall, we emphasize the potential of this metagenomic approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health, as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Vanessa Marcelino

      The manuscript proposes a new method to distinguish between healthy and diseased human gut microbiomes. The topic is timely, as to date, there is no consensus on what constitutes a healthy microbiome. The key conceptual advance of this study is the integration of functional microbiome features to define health. Their new computational approach, q2-predict-dysbiosis (Q2PD), is open source and available on GitHub.

      While the manuscript is conceptually innovative and interesting for the scientific community, there are several major limitations in the current version of this study.

      1. To develop the Q2PD, they define features associated with health by comparing healthy microbiomes with microbiome samples from IBD patients. There are many more non-healthy/dysbiotic phenotypes beyond IBD, therefore it is not accurate to use IBD as synonymous with dysbiosis, as done throughout this version of the paper.

      2. The study initially tests the performance of Q2PD against other gut microbiome health indexes (GMHI and hiPCA) using the same data that was used to select the health-associated features of Q2PD. Model performance should be assessed on independent data. In a separate analysis, they do use different datasets (from GMHI and hiPCA), but these datasets seem to be incomplete - the GMHI and hiPCA publications included 10 or more disease categories, and it is unclear why only 4 categories are shown in this study.

      3. While Q2PD does provide visible improvements in differentiating some diseases from healthy phenotypes, the accuracy and sensitivity of Q2PD aren't clear. To adopt Q2PD, I would like to know the chances that the classification results will be correct.

      4. There is very little documentation on how to use Q2PD. What are the expected outputs, for example? Do we need to choose a threshold to define health? Is the method completely dependent on Humann and Metaphlan outputs, or are other formats accepted? The test data contain some samples with zero counts. I got an error when trying it with the test data (ValueError: node array from the pickle has an incompatible dtype…).

      Therefore, I recommend including a range of disease categories to develop Q2PD and using independent datasets to validate the model in terms of accuracy and sensitivity. Alternatively, consider focusing this contribution on IBD. Making the code more user-friendly will drastically increase the adoption of Q2PD by the community.

      Please also use page and line numbers when submitting the next version. Other suggestions:

      Abstract: I recommend replacing 'attributed' with 'linked', as 'attributed' suggests that dysbiosis may be causing (rather than reflecting) disease.

      Results: Please indicate what is meant by 'function' here - it would be good to clarify that this method uses Metaphlan's read-based approach to identify metabolic pathways. What is used, pathway completeness or abundance?

      Results regarding Figure 3a are difficult to interpret. Is 'non-negatively correlated' the same as 'positively correlated'? What does the colour gradient represent - their abundance in those groups, or the strength of their correlation?

      "We observed that the prevalence of the pairs positively correlated in health was higher than in a number of disease-associated groups (Figure 3b)" . This is a very generalised statement considering that only half of the comparisons were significant. How co-occurring species were selected?

      "To test this, we compared the contributions of MDFS-identified species to "core functions" in different groups (Supplementary Figure 4)." How was this comparison made, based on species correlations? The caption of these figures could include more detail - it just says 'Top species contributions to functions.' but how do you define 'top' ? What do the colours represent?

      'This finding was congruent with our earlier suspicions of functional plasticity; modulation of function and thus altered connectivity in the interaction network, shifting towards less abundant, non-core functions upon perturbation of homeostasis.' This is reasonable, but I don't understand how you can draw this conclusion from these figures where there seems to be no significant difference between health and disease.

      Section 'Testing q2-predict-dysbiosis, GMHI and hiPCA accuracy of prediction for healthy and IBD individuals'

      What is the difference between the fraction of "core functions" found and the fraction of "core functions" among all functions?

      "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts" . This was not statistically significant. In fact, GMHI finds more significant differences between health and disease than Q2PD.

      Sup. Figure 7 - it would be informative to add the names/descriptions of these metabolites, not just their IDs.

      'Although the threshold of 0.6 as determinant of health by the Q2PD was not applicable to the new datasets'. Does the threshold to define health with Q2PD change depending on the dataset? What are the implications of this for the applicability of this index?

      Effects of sequencing depth - this is a very good addition to the paper, the effects of sequencing depth can be profound but are ignored in most studies, so I commend the authors for doing this here. It would be even better, in my opinion, if this was done with the same datasets used to test/compare Q2PD with other methods, as using a different dataset here adds a new layer of confounding factors.

      'the GMHI and the hiPCA produced the opposite trend, wrongly indicating patient recovery.' The difference here is striking, what is driving this trend?

      The Gut Microbiome Wellness Index 2 (GMWI2) is now published. I don't think it needs to be part of the benchmarking, but it could be acknowledged/cited here.

      Methods: More information on how the data was processed is needed - how were the abundance tables normalized? Which output from Humann was used for downstream analyses?

      To ensure reproducibility, please provide the scripts/code used for analyses and figures.

    1. Abstract: Background: Spiders generally exhibit robust starvation resistance, with hunting spiders, represented by Heteropoda venatoria, being particularly outstanding in this regard. Given the challenges posed by climate change and habitat fragmentation, understanding how spiders adjust their physiology and behavior to adapt to the uncertainty of food resources is crucial for predicting ecosystem responses and adaptability. Results: We sequenced the genome of H. venatoria and, through comparative genomic analysis, discovered significant expansions in gene families related to lipid metabolism, such as cytochrome P450 and steroid hormone biosynthesis genes. We also systematically analyzed the gene expression characteristics of H. venatoria at different starvation resistance stages and found that the fat body plays a crucial role during starvation in spiders. This study indicates that during the early stages of starvation, H. venatoria relies on glucose metabolism to meet its energy demands. In the middle stage, gene expression stabilizes, whereas in the late stage of starvation, pathways for fatty acid metabolism and protein degradation are significantly activated, and autophagy is increased, serving as a survival strategy under extreme starvation. Additionally, analysis of expanded P450 gene families revealed that H. venatoria has many duplicated CYP3 clan genes that are highly expressed in the fat body, which may help maintain a low-energy metabolic state, allowing H. venatoria to endure longer periods of starvation. We also observed that the motifs of P450 families in H. venatoria are less conserved than those in insects, which may be related to the greater polymorphism of spider genomes. Conclusions: This research not only provides important genetic and transcriptomic evidence for understanding the starvation mechanisms of spiders but also offers new insights into the adaptive evolution of arthropods.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf019), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Sandra Correa-Garhwal

      The manuscript "Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider" uses comparative genomics to study the underlying mechanisms of starvation resistance. I appreciate that the authors have produced a high-quality genome for an RTA species. The methods are sound and some interesting gene families are highlighted as key factors in starvation resistance.

      One primary concern I have relates to the study's setup and hypothesis. As currently written, the study comes across as a fishing expedition rather than a focused research project. Although the introduction is informative, it lacks a clear rationale for including this particular species. The reasoning only becomes apparent at the end of the gene family expansion and contraction section. Additionally, I am unsure whether being an active hunter makes feeding more unpredictable compared to web-based prey capture. I recommend incorporating this information into the introductory paragraph to better establish the context for the analysis. While terms like "autophagy" and "energy homeostasis" are appropriate for a scientific audience, consider briefly defining them for clarity, especially if the intended audience might not be familiar with all the terminology. Although the authors mention that there is no high-quality genome sequence for H. venatoria, it could be helpful to elaborate on why this is significant for understanding starvation resistance. A brief explanation of how genomic data could enhance understanding of the molecular mechanisms involved would strengthen this point. The conclusion provides a clear goal for your study, but it could be more impactful. You might want to emphasize the broader implications of your research findings for ecological conservation and biodiversity. End with a statement about the importance of understanding these mechanisms in the context of preserving ecosystems and addressing challenges posed by climate change.

      For the discussion, while the content is detailed, some parts feel slightly repetitive or could be more concise. For instance, the description of P450 gene expression could be streamlined by removing redundant mentions of their role in metabolic rate regulation. Example: In the discussion section "Interestingly, we found that some P450 families are expanded in H. venatoria, and most P450 genes are more highly expressed in the fat body than in other tissues…" This point is later reiterated in the sentence about other spider species. These ideas could be combined for efficiency. The paragraph about the phylogenetic analysis of the CYP3 clan could be shortened. While it is an interesting finding, some of the details (like the number of genes or proteins) might be better suited for the main text rather than a summary. Focusing more on the functional implications of these duplications would keep the reader engaged. Though the findings are well-explained, the broader significance could be emphasized more explicitly. For example, why is understanding these mechanisms important for the field of arachnid biology, evolutionary biology, or even practical applications (e.g., pest control, conservation)? You could add a closing sentence that ties everything together and highlights the broader relevance of the findings, such as the evolutionary or ecological importance of these adaptations in spiders.

      Other comments: Last paragraph of the introduction: When introducing Heteropoda venatoria, please spell out the species name the first time it is used. The sentence "However, these findings indicate that H. venatoria does not feed in a stable manner and often experiences periods of starvation." does not fit the rest of the text. Findings from what study? Transcription design for starvation resistance in H. venatoria section: First sentence: What samples? It is confusing to start like this; please add information about the samples. You could delete "the samples of H. venatoria were subjected to" and it will read better. Are all 23 CYP3 clan genes on chromosome 4 tandemly arrayed? Figure 4 - add more information about the figure. For panel C, what do the red lines show? Grey? The numbers in the circles? While I know what they represent, other readers might not. The finding that H. venatoria chromosomes have undergone lots of chromosomal fragmentation is very interesting, and it is clearly shown in the figure, which is why I think more detail is needed. In the sentence "In Uloborus diversus, members of this subfamily are located on Chr5 and an unanchored scaffold." you need to specify which members. Figure 5 - include a description of the tissues. What is Epi? Ducts? Tail?

    2. Abstract: Background: Spiders generally exhibit robust starvation resistance, with hunting spiders, represented by Heteropoda venatoria, being particularly outstanding in this regard. Given the challenges posed by climate change and habitat fragmentation, understanding how spiders adjust their physiology and behavior to adapt to the uncertainty of food resources is crucial for predicting ecosystem responses and adaptability. Results: We sequenced the genome of H. venatoria and, through comparative genomic analysis, discovered significant expansions in gene families related to lipid metabolism, such as cytochrome P450 and steroid hormone biosynthesis genes. We also systematically analyzed the gene expression characteristics of H. venatoria at different starvation resistance stages and found that the fat body plays a crucial role during starvation in spiders. This study indicates that during the early stages of starvation, H. venatoria relies on glucose metabolism to meet its energy demands. In the middle stage, gene expression stabilizes, whereas in the late stage of starvation, pathways for fatty acid metabolism and protein degradation are significantly activated, and autophagy is increased, serving as a survival strategy under extreme starvation. Additionally, analysis of expanded P450 gene families revealed that H. venatoria has many duplicated CYP3 clan genes that are highly expressed in the fat body, which may help maintain a low-energy metabolic state, allowing H. venatoria to endure longer periods of starvation. We also observed that the motifs of P450 families in H. venatoria are less conserved than those in insects, which may be related to the greater polymorphism of spider genomes. Conclusions: This research not only provides important genetic and transcriptomic evidence for understanding the starvation mechanisms of spiders but also offers new insights into the adaptive evolution of arthropods.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf019), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Hui Xiang

      In this study, the authors deciphered the chromosome-level genome of an RTA spider, Heteropoda venatoria, with large body size, and generated comprehensive comparative transcriptomes of fat body and whole body between CK (control) and starvation states. Generally, this study added important genomic and transcriptomic data for spiders and provided some cues for understanding the molecular changes during starvation. However, the organization of the manuscript is quite problematic. 1. As to the Results section, please be concise and highlight the main results, avoiding accumulating complex results. Do not present too many statements in the manner of introduction and discussion in Results. Do not raise too many hypotheses in the Results. 2. As for the involvement of the Hippo signaling pathway in lipid metabolism regulation, the cited literature and mentioned genes are not related to the results of this study. As for the analysis of the P450 results, the descriptions of the structural analysis are quite complex and difficult to understand. The authors did not clearly explain the relationship between the expansion of P450 genes and starvation resistance in the results of this study. 3. The authors' analysis of the DEG enrichment results in the transcriptome analysis is confusing. Firstly, I can't agree with the authors in that "During the early stage of starvation (from CK to 2 W), many genes, specifically those involved in oxidative phosphorylation and thermogenesis pathways, were up-regulated (Fig. 2E). These findings indicate that during the early starvation stage, energy metabolism in H. venatoria occurs regularly, with sufficient supply of energy." There are a batch of DEGs between 2W and CK, and many of the enriched pathways are neurodegeneration-related. How can these changes be explained? Secondly, as to 4W to 8W, I cannot understand the relationship of the down-regulation of the Hippo signaling pathway to the authors' speculation that "H. venatoria may reduce its cellular glucose uptake and utilization to adjust to the food-scarce environment.", as this pathway is involved in lipid metabolism, as the authors stated. Thirdly, from 14 W to 19 W, pathways such as lysosome and apoptosis were down-regulated instead of up-regulated. So how did the authors conclude that autophagy became more active? 4. "We speculate that during the evolution of spider genomes, two types of repeat sequences, TcMar and LTR sequences, had a significant impact on the size of spider genomes. Interestingly, we found that in H. venatoria chromosomes, regions with a high proportion of repeats also presented an increase in GC content (Fig. 1B)" The authors' conclusion that high-repeat regions have higher GC content is based on Fig. 1B alone, which is too arbitrary. It needs more solid evidence and more detailed analysis: for example, the GC content of TE regions could be compared with that of the whole genome and with that of gene regions (see the sketch after this review). The significance of the relevant results should be explained. In addition, the authors should make a more convincing discussion of this result based on more of the literature. 5. "We gathered genomic data and annotations for one scorpion and seven chromosome-level spider genomes using the scorpion as an outgroup [35-42]". Many spider genomes have been published at the chromosomal level. What were the principles behind the spider genomes the authors selected in this study? 6. "Transcriptome design for starvation resistance in H. venatoria" in Results should be partially moved to Methods, and here the authors should straightforwardly highlight the results. 7. I can't understand the significance of Fig 2C. The authors did not explain it in the manuscript, either. 8. "The PCA results from both the fat body and whole-body transcriptomes indicated that the H. venatoria transcriptome at 19 weeks of starvation was markedly distinct from that at other stages (Fig. 2A, B). Consequently, we conducted a differential analysis of the transcriptome at 19 weeks." Please clarify how the comparative transcriptome analyses were conducted. 9. The language should be polished.

  8. Mar 2025
    1. Editors Assessment:

      As volumes of viral and bacterial sequence data grow exponentially, the field of computational phylogenetics now demands resources to manage the burgeoning scale of this input data. This study introduces CompactTree, a C++ library designed for ultra-large phylogenetic trees with millions of tips. To address these scalability issues while remaining easy to incorporate into external code bases, CompactTree is a header-only library with minimal dependencies, an optimized node representation, and a memory-efficient tree structure, resulting in significantly reduced memory footprints and improved processing times. Peer review requested more detail on the functionality and some real-world examples demonstrating the current utility of the tool. Although it primarily supports the (text-based) Newick format, its scalability and extensibility hold promise for multiple biological and epidemiological applications, with potential support for more complex formats such as Nexus and NeXML. The tool is open source (GPLv3 licensed) and available at: https://niema.net/CompactTree

      This evaluation refers to version 1 of the preprint

    2. Abstract: Motivation: The study of viral and bacterial species requires the ability to load and traverse ultra-large phylogenies with tens of millions of tips, but existing tree libraries struggle to scale to these sizes. Results: We introduce CompactTree, a lightweight header-only C++ library for traversing ultra-large trees that can be easily incorporated into other tools, and we show that it is orders of magnitude faster and requires orders of magnitude less memory than existing tree packages. Availability: CompactTree can be accessed at: https://github.com/niemasd/CompactTree Contact: niema{at}ucsd.edu Supplementary information: Supplementary data are available at Bioinformatics online.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.152). These reviews (including a protocol review) are as follows.

      Reviewer 1. Jeet Sukumaran

      Is the documentation provided clear and user friendly? Yes. Excellent documentation. A pleasure to read. Are there (ideally real world) examples demonstrating use of the software? No.

      Reviewer 2. Ziqi Deng

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. I was able to run all the tests and use the CompactTree C++ library correctly, except for encountering an installation issue when installing the Python wrapper via pip install CompactTree.

      Are there (ideally real world) examples demonstrating use of the software? Yes. CompactTree provides examples of simulated trees for testing in comparison with other peer packages. Meanwhile, it mentions its ability to load the ~22M-node Greengenes2 tree. It would be great to see the test workflow so users can verify this.

      Additional Comments: CompactTree is aimed at a very specific task, that of loading large phylogenetic trees with millions of nodes. The results show that it is significantly faster than other peer tools, not only in loading but also in traversing trees, with lower peak memory usage. It also includes the test workflow for users to repeat the comparison against other peer tools.

      Reviewer 3. Giorgio Bianchini

      Is the language of sufficient quality? Yes. It is slightly confusing that the paper is written using plural pronouns ("We"), when there is a single author.

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? No. The statement of need is present; however, it does not clearly explain what kinds of problems the software will be able to solve, beyond generic statements about addressing scalability issues. The aims of the library should be explored in more detail: as noted by the author, this library offers great speed and efficiency, but at the cost of reduced flexibility and functionality compared to other tools. Speed and efficiency are always good things, but what does the library actually do? A very fast library that does nothing is not particularly useful. So, what specific analyses does CompactTree allow, that would be impractical using other tools? For example, they could select a case study from the literature, where the analyses were limited by the algorithm, and use their library to extend the analysis to a larger dataset. The author mentions clustering, ancestral state reconstruction, and transmission risk prediction as examples of analyses that involve tree traversals, so they could start here (although I am not convinced that the efficiency of the tree representation is the computational bottleneck in these cases). The results should also be briefly mentioned in the abstract. Furthermore, the author mentions a number of packages used to analyse trees, but these are all Python packages. Since CompactTree is presented as a C++ library, this seems odd; other tools and programming languages should be mentioned/compared. For example, “ape” and “phytools” are very popular R packages, while “Bio++” is another C++ library; a literature review (or a simple web search) may reveal other such libraries. Also, the reference given for bp (“[4]”) is incorrect.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. Everything works fine if the header is included in a single source file, but if multiple distinct files contain the #include statement, a multiple-definition error occurs at link time (non-template free functions defined in a header need to be declared inline to satisfy the one-definition rule). In a real-world application, the library would reasonably need to be included in multiple source files, so this should be fixed.

      Is the documentation provided clear and user friendly? Yes. The documentation "Cookbook" is very nicely organised.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No. While the author compares CompactTree to a number of Python packages, no comparison is made against tools that use other programming languages. In particular, the author states that there is no C++ library for loading and traversing phylogenetic trees; however, as I mentioned, at least Bio++ exists and appears to be reasonably well cited. Furthermore, the memory plot does not account for baseline memory usage. This is evident in the first two datapoints (n=100 and n=1,000) for each tool, which show a very small difference despite the leaf count increasing by an order of magnitude. If the first datapoint is subtracted from all subsequent datapoints, the memory plot looks quite similar to the other plots. If you re-run the benchmarks to include other tools, I would suggest including a "control" datapoint with a very small n (or even loading the library without opening a tree) and subtracting this from all other datapoints; this will provide an estimate of the memory actually used to load the trees (see the sketch below).
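      A minimal Python sketch of the suggested correction, with invented numbers: treat the smallest-n run as a baseline for interpreter and library overhead, and subtract it from every datapoint.

```python
# Illustrative only: all measurements below are invented.
leaf_counts = [100, 1_000, 10_000, 100_000, 1_000_000]
peak_mib = [52.1, 52.4, 55.0, 83.7, 371.2]  # hypothetical peak RSS per run

baseline = peak_mib[0]  # overhead estimate from the smallest-n "control"
tree_only = [m - baseline for m in peak_mib]

for n, mib in zip(leaf_counts, tree_only):
    print(f"{n:>9} leaves: ~{mib:.1f} MiB attributable to the tree itself")
```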

      Are there (ideally real world) examples demonstrating use of the software? No. As I mentioned above, having at least one example demonstrating an analysis that is significantly improved by the use of this library would be beneficial. Discussion of the improvements should also consider usability trade-offs in a real-world scenario.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No.

      Additional Comments: The library looks promising and is reasonably well documented, the only two things that are really missing are a real-world practical application and a comparison with other relevant alternatives (especially Bio++). A large portion of the manuscript is spent describing how the library could be improved, rather than what it can currently do. This could be summarised in just one or two sentences, thus leaving more space for describing the real-world example.

    1. Editors Assessment:

      The Visayan spotted deer (Rusa alfredi) is a small, endangered, primarily nocturnal species of deer found in the rainforests of the Visayan Islands in the Philippines. The present study reports the first draft genome assembly for the species, addressing a critical gap in genomic data for this IUCN-redlisted cervid. Based on Illumina sequencing, the resulting genome assembly spans 2.52 Gb, with a BUSCO completeness score of 95.5%, and encompasses 24,531 annotated genes. Phylogenetic analysis indicates a close evolutionary relationship between R. alfredi and Cervus species, suggesting that the genus Rusa is sister to Cervus. Peer review teased out more benchmarking results and the annotation files, demonstrating that this genomic resource is useful and usable for advancing population genetics and evolutionary studies, thereby informing conservation strategies and enhancing breeding programs for this critically threatened species. Providing whole genome sequences for other native Rusa species could further provide genomic resources for detecting hybrids, which will also help the management and monitoring of these species, especially for the reintroduction of captive populations into the wild.

      This evaluation refers to version 1 of the preprint

    2. ABSTRACT: The Visayan Spotted Deer (Rusa alfredi) is an endangered and endemic species in the Philippines facing significant threats from habitat loss and hunting. It is considered the world’s most threatened deer species by the International Union for Conservation of Nature (IUCN), and thus its conservation has been a top priority. Despite its status, there is a notable lack of genomic information available for R. alfredi and the genus Rusa in general. This study presents the first draft genome assembly of the Visayan Spotted Deer (VSD), Rusa alfredi, using Illumina short-read sequencing technology. The RusAlf_1.1 assembly has a 2.52 Gb total length with a contig N50 of 46 Kb and a scaffold N50 of 75 Mb. The assembly has a BUSCO complete score of 95.5%, demonstrating the genome’s completeness, and includes the annotation of 24,531 genes. Phylogenetic analysis based on single-copy orthologs reveals a close evolutionary relationship between R. alfredi and the genus Cervus. The availability of the RusAlf_1.1 genome assembly represents a significant advancement in our understanding of the VSD. It opens opportunities for further research in population genetics and evolutionary biology, which could contribute to more effective conservation and management strategies for this endangered species. This genomic resource can help ensure the survival of Rusa alfredi in the country.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.150). These reviews (including a protocol review) are as follows.

      Reviewer 1. Endre Barta

      Are all data available and do they match the descriptions in the paper? No. The authors provided only the assembly in Fasta and GenBank format and the contigs (scaffolds?) in GenBank format. Neither the annotation nor the raw Illumina reads are available.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. In the cases where the data is uploaded, the provided metadata is consistent.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. The exact parameters used during the processing are completely missing. For example, it is unclear how the RagTag-based correction and scaffolding were carried out.

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise

      Is the validation suitable for this type of data? No. Without having the raw Illumina reads and the exact command line parameters used, it is not possible to validate the provided results.

      Additional Comments:
      

      Assembling the reference genomes of endangered species is a task of immense importance, with the potential to significantly advance our understanding and conservation of these species. This work provides an initial genome assembly based on Illumina short-read sequencing. The correction and scaffolding of the contigs were made with the RagTag program using the red deer PacBio-based chromosome-level assembly. The potential benefits of this work are vast, from gaining knowledge to initiating and furthering population studies to preserve the species. According to the annotation and the BUSCO analysis, the final assembly seems especially good, considering that it is short-read based. However, there are some concerns about the methodology and the provided data. 1. The Illumina short reads and the annotation data (GFFs, VCFs) are not available. 2. The methods used are not reproducible because descriptions of the exact parameters are missing. 3. It seems that the authors did not use the ‘-r’ parameter during the scaffolding, which resulted in inserting 100 bp of Ns instead of insertions of the actual size estimated from the red deer reference genome. 4. There is no K-mer based genome size estimation. 5. The chromosome number is not known. Is there any chromosomal rearrangement between the red deer and the Visayan Spotted Deer? 6. It is not justified why the protein- and mitochondria-based trees are drawn as cladograms and not as phylograms; this way, the actual distances between the different species cannot be seen. 7. Although the short reads were mapped back to the assembly, no variation data are provided. 8. Is it necessary to include this high number (46,104) of short (<1,000 bp) contigs in the assembly? 9. Although the red deer assembly was used for the correction and scaffolding, the annotation was compared to the mule deer.

      Re-review: I thank the authors for their efforts to address the concerns raised. I broadly agree with the answers, but three further details need clarification: 1. Dividing the raw read total by the resulting genome size yields a coverage of about 62x. The authors mapped the raw reads back to the resulting reference genome sequence, which gave 47x coverage. However, both GenomeScope and Merqury K-mer analyses showed 22x coverage. What is the reason for this discrepancy? (A back-of-envelope reconciliation is sketched below.) 2. The K-mer analysis does indeed, and a bit strangely, show what appears to be a haploid genome. However, the 0.302% heterozygosity measured by GenomeScope is not remarkably low. To get an accurate picture of this, it would be important to count the number of heterozygous sites based on the raw reads mapped back at 47x coverage. 3. Although we do not know the exact chromosome number, aligning the assembly to the red deer reference could be interesting. It would show how many scaffolds map to more than one red deer chromosome. Of course, this could be due either to chromosome rearrangement or to incorrect scaffolding or assembly of the contigs.
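      One possible, hedged reconciliation of point 1, sketched in Python with assumed values: a read of length L contributes L - k + 1 k-mers, so k-mer coverage sits below base coverage, and GenomeScope's reported peak is the haploid (per-copy) coverage, roughly half the total k-mer coverage for a diploid. The read length and k below are assumptions, not values taken from the paper.

```python
# Hedged back-of-envelope (not from the paper): read length and k are
# assumed, chosen only to illustrate the expected relationship between
# base coverage and the k-mer coverage peak reported by GenomeScope.
read_len = 150   # assumed Illumina read length (bp)
k = 21           # assumed k-mer size
base_cov = 47.0  # coverage of reads mapped back to the assembly

# Each read of length L contributes L - k + 1 k-mers, so k-mer coverage
# is lower than base coverage by a factor of (L - k + 1) / L.
kmer_cov = base_cov * (read_len - k + 1) / read_len

# GenomeScope reports the haploid (per-copy) peak, roughly half the
# total k-mer coverage for a diploid genome.
haploid_peak = kmer_cov / 2

print(f"total k-mer coverage ~{kmer_cov:.1f}x, haploid peak ~{haploid_peak:.1f}x")
```

      Under these assumptions the haploid peak lands near 20x, close to the reported ~22x, so part of the apparent discrepancy may be expected arithmetic rather than missing data.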

      Reviewer 2. Haimeng Li

      Are all data available and do they match the descriptions in the paper?

      No. The genomic annotation file is not publicly available.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. Genomic annotation information and protein sequence information were not found in the NCBI database.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No.

      Additional Comments:

      The manuscript, 'Draft Genome of the Endangered Visayan Spotted Deer (Rusa alfredi), a Philippine Endemic Species,' contributes to the field of conservation genomics. The study presents the first draft genome assembly of the Visayan Spotted Deer, utilizing Illumina short-read sequencing technology to generate valuable genomic resources for this endangered species. Here are some questions and comments.
      

      Q1. Why was gene annotation conducted using only homology-based annotation? It is recommended that the annotation approach include de novo, RNA-based, and homology-based methods. Combining these approaches would provide a more comprehensive gene set, particularly for species with limited genomic resources. Please revise the Methods section to include these additional annotation strategies. The authors have stated that, due to sampling limitations, RNA-based experiments could not be conducted; RNA extraction might be performed using the tissue samples that were previously collected for genome assembly. (Lines 167-172) Q2. Before proceeding with genome assembly, it is essential to conduct a genome survey. This initial step provides crucial information about the genome's size, complexity, and composition, which is vital for planning the assembly strategy and selecting appropriate sequencing technologies and bioinformatics tools. The survey should include estimates of genome size, GC content, repetitive elements, and ploidy level. Additionally, the result could be used to assess the completeness of the assembly (a minimal genome-size sketch follows this review). Please include a section on the genome survey in the Methods section. Q3. To enhance the quality and contiguity of the assembly, utilizing another species as a reference genome for scaffolding might introduce errors due to discrepancies in karyotype. It is essential to ascertain whether there is a definitive karyotype study that verifies the consistency of the karyotype between the Visayan Spotted Deer and the reference species, indicating the absence of chromosomal fission or fusion events. (Lines 236-238) This information is crucial for the reliability of the scaffolding process. Q4. Although the scaffold N50 is long, the high number of scaffolds and contigs suggests fragmentation. Have you addressed redundancy in the assembly? (Line 238) Q5. Have you used software like Merqury to detect assembly errors and assess the completeness of the assembly? This is useful for evaluating the quality of the genome sequence and identifying potential issues that may need to be addressed. Q6. Are the species divergent, which might explain the low number of orthologous genes? Is this an annotation issue, or does it reflect true biological divergence? Further investigation into the annotation process and comparative genomic analyses may be warranted to understand the extent of divergence and the implications for the study. (Lines 313-317) Q7. Please standardize the format of numbers throughout the manuscript to maintain consistency in the number of significant figures. (Lines 224, 225, 227, 239, 245)
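      A minimal Python sketch of the k-mer genome survey requested in Q2 (referenced above), using the standard estimate genome size ~ total k-mers / modal k-mer coverage, applied to a toy histogram of the kind Jellyfish or KMC produce; all numbers are invented.

```python
# Illustrative only: estimate genome size from a k-mer multiplicity
# histogram. Low-multiplicity k-mers are treated as sequencing errors.

def genome_size_from_histogram(hist, min_count=5):
    """hist: list of (multiplicity, n_distinct_kmers) pairs."""
    usable = [(m, n) for m, n in hist if m >= min_count]
    total_kmers = sum(m * n for m, n in usable)
    peak = max(usable, key=lambda mn: mn[1])[0]  # modal k-mer coverage
    return total_kmers / peak

# Toy histogram: error k-mers at low multiplicity, main peak near 22x.
toy = [(1, 9_000_000), (2, 1_500_000)] + [
    (m, 100_000 + 400_000 // (abs(m - 22) + 1)) for m in range(5, 41)
]
print(f"estimated genome size: ~{genome_size_from_histogram(toy):,.0f} bp")
```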

      Re-review: Q1: Why is the estimated genome size from the genome survey much smaller than the assembled genome size? Q2: In the method section, I did not see a description of the de novo method for gene structure annotation. Q3: I am concerned about using a reference genome with unclear karyotype relationships for scaffolding. Q4: Are there other published comparative genomic studies on deer that have identified such a small number of homologous genes?

    1. Editors Assessment:

      Teinturier grapes produce berries with pigmented skin and flesh, and are used in red wine blends, as they provide a deeper colour. This paper presents the genomes of two popular teinturier varieties (Dakapo and Rubired), sequenced, assembled, and annotated to provide additional resources for their use in breeding. For Dakapo, Nanopore and Illumina sequencing were combined with scaffolding to the existing grapevine assembly to generate a final assembly of 508.5 Mbp with 36,940 annotated genes. For Rubired, PacBio HiFi reads were assembled, scaffolded, and phased to generate a diploid assembly with two haplotypes of 474.7-476.0 Mbp and 56,681 annotated genes. Peer review helped validate their high quality, and these genomes will hopefully enable more insight into the genetics of grapevine berry colour and other traits such as frost and mildew resistance.

      This evaluation refers to version 1 of the preprint

    2. ABSTRACT: Background: Teinturier grapevine varieties were first described in the 16th century and have persisted due to their deep pigmentation. Unlike most other grapevine varieties, teinturier varieties produce berries with pigmented flesh due to anthocyanin production within the flesh. As a result, teinturier varieties are of interest not only for their ability to enhance the pigmentation of wine blends but also for their health benefits. Here, we assembled and annotated the Dakapo and Rubired genomes, two teinturier varieties. Findings: For Dakapo, we used a combination of Nanopore sequencing, Illumina sequencing, and scaffolding to the existing grapevine genome assembly to generate a final assembly of 508.5 Mbp with an N50 scaffold length of 25.6 Mbp and a BUSCO score of 98.0%. A combination approach of de novo annotation and lifting over annotations from the existing grapevine reference genome resulted in the annotation of 36,940 genes in the Dakapo assembly. For Rubired, PacBio HiFi reads were assembled, scaffolded, and phased to generate a diploid assembly with two haplotypes 474.7-476.0 Mbp long. The diploid genome has an N50 scaffold length of 24.9 Mbp and a BUSCO score of 98.7%, and both haplotype-specific genomes are of similar quality. De novo annotation of the diploid Rubired genome yielded annotations for 56,681 genes. Conclusions: The Dakapo and Rubired genome assemblies and annotations will provide genetic resources for future investigations into berry flesh pigmentation and other traits of interest in grapevine.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.149). These reviews (including a protocol review) are as follows.

      Reviewer 1. Camille Rustenholz

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Overall, the authors give enough details except for the haplotypes of Chardonnay, Pinot noir, Cabernet sauvignon and Cabernet franc that were used for Figure 3.

      Is the validation suitable for this type of data? No. Overall, the authors provide accurate validation for this type of data, except for the inversion identified on chromosome 10 of the Dakapo assembly. In my opinion, more evidence needs to be provided, as the Dakapo contigs were anchored using the PN40024 12X.v2 assembly version. There is indeed a heterozygous region at the beginning of chromosome 10 in the PN40024 genome, which makes its assembly and scaffolding quality quite doubtful at that exact location, especially for this assembly version. I would suggest checking it using the latest PN40024 T2T version (Shi et al., Hort Res 2023) and showing some Dakapo short-read alignments against its own assembly to validate the borders of this inversion, even though wet-lab validation would be even more convincing.

      Additional Comments: The authors provided the assemblies and gene annotations of the genomes of two teinturier varieties, Dakapo and Rubired. Dakapo was assembled using a combination of Nanopore and Illumina reads, whereas Rubired was assembled using PacBio HiFi reads. Even though both assemblies are of high quality, the quality metrics are better for the Rubired assembly than for the Dakapo assembly, in terms of contiguity and phasing. I would have liked the authors to comment on and explain these differences more extensively, maybe in a dedicated paragraph in the Discussion section: - Why could the Dakapo assembly not be phased? - Are these differences in quality due to the sequencing technologies used (Nanopore versus PacBio HiFi)? Or to different years of dataset acquisition? Or to the assembly methods? Both assemblies were also annotated: 36,940 genes in the Dakapo assembly and 56,681 genes in the diploid Rubired. I assume that 56,681 is the sum of the number of genes annotated on haplotype 1 and haplotype 2 of Rubired. If so, it needs to be clearly stated at line 328; otherwise it can be confusing for the reader, who will think that Rubired has 20,000 more genes than Dakapo. Also, the authors used two different annotation pipelines, which complicates the gene content comparison and the synteny analysis later on. I would have liked the authors to comment on and explain this: - Is it due to the difference in the quality of the assemblies? If so, the authors need to highlight the limits of their annotation pipeline regarding assembly quality. - Any other explanation? Some minor suggestions: - Line 74: please use the word “clone” in the sentence for clarity. - Lines 292-293: the PN40024.v4 assembly is not the most recent; the PN40024 T2T is (Shi et al., Hort Res, 2023). The quality of the assemblies and annotations is very good, and the resources of the paper will be very valuable for the grapevine community, especially for studying anthocyanin production in grapevine.

      Reviewer 2. Andrea Gschwend

      Are all data available and do they match the descriptions in the paper? No. The supplementary files were not made available to me for review.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. I recommend including additional details for the programs used for the Rubired genome assembly and annotation in this manuscript, though.

      Is there sufficient data validation and statistical analyses of data quality? No. It is unclear from the manuscript whether the large Dakapo inversion was validated experimentally. See additional comments in the uploaded Word document: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTQ1L1JpdHRlcl9ldF9hbC5fMjAyNF9HaWdhYnl0ZV9yZXZpZXdlcl9jb21tZW50c184LTIzLTI0LmRvY3g=

      Reviewer 3. Yongfeng Zhou and Kekun Zhang

      Are all data available and do they match the descriptions in the paper? No. Is there sufficient data validation and statistical analyses of data quality? No. Is there sufficient information for others to reuse this dataset or integrate it with other data? No. Additional Comments: My main concerns: 1. Please explain why different sequencing methods were chosen for the genome assemblies of Dakapo and Rubired, given that HiFi sequencing is currently mainstream and provides more accurate assemblies. 2. Recently, T2T-level genomes of many grape cultivars have been assembled, including the reference genome PN_T2T and the teinturier grape Yan73. Please align with the latest complete reference genome PN_T2T at line 172, and add the genome information for PN_T2T and Yan73 to Table 1 (DOI: 10.1093/hr/uhad061; DOI: 10.1093/hr/uhad205). 3. Lines 387-389: How did you verify the correctness of this inversion? Is it contained within a single contig, without orientation or assembly errors, in the Dakapo genome? Have you identified any other genomes with this inversion? 4. Line 255: Can you explain why the contig N50 is so low? 5. Line 328: Is 56,681 the total number of annotated genes across the two Rubired haplotypes? It would be more appropriate to describe them separately. 6. The phenotypes of these two grapes should be included, not just in the pattern diagram. 7. The sequence difference in Figure 2 should be verified using other methods, such as PCR and Sanger sequencing.

    1. Editors Assessment:

      The accuracy of basecalling in nanopore sequencing still needs to be improved. Drawing on recent advances in deep learning, this paper introduces SqueezeCall, a novel end-to-end tool for accurate basecalling. It uses the Squeezeformer architecture, which integrates local context extraction through convolutional layers with long-range dependency modeling via global context acquisition. Testing and peer review demonstrated that SqueezeCall outperformed traditional RNN- and Transformer-based basecallers across multiple datasets, indicating its potential to refine genomic assembly and facilitate direct detection of modified bases in future genomic analyses. Ongoing work will focus on training on highly curated datasets, including known modifications, to further increase research value. SqueezeCall is MIT licensed and available from GitHub here: https://github.com/labcbb/SqueezeCall

      This evaluation refers to version 1 of the preprint

    2. Abstract: Nanopore sequencing, a novel third-generation sequencing technique, offers significant advantages over other sequencing approaches, owing especially to its capabilities for direct RNA sequencing, real-time analysis, and long-read length. During nanopore sequencing, the sequencer measures changes in electrical current that occur as each nucleotide passes through the nanopores. A basecaller identifies the base sequences according to the raw current measurements. However, due to variations in DNA and RNA molecules, noise from the sequencing process, and limitations in existing methodology, accurate basecalling remains a challenge. In this paper, we introduce SqueezeCall, a novel approach that uses an end-to-end Squeezeformer-based model for accurate nanopore basecalling. In SqueezeCall, convolution layers are used to downsample raw signals and to model local dependencies. A Squeezeformer network is employed to capture the global context. Finally, a connectionist temporal classification (CTC) decoder generates the DNA sequence by a beam search algorithm. Inspired by the Wav2vec2.0 model, we masked a proportion of the time steps of the convolution outputs before feeding them to the Squeezeformer network and replaced them with a trained feature vector shared between all masked time steps. Experimental results demonstrate that this method enhances our model’s ability to resist noise and allows for improved basecalling accuracy. We trained SqueezeCall using a combination of three types of loss: CTC-CRF loss, intermediate CTC-CRF loss, and KL loss. Ablation experiments show that all three types of loss contribute to basecalling accuracy. Experiments on multiple species further demonstrate the potential of the Squeezeformer-based model to improve basecalling accuracy and its superiority over a recurrent neural network (RNN)-based model and Transformer-based models.
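      For readers unfamiliar with the Wav2vec2.0-style masking the abstract describes, a compact sketch may help. The following PyTorch snippet is our own minimal illustration, not SqueezeCall's actual implementation; the default mask probability (0.05) and span length (5) are the values quoted in the reviews below, and all names are hypothetical.

      import torch
      import torch.nn as nn

      class TimeStepMasker(nn.Module):
          """Replace random spans of time steps with one shared, learned mask vector."""
          def __init__(self, d_model, mask_time_prob=0.05, mask_time_length=5):
              super().__init__()
              self.mask_time_prob = mask_time_prob      # chance that each step starts a masked span
              self.mask_time_length = mask_time_length  # length of each masked span
              self.mask_emb = nn.Parameter(torch.randn(d_model))  # vector shared by all masked steps

          def forward(self, x):
              # x: (batch, time, d_model), the output of the convolutional front end
              if not self.training:
                  return x
              b, t, _ = x.shape
              starts = (torch.rand(b, t, device=x.device) < self.mask_time_prob).nonzero()
              x = x.clone()
              for bi, ti in starts.tolist():
                  x[bi, ti:ti + self.mask_time_length] = self.mask_emb.to(x.dtype)  # slice clamps at t
              return x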

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.148). These reviews (including a protocol review) are as follows.

      Reviewer 1. Tao Jiang

      In this study, Zhongxu Zhu presents a novel approach combining the Squeezeformer architecture with masking techniques for nanopore basecalling, demonstrating meaningful improvements over existing methods. However, several concerns need to be addressed before publication.
      1. The rationale behind the chosen hyperparameter values (e.g., mask_time_prob = 0.05 and mask_time_length = 5) is unclear. Did the authors experiment with other hyperparameter settings? If so, please provide results or justification for selecting these specific values.
      2. The signal preprocessing methodology would benefit from a more detailed explanation. Specifically, the current description should clarify whether standard signal normalization techniques were applied to the raw current signals and detail any FFT preprocessing steps. Since nanopore sequencing signals can vary significantly between different species and experimental runs, explaining how SqueezeCall handles these variations would help other researchers implement and potentially improve upon this work. The author could consider including a flowchart or detailed pseudocode of the preprocessing pipeline (see the sketch after this review).
      3. A more detailed analysis of the model's error handling would strengthen the paper. Specifically, how effectively does SqueezeCall address key challenges in nanopore sequencing, such as homopolymer errors?
      4. The manuscript requires attention to detail in presentation, such as: I) In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which should be corrected for consistency. II) On page 12, line 19, there is an unnecessary "e.g." before "SqueezeCall," which should be removed.
      5. Instances of "Error! Reference source not found" are present in the manuscript. Please resolve these citation errors to ensure clarity and credibility.
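      As an illustration of the preprocessing pseudocode that point 2 asks for, here is a minimal sketch of the median/MAD normalization commonly applied to chunks of raw nanopore current. This shows a standard technique in the field, not SqueezeCall's documented pipeline; the function name and constants are our own.

      import numpy as np

      def normalize_signal(raw, clip=5.0):
          """Center by median, scale by MAD, and clip outliers, as many basecaller front ends do."""
          med = np.median(raw)
          mad = 1.4826 * np.median(np.abs(raw - med))  # 1.4826 makes MAD comparable to a std dev
          norm = (raw - med) / max(mad, 1e-8)
          return np.clip(norm, -clip, clip)

      # toy usage on a fake chunk of raw current values (picoamps)
      signal = np.random.default_rng(0).normal(90.0, 12.0, 4000)
      print(normalize_signal(signal)[:5])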

      Re-review: The revised manuscript addresses most of my concerns; however, I have a few additional suggestions before recommending it for publication:
      1) The newly added Mask-module experiments present only the results; charts should be included to provide a more intuitive, visual representation of these results.
      2) The images included in the response should also be incorporated into the main text or published as supplementary materials alongside the manuscript.
      3) The formulas in the manuscript are missing numbers. It is recommended to number each formula for clarity and ease of reference.

      Reviewer 2. Ximei Luo

      This manuscript describes a tool called SqueezeCall, designed for accurate nanopore basecalling. The authors compare SqueezeCall with four existing basecalling methods across 11 different datasets and report that it outperforms them in terms of basecalling accuracy. However, the study has several shortcomings and requires further clarification. Below are my comments.
      1) The current discussion and conclusion section lacks sufficient analysis of the scientific and practical value of the proposed algorithm for nanopore sequencing. To strengthen the manuscript, consider expanding the conclusion section to provide a detailed discussion of the practical applications of the tool in real-world nanopore sequencing workflows. Additionally, include potential directions for further improvement of the algorithm to inspire future research and development in this area.
      2) The figures in the manuscript are blurry and should be improved for clarity. Additionally, the layout requires better structuring and alignment, ensuring that the borders are neat and consistent. Efforts should be made to enhance the visual appeal of the figures, and the accompanying descriptions should provide sufficient detail to enable readers to understand the content by reviewing the figures alone.
      3) To better showcase SqueezeCall's superiority, it is advisable to include one or two of the latest methods for comparison.

      Minor comments:
      1) There are instances of missing punctuation marks in sentences throughout the article. For example, the sentence on page 3, line 9, is missing a period at the end.
      2) Address the "Reference not found" issues that appear in several places in the manuscript.
      3) Number all formulas in the manuscript for easier reference and citation.
      4) Verify that all references are complete and formatted according to the target journal's guidelines.
      5) Some areas in Table 1 that necessitate emphasis through bold formatting are inaccurately labeled.
      6) Certain content in Figure 1 and Figure 2 appears redundant; consolidation is recommended to streamline the visuals.

      Reviewer 3. Yongtian Wang

      The manuscript presents SqueezeCall, an innovative approach that combines Squeezeformer architecture with masking techniques for nanopore basecalling. The work demonstrates promising accuracy improvements through comprehensive evaluation across multiple datasets, including human, lambda phage, and nine bacterial datasets. The architecture thoughtfully integrates convolution layers for signal downsampling, employs a Squeezeformer network for capturing global context, and introduces a novel masking technique inspired by Wav2vec2.0. While the research direction and initial results are valuable, several aspects could be strengthened to enhance the work's impact:
      1. Several formatting inconsistencies in the manuscript require attention for improved clarity. In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which affects the table's readability. On page 12, line 19, the redundant "e.g." before "squeezecall" should be removed. The citation system needs review, as multiple instances of "Error! Reference source not found" appear throughout.
      2. The mask hyperparameter selection (mask_time_prob = 0.05 and mask_time_length = 5) requires empirical justification. Including ablation studies showing model performance with different masking probabilities (e.g., 0.01, 0.03, 0.07, 0.1) and lengths (e.g., 3, 7, 10) would provide valuable insights. This analysis could reveal whether the chosen values are optimal or if there is room for improvement. A visualization of how different masking parameters affect model performance could be particularly instructive.
      3. The error analysis could be expanded to provide deeper technical insights. The author should particularly analyze the distribution of skip and stay errors in homopolymer regions (e.g., AAAAA or GGGGG), where nanopore basecalling typically struggles.
      4. The manuscript would benefit from exploring modified base calling capabilities. The author could train and evaluate the model on datasets containing known DNA modifications (e.g., 5mC, 6mA). This could start with synthetic sequences containing known modifications and extend to well-characterized genomic regions. Even if full modified base calling is beyond the current scope, preliminary results or architectural considerations for future extension would be valuable.

    1. Abstract: The development of long-read sequencing promises high-quality, comprehensive de novo assembly for species around the world. However, it is still challenging for genome assemblers to handle thousands of genomes, genome sizes of tens of gigabases, and terabase-level datasets simultaneously and efficiently, which is a bottleneck for large de novo sequencing studies. A major cause is read overlap graph construction, for which state-of-the-art tools usually cost terabytes of RAM and tens of days on large genomes. Such performance and scalability are ill-suited to the numerous samples to be sequenced. Herein, we propose xRead, an iterative overlap graph approach that achieves high performance, scalability and yield simultaneously. Guided by its novel read-coverage-based model, xRead uses a heuristic alignment-skeleton approach to implement incremental graph construction with highly controllable RAM usage and faster speed. For example, it can process the 1.28 Tb A. mexicanum dataset with less than 64 GB of RAM and a markedly lower time cost. Moreover, benchmarks on datasets from genomes of various sizes suggest that it achieves higher accuracy in overlap detection without loss of sensitivity, which also guarantees the quality of the produced graphs. Overall, xRead is suited to handling large numbers of datasets from large genomes, especially with limited computational resources, and may play an important role in many de novo sequencing studies.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer #2: Anuradha Wickramarachchi

      Overall comments.

      Authors of the manuscript have developed an iterative overlap graph construction algorithm to support genome assembly. This is both an interesting and a demanding area of research due to very recent advancements in sequencing technologies.

      Although the text of the manuscript is interesting, the grammar must be rechecked and revised. In places it is difficult to keep track of the content, and the frequent references to the supplementary material make it hard to make sense of the content.

      Specific comments

      Page 1 Line 13: I believe the authors are talking about assembly sizes and not genome sizes. The sentences here could be shortened a bit to make them easier to understand.

      Page 2 Line 19: The theoretical time complexity O(m²n²) is a bit of an overstatement, given the heuristics employed by most assemblers. For example, Mash distances, minimizers and k-mer bins are there to prevent this explosion of complexity. Either acknowledge such methods or provide a range for the time complexity. It would be interesting to know the time complexities of the methods mentioned in the sentence starting at Line 15.
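      For context, the quadratic bound the reviewer quotes is what naive all-vs-all overlap detection would cost; a one-line sketch in LaTeX, assuming n reads of average length m and full dynamic-programming alignment per pair:

      % each of the ~n^2/2 read pairs costs O(m^2) under full DP alignment
      \underbrace{\binom{n}{2}}_{\text{read pairs}} \cdot \underbrace{O(m^{2})}_{\text{per-pair alignment}} = O(n^{2}m^{2})

      Seeding heuristics such as minimizers shrink the per-pair cost far below O(m²), which is the reviewer's point.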

      Page 5 Line 11: Was this performed with overlapping windows of 1 Gb? Otherwise, the simulations may not have reads spanning across such regions.

      Page 5 Line 14: It seems you are simulating 9 + 4 + 4 datasets. This is unclear; please break it into bullet points or separate paragraphs and explain clearly. Include the simulator information in the table itself, perhaps by making it landscape (in the supplementary).

      Fig 2: I believe the authors should expand their analysis to more recent and popular assemblers. For example, wtdbg2 is designed for noisy reads and not specifically for more accurate R10/HiFi reads. So please include hifiasm and Flye where appropriate. Flye supports ONT out of the box and, in my experience, does produce good assemblies.

      Although you are evaluating read overlaps, it is hard to ignore the assemblers themselves just because they do not produce intermediate overlap graphs.

      Page 5-9: In the benchmarks section, please describe how True Positives and False Positives were labelled. Was this based on the simulation data?

      Page 11: The use of xRead has been evaluated on genome assemblies. This is very important, and it is a bit unfortunate that existing assemblers are not very flexible in terms of plugging in new intermediate steps. It might be worth exploring the creation of a new assembler using the wtpoa-cns CLI command of wtdbg2.

      Page 16: What will happen if you only capture reads from a single chromosome because of their longer length? I believe the objective is to gather the longest reads while covering as much of the whole genome as possible. Please comment on this.

      Page 19: In the GitHub README the download URL was wrong. Please correct it to the latest release:

      Correct: https://github.com/tcKong47/xRead/releases/download/xRead-v1.0.0.1/xRead-v1.0.0.tar.gz Existing: https://github.com/tcKong47/xRead/releases/download/v1.0.0/xRead-v1.0.0.tar.gz

      The make command failed with: make: *** No rule to make target 'main.h', needed by 'main.o'. Stop.

      It seems the release does not contain the source code, but rather the compiled version. Please update the GitHub instructions to explain how to compile the code properly from a git clone.

    2. Abstract: The development of long-read sequencing promises high-quality, comprehensive de novo assembly for species around the world. However, it is still challenging for genome assemblers to handle thousands of genomes, genome sizes of tens of gigabases, and terabase-level datasets simultaneously and efficiently, which is a bottleneck for large de novo sequencing studies. A major cause is read overlap graph construction, for which state-of-the-art tools usually cost terabytes of RAM and tens of days on large genomes. Such performance and scalability are ill-suited to the numerous samples to be sequenced. Herein, we propose xRead, an iterative overlap graph approach that achieves high performance, scalability and yield simultaneously. Guided by its novel read-coverage-based model, xRead uses a heuristic alignment-skeleton approach to implement incremental graph construction with highly controllable RAM usage and faster speed. For example, it can process the 1.28 Tb A. mexicanum dataset with less than 64 GB of RAM and a markedly lower time cost. Moreover, benchmarks on datasets from genomes of various sizes suggest that it achieves higher accuracy in overlap detection without loss of sensitivity, which also guarantees the quality of the produced graphs. Overall, xRead is suited to handling large numbers of datasets from large genomes, especially with limited computational resources, and may play an important role in many de novo sequencing studies.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer #1: Antoine Limasset

      The manuscript describes xRead, a novel method that enables resource-efficient overlap graph computation, based on new strategies for computing the graph quickly and with controlled memory usage. The authors introduce several quality metrics to assess the quality of the overlap graph and integrate their tool into NextDenovo, improving its resource usage.

      The manuscript is overall clear, although the section order can make it hard to read as concepts are defined backward. Some typos and minor phrasing issues should be corrected.

      Remarks:

      The manuscript spends a lot of time evaluating the quality of the overlap graph, which is a very commendable approach, as this is often overlooked. I thank the authors for this contribution. However, I have issues with the definition of a ground-truth overlap. Even if two reads do not come from successive parts of the genome, if they share, let's say, a very large perfect overlap, they should indeed overlap in the graph. Considering the actual biological overlap to necessarily be the best one found in the reads is a greedy strategy that could harm the final assembly. Because of this definition, I am not fully convinced by xRead's performance, as it seems to employ an overall very greedy strategy.

      A key selling point of the abstract is the ability of xRead to work with controlled memory usage at the expense of time and external-memory usage. Showing some results on this feature would be very interesting, for example a plot of time performance as a function of memory usage. The amount of external memory used should also be discussed.

      As far as I understand, the end goal of xRead is to enable efficient de novo assembly. The assembly results should therefore be the primary results of the manuscript, not relegated to the supplementary section. The assembly benchmark should include other assemblers, not only NextDenovo. The assembly results and their justification are not quite convincing, since the proposed assembler is slightly more resource-efficient at the cost of degraded assembly quality. While the case studies are interesting, it is hard to avoid concluding that the overall quality is degraded compared to regular NextDenovo.

    1. Abstract: The blue peafowl (Pavo cristatus) and the green peafowl (Pavo muticus) enjoy significant public affection due to their stunning appearance, although the green peafowl is currently endangered. Some studies have suggested introgression between these two species, although the evidence is mixed. In this study, we successfully assembled a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, the Z and W sex chromosomes, and a complete mitochondrial DNA sequence. Data from 77 peafowl whole genomes, 76 peafowl mitochondrial genomes and 33 peahen W chromosome genomes provide the first substantial genetic evidence for recent hybridization between green and blue peafowl. We found three hybrid green peafowls in zoo samples, but none in the wild samples, with blue peafowl genomic content of 16-34%. Maternal genetic analysis showed that two of the hybrid female green peafowls carried complete blue peafowl mitochondrial genomes and W chromosomes. Hybridization of an endangered species with its relatives is extremely detrimental to conservation. Some animal protection agencies release captive green peafowls in order to maintain the wild population of green peafowls. Therefore, to better protect the endangered green peafowl, we suggest that purebred identification must be carried out before releasing green peafowls from zoos into the wild, to prevent hybrid green peafowls from contaminating the wild population. In addition, we found historical introgression events from green peafowl into blue peafowl in four zoo blue peafowl individuals. The introgressed genomic regions contain the IGFBP1 and IGFBP2 genes, which could affect blue peafowl body size. Finally, we identified the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene as the causative mutation for the white feather color of the blue peafowl (also called the white peafowl), which prevents melanocytes from being transported into feathers, such that melanin cannot be deposited.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae124), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer #2: Subhradip Karmakar

      I read with interest the manuscript "Genomic evidence for hybridization and introgression between blue peafowl and endangered green peafowl and molecular foundation of peafowl white plumage" by Lujiang et al. This is a well-drafted, well-executed study that investigated the effect of introgression in shaping the genomic diversity landscape of the peafowl. I am glad the authors undertook this much-needed study, which is so critical from an evolutionary point of view. I have a few queries and points needing clarification:
      1. Fig S21 (Manhattan plot): What are the loci on Chr 4 and Chr 6 that are above the threshold? What are the consequences for IL12b and IL25?
      2. Page 50, Line 929: "The genes (IGF2BP3, TGBR1, ISPD, MEOX2, GLI3 and MC4R) related to body size in blue peafowl were also found to have introgression areas from green peafowl". What is the evidence for this? Were these genes absent before the introgression events in blue peafowl? What are the modifications of IGFBP after introgression? Is it under positive selection? If yes, why?
      3. There is not much discussion of Fig S22 (supplementary) on the KEGG pathway hits. What is the significance of ribosome biogenesis, protein processing in the ER, etc.?
      4. The white peafowls were homozygous for the mutant (A/A), resulting in the loss of the EDNRB2 transcript. What is the reason for this mutant gene's fixation in white-plumage birds?
      5. The images, almost all of them, appear very hazy and blurry. It may be an issue with my computer. Please recheck.
      6. Please elaborate on the significance of IL6 and other immune-related genes in the discussion.

    2. Abstract: The blue peafowl (Pavo cristatus) and the green peafowl (Pavo muticus) enjoy significant public affection due to their stunning appearance, although the green peafowl is currently endangered. Some studies have suggested introgression between these two species, although the evidence is mixed. In this study, we successfully assembled a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, the Z and W sex chromosomes, and a complete mitochondrial DNA sequence. Data from 77 peafowl whole genomes, 76 peafowl mitochondrial genomes and 33 peahen W chromosome genomes provide the first substantial genetic evidence for recent hybridization between green and blue peafowl. We found three hybrid green peafowls in zoo samples, but none in the wild samples, with blue peafowl genomic content of 16-34%. Maternal genetic analysis showed that two of the hybrid female green peafowls carried complete blue peafowl mitochondrial genomes and W chromosomes. Hybridization of an endangered species with its relatives is extremely detrimental to conservation. Some animal protection agencies release captive green peafowls in order to maintain the wild population of green peafowls. Therefore, to better protect the endangered green peafowl, we suggest that purebred identification must be carried out before releasing green peafowls from zoos into the wild, to prevent hybrid green peafowls from contaminating the wild population. In addition, we found historical introgression events from green peafowl into blue peafowl in four zoo blue peafowl individuals. The introgressed genomic regions contain the IGFBP1 and IGFBP2 genes, which could affect blue peafowl body size. Finally, we identified the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene as the causative mutation for the white feather color of the blue peafowl (also called the white peafowl), which prevents melanocytes from being transported into feathers, such that melanin cannot be deposited.

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae124), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer #1: Huirong Mao

      The authors have completed a very systematic and comprehensive study. They obtained a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, the Z and W sex chromosomes, and a complete mitochondrial DNA sequence, by combining several sequencing technologies (HiFi sequencing and Hi-C sequencing). Based on this, they further confirmed the evidence of introgression between blue peafowl and green peafowl. In addition, identifying the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene as the causative mutation for the white feather color of the blue peafowl fills an important gap in the genetic mechanism of white plumage in the peafowl. Overall, the results and resources obtained from this study are valuable for further comparative genomic studies in birds. The analyses are also sound and comprehensive. However, before considering acceptance, there are some questions and clarifications needed from the authors to fully substantiate the findings and their implications.
      i) The "Results" section of the paper contains extensive analysis and discussion, which overlaps significantly with the "Discussion" section. It is recommended to consolidate and streamline these sections.
      ii) The authors used 'white feather' peafowl throughout the manuscript. There are in fact scientific terms for these color abnormalities, for instance leucism or albinism. Please define whether your samples come from leucistic or albino populations, and change the term 'white feather' throughout the manuscript accordingly.
      iii) The authors used three types of data (one-to-one ortholog datasets, four-fold degenerate site datasets and mitochondrial sequence datasets) to study the genetic relationships between peafowl, chicken and turkey, and concluded that the genetic distance between peafowl and chicken is closer (see Lines 859-862). However, in the results section, the pattern of tree3 in Figure 1C shows that the genetic distance between peafowl and turkey appears to be closer, suggesting a contradiction between the results and discussion sections.
      iv) Why were individuals with the "pied" phenotype not selected as controls for the corresponding transcriptomic study, to validate the molecular mechanisms of feather formation in blue peafowl using the RNA-Seq results?
      v) The statement in the sentence "Compared with the peafowl, the ROH length of all peafowl populations is short and the total is small" (see Lines 624-625) seems to be incorrect.
      vi) The entire paper still needs further improvement in terms of writing norms and grammar (e.g., Line 642, "as an outgroup"; Line 647, "The mitochondrial phylogenetic"; etc.).

    1. Abstract: Background: Precise prediction of epitope presentation on human leukocyte antigen (HLA) molecules is crucial for advancing vaccine development and immunotherapy. Conventional HLA-peptide binding affinity prediction tools often focus on specific alleles and lack a universal approach for comprehensive HLA site analysis. This limitation hinders efficient filtering of invalid peptide segments. Results: We introduce TransHLA, a pioneering tool designed for epitope prediction across all HLA alleles, integrating Transformer and Residue CNN architectures. TransHLA utilizes the ESM2 large language model for sequence and structure embeddings, achieving high predictive accuracy. For HLA class I, it reaches an accuracy of 84.72% and an AUC of 91.95% on IEDB test data. For HLA class II, it achieves 79.94% accuracy and an AUC of 88.14%. Our case studies using datasets like CEDAR and VDJdb demonstrate that TransHLA surpasses existing models in specificity and sensitivity for identifying immunogenic epitopes and neoepitopes. Conclusions: TransHLA significantly enhances vaccine design and immunotherapy by efficiently identifying broadly reactive peptides. Our resources, including data and code, are publicly accessible at https://github.com/SkywalkerLuke/TransHLA

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Markus Müller

      The authors present TransHLA, a deep learning tool to predict whether a peptide is an HLA binder or not. They use the ESM2 language model to create peptide embeddings for structural and sequence features and then use transformers and CNNs for the binding prediction. The article is well-written and clear. However, the authors must better justify the choice of their model and its potential application.

      Major comments:

      1) In personalized medicine, the HLA alleles of a patient can be obtained via WES, so there is no need for such an HLA-agnostic binding predictor. Could you briefly outline the most important medical applications where your TransHLA predictor would be most useful?

      2) Could you give more information about your IEDB training set? What are the frequencies of the HLA alleles, and the number of peptides per allele? How did you perform the splits into training, validation, and test sets? Were peptides from the same allele all present in all 3 sets? How does TransHLA perform for peptides binding to alleles not present in the training set compared to peptides binding to alleles present in the training set? How does the performance depend on the number of peptides of the allele in the training set? Is the model biased to these frequent alleles?

      3) Peptides are processed by many steps before being presented on HLA molecules. These include cleavage in the proteasome, transport via TAP to the ER, cleavage by ERADs and finally loading on the HLA complex. Why don't you perform your study on extended peptide sequences, where you take into account several amino acids before and after the peptide termini? Like this, you could also include the other processing steps. It would be interesting to see whether this sequence extension would improve prediction.

      4) Could you compare your approach with a 'simpler' approach, where you calculate all Biopython features (such as flexibility), possibly choose the n most informative ones by feature selection, and use a standard classifier such as logistic regression or XGBoost to predict HLA binding? This method has the advantage that it tells you directly which features are most relevant.
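      A minimal sketch of the kind of baseline the reviewer describes, assuming Biopython and scikit-learn are available; the peptides and labels below are placeholders, not data from the paper.

      import numpy as np
      from Bio.SeqUtils.ProtParam import ProteinAnalysis
      from sklearn.linear_model import LogisticRegression

      def featurize(peptide):
          pa = ProteinAnalysis(peptide)
          flex = pa.flexibility()  # sliding-window Vihinen flexibility; can be empty for short peptides
          helix, turn, sheet = pa.secondary_structure_fraction()
          return [
              float(np.mean(flex)) if flex else 0.0,
              pa.gravy(),              # hydropathy
              pa.aromaticity(),
              pa.isoelectric_point(),
              pa.instability_index(),
              helix, turn, sheet,
          ]

      peptides = ["SIINFEKLM", "GILGFVFTL", "LLFGYPVYV", "AAAAAAAAA"]  # placeholder peptides
      labels = [0, 1, 1, 0]                                            # placeholder binder labels
      X = np.array([featurize(p) for p in peptides])
      clf = LogisticRegression(max_iter=1000).fit(X, labels)
      print(dict(zip(["flexibility", "gravy", "aromaticity", "pI",
                      "instability", "helix", "turn", "sheet"], clf.coef_[0])))

      The linear coefficients give exactly the direct feature-relevance readout the reviewer mentions.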

      5) Please provide the results of the ablation study in a table in the main text, where you compare the ablated models to the base model.

      6) Could you briefly explain what the different terms in the TIM loss are and why they are important?

      7) Does the flexibility depend on the length of the peptides? Peptides longer than 10 often bulge out of the binding groove, and naively one would expect them to be less stiff than peptides of length 8 or 9.

      Minor:

      1) In Equation 10, please define p̂_k. In the text you use T for the number of classes, but in the formulae K.

    2. TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Georgios Fotakis

      1) General Comments
      In this manuscript, the authors present TransHLA, a hybrid transformer model that integrates a transformer-based language model with a deep Convolutional Neural Network (CNN) module. The transformer encoder module leverages a pre-trained large language model (Evolutionary Scale Modeling - ESM2) to extract global features using a multi-head attention mechanism. The feature extraction is further enhanced by two consecutive CNN modules, maximizing the mutual information between query features (sequences) and their label predictions (epitope/non-epitope) through a modified Transductive Information Maximization (TIM) loss function. TransHLA is designed to collectively consider all HLA sites across all alleles and is the first neoantigen prediction tool of its kind, since it does not require HLA alleles as input. The authors also present benchmark study results, showcasing the increased predictive accuracy of TransHLA and its potential as a valuable pre-screening tool.

      The computational method presented in this manuscript demonstrates a strong scientific foundation and shows promise for future refinement and extension, suggesting significant potential for meaningful research output. However, there are some conceptual and technical concerns that need to be addressed.

      2) Specific comments for revision
      a) Major
      Manuscript:
      i) Introduction
      - The authors distinguish between two categories of models: those that need only epitopes as input and those that require both epitopes and HLA alleles as inputs. However, the basis for this classification is unclear. For instance, MHCNuggets and DeepSeqPanII, cited as examples of the first category, actually require both an allele and an epitope to predict neoantigens. This is supported by the algorithms' manuals and the supplementary material provided by the authors, where they specify the need for HLA alleles to execute the commands.

      • The authors state: "Considering that TransHLA is the first epitope prediction software that does not impose restrictions on HLA alleles" This needs clarification, as all available "pan-allele" models do not impose restrictions on HLA alleles (the models are trained on nearly all sequenced HLAs). Perhaps the authors meant that TransHLA does not require HLA alleles as input?

      ii) Results
      - The reason for conducting two separate benchmarks (case study and validation) with different HLA binding affinity predictors is unclear. For instance, it is not explained why netMHCpan/netMHCIIpan were not included in the first benchmark and were only used in the validation part.

      • It would be very informative if the authors could include other widely used HLA binding affinity predictors in their benchmarks, such as MixMHCpred and MixMHC2pred.

      • The authors state: "the details information of alleles used in each tool can be found in the Supplementary File" However, no information about the alleles used in this study is provided (or at least it was not made available to me at the time of reviewing this version of the manuscript).

      • The "protein structural flexibility" should be briefly explained and properly cited (Vihinen et al., 1994, Proteins, 19(2), 141-149).

      iii) Conclusion and Discussion
      - The authors claim that TransHLA alleviates "the restrictive requirement of knowing the specific HLA alleles." However, this is not typically a restriction, as serological typing of HLA is routinely performed in clinics, and samples usually come with relevant metadata. Additionally, HLA typing can be performed easily and with high accuracy from RNAseq and/or WES data, the same data usually required to produce the putative epitopes in the first place (e.g., OptiType can reach 93.5% [CI95: 91.8-95.1%] accuracy for HLA class I). Therefore, this information is generally readily available for processing. While the authors effectively demonstrate the accuracy of TransHLA, they fail to clarify the context in which this computational tool could be utilized.

      • To the best of my knowledge, in the research field of personalized medicine, neoantigen vaccines are typically produced at the patient level, taking the patients' HLA alleles into consideration. Binding affinity, by definition, can quantitatively differentiate between strong (low IC50) and weak (high IC50) binders. Thus, binding affinity predictions are a pivotal step for neoantigen prioritization. Given that the authors suggest TransHLA as an "alternative for filtering potential epitopes", how would TransHLA perform in such situations? To enhance clarity, the authors should elaborate on a scenario where TransHLA would be a superior choice compared to high-performing HLA binding affinity predictors in this context.

      • The authors mention in the introduction that TransHLA can be used to "expedite the precise screening of peptides". Additionally, in their GitHub repository it is stated that TransHLA "can serve as a preliminary screening for the currently popular tools that are specific for HLA-epitope binding affinity", which is quite accurate. They might consider incorporating this concept into their concluding remarks as well.

      Implementation:
      - Since neoantigen prediction is typically carried out using computational pipelines, it would be very helpful if the authors could provide instructions for end-users to install the software and its dependencies in isolated (contained) computational environments. To enhance clarity, I am attaching the files I used to create these environments via Conda (transhla_env.yaml), Singularity (TransHLA.def), and Docker (Dockerfile).

      • Following the previous point, the authors should consider providing a CLI (similar to the "train.py" and "inference.py" scripts in their GitHub repository) to enhance the software's usability in computational pipelines. As an example, I am attaching the script I used to test the software (TransHLA.py).

      b) Minor
      - It would enhance clarity (especially for readers who are not familiar with artificial intelligence) if the authors briefly explained each technical term and then used the abbreviations. For example, "Evolutionary Scale Modeling (ESM2)" and so on.

      • Additionally, the manuscript and its supplementary material contain several grammatical and spelling errors that need to be rectified.
  9. Feb 2025
    1. In this study we present an in-depth analysis of the Eurasian Minnow (Phoxinus phoxinus) genome, highlighting its genetic diversity, structural variations, and evolutionary adaptations. We generated an annotated haplotype-phased, chromosome-level genome assembly (2n = 25) by integrating high-fidelity (HiFi) long reads and chromosome conformation capture data (Hi-C). We achieved a haploid length of 940 Mbp for haplome one and 929 Mbp for haplome two, with high N50 values of 36.4 Mb and 36.6 Mb and BUSCO scores of 96.9% and 97.2%, indicating a highly complete genome. We detected notable heterozygosity (1.43%) and a high repeat content (approximately 54%), primarily consisting of DNA transposons, which contribute to genome rearrangements and variations. We found substantial structural variations within the genome, including insertions, deletions, inversions, and translocations. These variations affect genes enriched in functions such as dephosphorylation, developmental pigmentation, phagocytosis, immunity, and stress response. Protein annotation identified 30,980 mRNAs and 23,497 protein-coding genes with a high completeness score, providing further support for our genome’s high contiguity. We performed a gene family evolution analysis by comparing our proteome to ten other teleost species, which identified immune system gene families that prioritise histone-based disease prevention over NLR-based immune responses. Additionally, demographic analysis indicates historical fluctuations in the effective population size of P. phoxinus, likely correlating with past climatic changes. This annotated, phased reference genome provides a crucial resource for resolving the taxonomic complexity within the genus Phoxinus and highlights the importance of haplotype-phased assemblies in understanding haplotype diversity in species characterised by high heterozygosity. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 2. Alice Dennis

      I previously reviewed this paper for Peer-Community-In Genomics, and you can read those comments via the PCI review page here: https://genomics.peercommunityin.org/articles/rec?id=333

      I did three rounds of review for PCI and was more than happy with the result. I'm attaching them all here in case they didn't all make it to you.

      The original preprint linked to the PCI-review is here: https://doi.org/10.1101/2023.11.30.569369.

      I have no other concerns on the manuscript. Glad to see it published on GigaScience.

    2. Abstract: In this study we present an in-depth analysis of the Eurasian Minnow (Phoxinus phoxinus) genome, highlighting its genetic diversity, structural variations, and evolutionary adaptations. We generated an annotated haplotype-phased, chromosome-level genome assembly (2n = 25) by integrating high-fidelity (HiFi) long reads and chromosome conformation capture data (Hi-C). We achieved a haploid length of 940 Mbp for haplome one and 929 Mbp for haplome two, with high N50 values of 36.4 Mb and 36.6 Mb and BUSCO scores of 96.9% and 97.2%, indicating a highly complete genome. We detected notable heterozygosity (1.43%) and a high repeat content (approximately 54%), primarily consisting of DNA transposons, which contribute to genome rearrangements and variations. We found substantial structural variations within the genome, including insertions, deletions, inversions, and translocations. These variations affect genes enriched in functions such as dephosphorylation, developmental pigmentation, phagocytosis, immunity, and stress response. Protein annotation identified 30,980 mRNAs and 23,497 protein-coding genes with a high completeness score, providing further support for our genome’s high contiguity. We performed a gene family evolution analysis by comparing our proteome to ten other teleost species, which identified immune system gene families that prioritise histone-based disease prevention over NLR-based immune responses. Additionally, demographic analysis indicates historical fluctuations in the effective population size of P. phoxinus, likely correlating with past climatic changes. This annotated, phased reference genome provides a crucial resource for resolving the taxonomic complexity within the genus Phoxinus and highlights the importance of haplotype-phased assemblies in understanding haplotype diversity in species characterised by high heterozygosity.

      After initial review in PCI-Genomics (see https://genomics.peercommunityin.org/articles/rec?id=333), a version of this preprint has now been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae116), where the paper and peer reviews are published openly under a CC-BY 4.0 license. The PCI-Genomics reviewers were asked whether they had any additional comments, and these were as follows.

      Reviewer 1: Henrik Lantz

      I previously reviewed this paper for Peer-Community-In Genomics, and you can read those comments via the PCI review page here:

      https://genomics.peercommunityin.org/articles/rec?id=333

      The original preprint linked to the PCI-reviews is here:

      https://doi.org/10.1101/2023.11.30.569369.

      I am satisfied with the latest version of the manuscript.

    1. Background: In recent years, Large Language Models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses. Results: We developed the Knowledge Graph-based Thought (KGT) framework, an innovative solution that integrates LLMs with Knowledge Graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations, and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the Knowledge Graph Question Answering (KGQA) task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named the Pan-cancer Question Answering (PcQA). Conclusions: The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof-of-concept, demonstrating its exceptional performance in biomedical question answering.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae082), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Cody Bumgardner

      We are just beginning to get a glimpse of the ways that large language models (LLMs) might advance biomedical informatics. I would consider the framework you have described a serious contribution to the state of the art in bridging LLMs and structured data. The use of LLMs for code generation and interpretation within the same request is also innovative. The application of your framework to MeSH (https://www.nlm.nih.gov/mesh/meshhome.html) and other, broader linked ontologies would be very interesting. You might also consider integrating tool calling (which, in a way, you already do with subgraphs), either to further reduce the dimensional space or to access data that does not otherwise have a graph structure. In this case, the content of your subgraph nodes might be the result of a function call. Congratulations on your work; it is a real contribution to our community.

    2. Background: In recent years, Large Language Models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses. Results: We developed the Knowledge Graph-based Thought (KGT) framework, an innovative solution that integrates LLMs with Knowledge Graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations, and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the Knowledge Graph Question Answering (KGQA) task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named the Pan-cancer Question Answering (PcQA). Conclusions: The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof-of-concept, demonstrating its exceptional performance in biomedical question answering.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae082), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Linhao Luo

      Summary: This paper proposes a novel framework called KGT that integrates Large Language Models (LLMs) with Knowledge Graphs (KGs) for pan-cancer question answering. The KGT framework can effectively retrieve knowledge from KGs and improve the accuracy of LLMs for question answering. Moreover, it can provide interpretable and faithful explanations with the help of structured KGs.
      Comments:
      1. This paper constructs a new dataset, denoted PcQA, from a customized KG called SOKG for the evaluation of pan-cancer question answering. This is a great contribution to the community. However, it is unclear how such a dataset was constructed. More details about the construction process and statistics of the final dataset should be discussed in the paper. For example, how were the natural language questions and answers generated? How is a question linked with the related KG information (i.e., entities and relations)? How many questions can be answered by the KG (i.e., the answer coverage rate)? How many questions were generated? What is the ratio of each question type defined in Table 2?
      2. In Table 2, the authors define 4 reasoning types. What about other reasoning types such as union and negation? Could these types be incorporated into the dataset?
      3. The proposed method is novel and interesting. However, some details are unclear. In the candidate path search, do we want to search reasoning paths or relational chains? The definitions of these two kinds of paths are also unclear; please give clear definitions of them in the preliminaries. If it is reasoning paths, is only the type information kept during the BFS?
      4. I do not understand why we need to generate a Cypher query to retrieve a subgraph and then construct relation paths from it. Relational paths could be retrieved directly from the KG by BFS (see the sketch after this review). What are the benefits and motivations of this two-stage pipeline?
      5. What are the meanings of the X and √ in the figure? How are they obtained?
      6. In the experiments, other advanced KGQA methods could be compared, e.g., RoG [1] and ToG [2].
      7. An analysis of token usage, time, and cost should be included in the paper.
      8. Can the proposed method be applied to other KGs (e.g., SynLethKG and SDKG) or KGQA tasks (MetaQA and FACTKG) to show its generalizability?
      [1] Luo, L., Li, Y. F., Haf, R., & Pan, S. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In The Twelfth International Conference on Learning Representations.
      [2] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., ... & Guo, J. (2023). Think-on-Graph: Deep and responsible reasoning of large language model with knowledge graph. arXiv preprint arXiv:2307.07697.
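      To illustrate the direct retrieval described in point 4, here is a toy sketch that enumerates relation-labelled paths between two entities straight from a KG, with no intermediate Cypher step. The triples and entity names are invented, and networkx path enumeration stands in for a hand-rolled BFS; none of this is KGT's implementation.

      import networkx as nx

      # toy pan-cancer-style triples, invented for this example
      triples = [
          ("gefitinib", "targets", "EGFR"),
          ("EGFR", "mutated_in", "lung_adenocarcinoma"),
          ("osimertinib", "targets", "EGFR"),
      ]
      kg = nx.MultiDiGraph()
      for head, rel, tail in triples:
          kg.add_edge(head, tail, relation=rel)

      def relational_paths(kg, source, target, max_hops=3):
          """Enumerate relation-labelled paths from source to target, up to max_hops edges."""
          results = []
          for nodes in nx.all_simple_paths(kg, source, target, cutoff=max_hops):
              chain = []
              for u, v in zip(nodes, nodes[1:]):
                  rel = next(iter(kg[u][v].values()))["relation"]  # label of the first parallel edge
                  chain.append(f"{u} -[{rel}]-> ")
              results.append("".join(chain) + nodes[-1])
          return results

      print(relational_paths(kg, "gefitinib", "lung_adenocarcinoma"))
      # ['gefitinib -[targets]-> EGFR -[mutated_in]-> lung_adenocarcinoma']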

    1. Editors Assessment:

      DNA has huge potential as a data storage medium because of its incredibly high storage density and stability. This work addresses the potential of modified bases, specifically 5-methylcytosine (5mC), for enhancing DNA data storage systems. The paper introduces a transcoding scheme named R+, which incorporates the modified 5mC base to increase information density beyond the standard limits. By encoding various file types into DNA sequences of between 1.3 and 1.6 kb in size, the method achieves an average recovery rate of 98.97% (with reference), validating its effectiveness. On top of a wet-lab protocol (hosted on protocols.io) for the experimental validation of the transcoding scheme, it also includes open source code for in silico simulation tests. Peer review scrutinised the protocols and validation, finding them reusable and the results convincing. As nanopore sequencing has enabled the reading of these modified bases, it is timely to make them applicable as extra letters in the molecular alphabet for DNA data storage.

      This evaluation refers to version 1 of the preprint

    2. Abstract: DNA is a promising next-generation data storage medium. Recently, it has been theoretically proposed that non-natural or modified bases can serve as extra molecular letters to increase the information density. However, the feasibility of this strategy is challenged by the difficulty of synthesizing non-natural DNA sequences and by their complex structure. Here, we describe a practical DNA data storage transcoding scheme named R+, based on an expanded molecular alphabet obtained by introducing 5-methylcytosine (5mC). We also demonstrate experimental validation by encoding one representative file into several 1.3-1.6 kb in vitro DNA fragments for nanopore sequencing. The results show average data recovery rates of 98.97% and 86.91% with and without a reference, respectively. This work validates the practicability of 5mC in DNA storage systems, with a potentially wide range of applications. Availability & Implementation: R+ is implemented in Python and the code is available under the MIT license at https://github.com/Incpink-Liu/DNA-storage-R_plus
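      To make concrete how a fifth letter raises information density, here is a toy base-5 transcoder in Python with 'M' standing in for 5mC. This is our illustration, not the R+ codec: each symbol carries log2(5) ≈ 2.32 bits instead of 2, and a real scheme such as R+ must additionally enforce GC-content and homopolymer constraints that this toy mapping ignores.

      ALPHABET = "ACGTM"  # M stands in for 5-methylcytosine (5mC)

      def encode(data):
          """Map bytes to a base-5 string over the expanded alphabet."""
          n = int.from_bytes(data, "big")
          symbols = []
          while n:
              n, r = divmod(n, 5)
              symbols.append(ALPHABET[r])
          return "".join(reversed(symbols)) or ALPHABET[0]

      def decode(seq, length):
          """Invert encode(); the explicit length restores any leading zero bytes."""
          n = 0
          for ch in seq:
              n = n * 5 + ALPHABET.index(ch)
          return n.to_bytes(length, "big")

      msg = b"GigaByte"
      enc = encode(msg)
      assert decode(enc, len(msg)) == msg
      print(len(enc))  # 27-28 symbols for 8 bytes, versus 32 with a 4-letter alphabet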

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.147). These reviews (including a protocol review) are as follows.

      Reviewer 1. Abdur Rasool

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? Yes, although the GitHub links in the manuscript have a typo; the working code is available at https://github.com/Incpink-Liu/DNA-storage-R_plus

      Is the code executable?

      Unable to test. Complete execution of the given code requires time and resources.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test.

      Additional Comments: This manuscript focuses on DNA data storage based on an expanded molecular alphabet. In view of the challenges non-natural bases pose for synthesis, sequencing, and compatibility, the manuscript proposes a DNA data storage scheme containing 5-methylcytosine, based on the theory that modified bases can replace non-natural bases as extra molecular letters, and develops an adaptive transcoding algorithm named R+ for the corresponding experimental validation. The high data recovery rate obtained from the sequencing analysis demonstrates its practicability.

      This manuscript provides a simple but relatively universal transcoding algorithm for DNA data storage that introduces additional molecular letters. The proposed DNA data storage scheme outperforms conventional DNA data storage in the potential development of information density. Considering the anticipated decrease in future synthesis costs and the expected advancements in relevant transcoding algorithms, my outlook remains optimistic regarding the potential application of this scheme. I suggest that the manuscript could be accepted after a few minor revisions listed below:

      1. Figure 3 in the paper could be further modified, specifically minimizing the excess white space on both sides of Subfigure A to make it more aesthetically pleasing.
      2. Subfigures A, B, and D in Figure 2 and Figure S2 both demonstrate a difference between poem.txt/program.py and the other four files. However, the manuscript lacks an explanation for this phenomenon. Is it related to file size?
      3. The 8 nt adaptors play a key role during sequence assembly in the experimental validation, so I suggest describing the specific process used to generate these adaptors. Text descriptions or flow charts are acceptable.
      4. It would be better to add the in silico simulation to the Methods to make the structure more complete.
      5. For the practicality of DNA storage, I suggest citing https://onlinelibrary.wiley.com/doi/10.1002/smtd.202301585 and https://academic.oup.com/bib/article/25/5/bbae463/7759103.
      6. Provide the correct URLs of GitHub links for reproducibility.

      Reviewer 2. Bi Kun

      Are there (ideally real world) examples demonstrating use of the software?

      No. Additional Comments:

      In this study, a practical DNA data storage transcoding scheme named R+, based on an expanded molecular alphabet, is proposed to increase information density. The experimental validation demonstrates the practicability of DDS-5mC and highlights the enormous potential of modified bases, represented by 5mC, in the field of DNA data storage. Overall, the methods and results look appropriate and promising, but there are minor issues that need to be addressed.

      1. Please indicate the proportions of substitutions, insertions and deletions in the error rates of Fig. 4C and D.
      2. What is the meaning of the vertical axis of Fig. 2B? Is it the number of homopolymers per sequence, the longest homopolymer length, or something else?
      3. Line 304: please add an 's' ("References").
      4. The last sentence of the Abstract states: "This work validates the practicability of 5mC over other non-natural bases in DNA storage systems". Please make it consistent with the last paragraph of the Results (lines 151-154).
      5. If appropriate, a Conclusion section can be added, according to this journal's guidelines.

      Reviewer 3. Lifu Song

      This manuscript explores the application of 5-methylcytosine (5mC) as an additional molecular letter in DNA data storage systems, expanding the molecular alphabet to increase information density. The authors present a novel transcoding scheme (R+) and validate it with both in silico and experimental data. The study explores GC content, homopolymer distribution, and data recovery rates under various conditions, offering detailed insights into practical applications. Experimental validation with nanopore sequencing demonstrates real-world feasibility. By improving storage density and ensuring compatibility with nanopore sequencing, the study addresses significant challenges in incorporating non-natural bases into DNA storage systems. Overall, the manuscript is well-structured and addresses a highly relevant topic in DNA data storage, offering valuable contributions to the field. I recommend it for publication, subject to minor revisions to enhance clarity and precision.

Suggested minor revisions:
      1) Although substitution errors, particularly between C and 5mC, were discussed, the manuscript does not provide a detailed explanation of how these errors affect long-term storage or large-scale applications, both of which are critical for archival storage, the primary use case of DNA data storage technology.
      2) The manuscript could benefit from a broader comparison with other high-density DNA storage strategies, such as composite DNA letters, to contextualize the benefits and limitations of 5mC.
      3) The discussion could be expanded to address practical challenges, such as strategies to reduce synthesis costs and improve sequencing accuracy for modified bases like 5mC, to provide a more holistic perspective on the technology's scalability.

      Protocol Review: I have taken a look at the experiment protocol associated with this manuscript in the website of protocols.io. The protocol looks sensible. I don't have any additional comments about it and am happy for it to go live.

      See: https://dx.doi.org/10.17504/protocols.io.q26g7mr78gwz/v1

  10. Jan 2025
1. Editors Assessment:

This work presents the genome of Cardamine chenopodiifolia, an amphicarpic plant (developing two fruit types, one above and another below ground) in the mustard (Brassicaceae) family. Cardamines are also known as bittercresses and toothworts. As an octoploid it has been challenging to create a genome reference for this species, but in this case the authors managed to achieve this using PacBio HiFi long reads and Omni-C technology to assemble a fully phased, chromosome-level genome. They obtained a 597 Mb genome assembled into 32 phased chromosomes (plus mitochondrial and plastid genomes), with only one gap, in the centromeric region of chromosome 9. Peer review asked for additional QC and benchmarking, helping demonstrate that the genome quality was very high, with only one gap and an N50 of 18.80 Mb. The data presented here will potentially help to develop this species as an emerging model organism in the Brassicaceae for studying the development and evolution of amphicarpy by allopolyploidy.

      This evaluation refers to version 1 of the preprint

AbstractBackground Cardamine chenopodiifolia is an amphicarpic plant that develops two fruit morphs, one above and the other below ground. Above-ground fruit disperse their seeds by explosive coiling of the fruit valves, while below-ground fruit are non-explosive. Amphicarpy is a rare trait that is associated with polyploidy in C. chenopodiifolia. Studies into the development and evolution of this trait are currently limited by the absence of genomic data for C. chenopodiifolia.Results We produced a chromosome-scale assembly of the octoploid C. chenopodiifolia genome using high-fidelity long read sequencing with the Pacific Biosciences platform. We successfully assembled 32 chromosomes and two organelle genomes with a total length of 597.2 Mbp and an N50 of 18.8 Mbp (estimated genome size from flow cytometry: 626 Mbp). We assessed the quality of this assembly using genome-wide chromosome conformation capture (Omni-C) and BUSCO analysis (97.1% genome completeness). Additionally, we conducted synteny analysis to infer that C. chenopodiifolia likely originated via allo- rather than auto-polyploidy and phased one of the four sub-genomes.Conclusions This study provides a draft genome assembly for C. chenopodiifolia, which is a polyploid, amphicarpic species within the Brassicaceae family. This genome offers a valuable resource to investigate the under-studied trait of amphicarpy and the origin of new traits by allopolyploidy.

      Reviewer 1. Rie Shimizu

This manuscript deciphers the complicated genome of an octoploid species, Cardamine chenopodiifolia. The authors successfully assembled a chromosome-level genome with 32 chromosomes, consistent with the chromosome counting. They evaluated the quality of the genome by several methods (mapping Omni-C reads, BUSCO, variant calling, etc.), and all benchmarks confirmed the high quality of their assembly. They even tried to phase the chromosomes into four subgenomes, and one subgenome was successfully phased thanks to its higher divergence compared to the other three sets. Despite their intensive effort, the other three subgenomes could not be phased, suggesting they originated from the same or closely related species. As a whole, the manuscript is very well written and describes enough detail, and the genome data appear to be already available in a public database. They even added a description of a biological application of this assembly, concerning amphicarpy.

I only found a few minor points, for which I kindly suggest reconsideration/rephrasing before publication, as listed below. *As the review PDF does not contain line numbers, I quote the original description first and then write my comments.

– "C. chenopodiifolia genome is octoploid …, suggesting that its genome is octoploid." They compare the 8C peak of C. hirsuta and the 2C peak of the target, but considering the genome size variation among Cardamine species, I do not think this is an appropriate expression. The pattern may be 'consistent' with the expectation from the C. hirsuta peaks, but it does not 'suggest' octoploidy.
      – "C. chenopodiifolia chromosome-level genome assembly … PacBio Sequel II platform." Neither here nor elsewhere do they mention the mode of sequencing (it is only found in the Methods and in the title of a table). Maybe 'HiFi' could be added here to make the method clearer.
      – Table 2: it would give a better overview of the genome quality if the N90 and L90 values (or similar, if the assembly is already fragmented at L90) were added (maybe the same for Table 1). Otherwise, Nx curves would also be fine for the same purpose.
      – "We obtained only 20800 variants, … as expected for a selfing species." This might be partially due to selfing in the wild habitat, but also due to selfing (5 times) in the lab. This should be mentioned here to avoid misleading readers.
      – Table 4: the unit of each item (bp, number, frequency…?) should be indicated.
      In addition to the points listed above, I would appreciate more information about the phased chromosome set: what are the total subgenome sizes of this set and the other three sets (1:3 or imbalanced)? It would be even better with a synteny plot in addition to the colinear plot in Fig 3C (e.g. by GENESPACE or something similar, including the phased and unphased C. chenopodiifolia chromosome sets and C. hirsuta).

      Reviewer 2. .Qing Liu

This manuscript, "Polyploid genome assembly of Cardamine chenopodiifolia", produced a chromosome-scale assembly of the octoploid C. chenopodiifolia genome using high-fidelity long-read sequencing on the Pacific Biosciences platform, together with two organelle genomes, for a total length of 597.2 Mb and an N50 of 18.8 Mb, supported by BUSCO analysis (99.8% genome completeness), and phased one of the four sub-genomes. This study provides a valuable resource to investigate the understudied trait of amphicarpy and the origin of new traits by allopolyploidy. The manuscript is suitably edited and presents significant data for amphicarpy breeding of C. chenopodiifolia, except for the revision points below. Major revision is suggested for the current version of the manuscript.

1. Please clarify whether "an N50 of 18.8 Mb" refers to the contig or the scaffold N50 length.
      2. Please clarify "originated via allo- rather than auto-polyploidy"; should it read "originated via allopolyploidy rather than autopolyploidy"?
      3. Please substitute the word "understudied trait" with an alternative, more sensible word.
      4. "to phase this set of chromosomes by gene tree topology analysis": it is suggested this be "to phase this set of chromosomes by gene phylogeny analysis".
      5. In the first section of the Results, the heading "Cardamine chenopodiifolia genome is octoploid" is suggested.
      6. Could Table 1 and Table 2 be combined into one table to present the sequencing and assembly characterization of the C. chenopodiifolia genome?
      7. Could the centromere locations be predicted in Table 5, the summary of the 32 chromosomes of the C. chenopodiifolia genome?
      8. In Table 2, the assembly lists 32 chromosomes including the two organelles; from my point of view, the two organelle genome assemblies are not closely related to the nuclear genome and are not a critical section of the manuscript.
      9. Could all figure numbers be placed below each group of figures? For example, the figure below should be numbered before Figure 2A (following the order in which the grouped figures appear). I assume it is Figure 2 where the authors want to show the chromosome number 2n = 42, but I cannot count 42 chromosomes in the present format. Could the authors use an alternative, clearer figure to show the cytological evidence of the C. chenopodiifolia chromosome number?
      10. In Figure 5A, it is difficult to determine the exact meaning of "first-diverged chromosome" from the gene tree; is it a tree with phylogenetic meaning or just a framework? Could the authors redraw Figure 5A so that readers understand what is meant?

      Reviewer 3. Kang Zhang.

      The paper produced a chromosome-scale assembly of the C. chenopodiifolia genome in the Brassicaceae family, and offers a valuable resource to investigate the understudied trait of amphicarpy and the origin of new traits by allopolyploidy. I have the following comments which can be considered to improve the ms.

Major points.
      1. The introduction states that Cardamine is among the largest genera within the Brassicaceae family. The octoploid model species C. occulta and the diploid C. hirsuta have been sequenced. Therefore, I propose that a description of the evolutionary relationships among the various species be included here. Additionally, the significance of the amphicarpic trait in the study of plant evolution and adaptation could be highlighted when discussing their octoploid characteristics.
      2. The paper omits a detailed description of genome annotation and significant genomic features, which are essential for clearly illustrating the characteristics of the genome. To enhance this aspect, it would be beneficial to include a circular chart that displays fundamental components such as gene density, GC content, TE density, and collinearity links, among others.
      3. The authors employed various techniques to differentiate the four subgenomic sets within the C. chenopodiifolia genome and ultimately managed to isolate a single sub-genomic set. The paper references the assembly of the octoploid genome of another model plant, C. occulta, within the same genus. Could it be utilized in a comparison with C. chenopodiifolia to achieve improvements? In addition, I suggest the authors examine the gene density differences among these subgenomes, which could be helpful in distinguishing them.
      4. Little important information is included in Tables 1 and 3 and Figure 4. These tables and figures should be moved to the Supplementary data.
      5. Evidence from a Hi-C heatmap should be provided to validate the structural variations among the different sets of subgenomes, such as those in Figure 3.

Minor points.
      1. Figure 5B: please change the vertical coordinate '# gene pairs' to 'Number of gene pairs'. The fonts in some figures are a little bit small; I suggest adjusting them to make them easy to read.

    1. Editors Assessment:

Among hot topics in coral reef research, the difference between anemonefish and other damselfish is currently a popular area of study. In this study the authors provide a new high-quality non-anemonefish genome, which will be of high relevance to further the depth of such analyses. In this case it is the sapphire damselfish Chrysiptera cyanea, a widely distributed damselfish in the Indo-Pacific area, often studied to elucidate the roles of various environmental controls on reproduction and to investigate related hormonal processes. To further the potential of biomolecular analyses based on this species, this study generated the first genome of a Chrysiptera fish from a male individual collected in Okinawa, Japan. Using PacBio HiFi long-read sequencing with 94.5x coverage, a chromosome-scale genome was assembled and 28,173 genes identified and annotated. Peer review gathered more parameters and details on the quality; the final assembly comprised 896 Mb across 91 contigs, with a BUSCO completeness of 97.6%. This reference genome should therefore be of high value for future genetic-based approaches, from population structure to gene expression analyses.
      

      This evaluation refers to version 1 of the preprint

AbstractThe number of high-quality genomes is rapidly growing across taxa. However, it remains limited for coral reef fish of the Pomacentrid family, with most research focused on anemonefish. Here, we present the first assembly for a Pomacentrid of the genus Chrysiptera. Using PacBio long-read sequencing with a coverage of 94.5x, the genome of the Sapphire Devil, Chrysiptera cyanea, was assembled and annotated. The final assembly consisted of 896 Mbp across 91 contigs, with a BUSCO completeness of 97.6%. 28,173 genes were identified. Comparative analyses with available chromosome-scale assemblies for related species identified contig-chromosome correspondences. This genome will be useful as a comparison for studying the specific adaptations linked to the symbiotic life of the closely related anemonefish. Furthermore, this species is present in most tropical coastal areas of the Indo-West Pacific and could become a model for environmental monitoring. This work will help expand coral reef research efforts and highlights the power of long-read assemblies to retrieve high-quality genomes.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.144). These reviews are as follows.

      Reviewer 1. Darrin T. Schultz

      Are all data available and do they match the descriptions in the paper?

      No. The genome is also not yet on NCBI, but it would be good to upload it.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. I suggest later that there should be more information about the HiFi library preparation details, as the manuscript lacks them and it appears to be a non-standard (large insert size) library.

      Is the data acquisition clear, complete and methodologically sound?

No. See above comment.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. No parameters are provided for the genome assembly software, for read trimming, or for other software used.

      Is there sufficient data validation and statistical analyses of data quality?

      No. See extended comments - the read data could use more QC, as well as the genome assembly.

      Is the validation suitable for this type of data?

      No.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. There is a degree of information missing about the data, but another researcher could use them for their study.

      Additional Comments:

Thank you for the opportunity to review the work, The genome of the sapphire damselfish Chrysiptera cyanea: a new resource to support further investigation of the evolution of Pomacentrids, by Gairin and colleagues. In this manuscript, the authors collect an individual of the pomacentrid fish, Chrysiptera cyanea, in Okinawa, Japan. After isolating DNA, the sequencing center at OIST prepared and sequenced a SMRT sequencing library. Additionally, the authors generated some bulk RNA-seq data and sequenced it on the Illumina platform. The authors assembled the genome with two assemblers, and performed some comparisons of the C. cyanea contigs aligned to the chromosome-scale scaffolds of closely related pomacentrids. Given my background, I will mostly comment on the genomic analyses.

      I appreciate the authors' diligence in exploring different genome assembly methods and their efforts in running BUSCO and QUAST to QC the assemblies. The DNA sequencing data and assembly produced contigs that align well with the chromosomes of closely related species (which is convenient for comparative genomics!), and the manuscript presents a solid foundation for better understanding the chromosomal evolutionary history of the Pomacentridae.

      While this work represents an important step toward providing a new genomic resource for Chrysiptera cyanea, I see a few areas where the manuscript could be refined to enhance it as a community resource:

      (1) More information about data generation: Including additional details about the HiFi library preparation, specifically the chemistries used, the number of SMRT cells sequenced, and the bioinformatics steps used to generate the HiFi reads, would improve the manuscript's clarity and reproducibility. I have some questions regarding whether these libraries were prepared for HiFi sequencing: the reported mean read length of 25kbp is 10kbp longer than the standard HiFi library insert size; and the reported amount of bases in the reads, 84 Gbp, is more data than one would expect from a single CCS-processed SMRT cell, but could be the amount of data produced from one CLR run. Characterizing the quality score vs read length distribution could be helpful to characterize the read data. Clarifying these steps taken before the genome was assembled would strengthen the reliability of these reads as a resource.
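      As an illustration of the read-level check suggested here, a minimal sketch (not from the manuscript; the filename is a placeholder) that plots mean Phred quality against read length for a FASTQ of putative HiFi reads:

      ```python
      # Plot mean Phred quality vs read length to distinguish HiFi-like reads
      # (clustered at Q>=20) from a CLR-like low-quality tail.
      import gzip
      from statistics import mean

      from Bio import SeqIO            # Biopython
      import matplotlib.pyplot as plt

      lengths, qualities = [], []
      with gzip.open("hifi_reads.fastq.gz", "rt") as handle:  # placeholder file
          for record in SeqIO.parse(handle, "fastq"):
              lengths.append(len(record))
              qualities.append(mean(record.letter_annotations["phred_quality"]))

      plt.hexbin(lengths, qualities, gridsize=60, bins="log")
      plt.xlabel("Read length (bp)")
      plt.ylabel("Mean Phred quality")
      plt.savefig("length_vs_quality.png", dpi=200)
      ```

      In a genuine HiFi (CCS-processed) library, most reads should cluster at Q20 or above; a broad low-quality tail extending to long read lengths would be more consistent with CLR data.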

      (2) Incorporating a few more important quality control (QC) steps would better clarify the completeness of the genome assembly. For instance, an estimate of genome size from the HiFi reads could be performed with jellyfish and GenomeScope, taking advantage of the k-mer fidelity of HiFi reads. This would provide a more conclusive estimate than the current comparison. Additionally, steps such as checking for contamination and providing an explanation for decisions like haplotig removal would make the assembly process more transparent. Lastly, supplementing the QC analysis with Merqury will provide a reliable answer to how complete the assembly represents the information in the individual HiFi reads in a way that complements BUSCO and QUAST.
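      As a concrete, hedged example of the k-mer QC proposed above, the standard two-step jellyfish run that produces a histogram for GenomeScope might look as follows; the file names, k-mer size, and resource settings are illustrative assumptions, not values from the manuscript:

      ```python
      # Count canonical 21-mers and export a coverage histogram for GenomeScope.
      import subprocess

      subprocess.run(
          ["jellyfish", "count",
           "-C",              # canonical k-mers (strand-collapsed)
           "-m", "21",        # k-mer length
           "-s", "1G",        # initial hash size
           "-t", "16",        # threads
           "-o", "reads.jf",
           "hifi_reads.fastq"],  # placeholder input
          check=True,
      )

      with open("reads.histo", "w") as histo:
          subprocess.run(["jellyfish", "histo", "-t", "16", "reads.jf"],
                         stdout=histo, check=True)
      # reads.histo can then be given to GenomeScope (web app or the
      # genomescope2 CLI) to estimate genome size, heterozygosity, and
      # repeat content from the HiFi reads alone.
      ```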

      (3) The initial analyses of chromosome structure are a promising look into some yet-unexplored chromosomal changes in the Pomacentridae, and I think that incorporating a deeper phylogenetic analysis would build on this strength. Situating the chromosomal findings within a phylogenetic framework could provide stronger support, or actually resolve, the evolutionary interpretations presented. Doing this analysis likely could also help resolve whether the structures seen are genome misassemblies, or instead reflect lineage-specific chromosomal changes. The authors could supplement their beautiful figures using other tools that leverage whole-genome alignments and chromosome visualization to help answer these questions. One tool to try for two-genome comparisons, that the authors may have explored already in place of their ggplot script, is D-GENIES.
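      For readers wanting to try this, a small sketch of one way to prepare input for D-GENIES, which accepts either FASTA files directly or a precomputed PAF alignment; the minimap2 preset and file names below are assumptions for illustration, not the authors' pipeline:

      ```python
      # Align the contigs to a related chromosome-scale assembly and write PAF,
      # which D-GENIES (https://dgenies.toulouse.inra.fr/) can render as a dot plot.
      import subprocess

      with open("contigs_vs_chromosomes.paf", "w") as paf:
          subprocess.run(
              ["minimap2",
               "-x", "asm5",                 # preset for closely related assemblies
               "chromosome_assembly.fa",     # target: related species' chromosomes
               "contigs.fa"],                # query: the new contig assembly
              stdout=paf, check=True,
          )
      ```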

      Overall, this is a valuable resource, and I commend the authors for taking the steps to analyze the chromosomal evolutionary history within the pomacentrids. I look forward to seeing the authors’ future contributions to the field of genomics and chromosome evolution.

Minor Points
      Line 125: Sharing the specific Trimmomatic settings used would enhance the reproducibility of the RNA-seq data processing. The parameters for genome assembly should also be added.
      Line 212: Are there any replicates for the RNA-seq data?
      Line 294: Consider uploading the assembly to NCBI for broader visibility and accessibility.

      Reviewer 2. Yue Song.

      Are all data available and do they match the descriptions in the paper?

      No. The authors have provided clues for accessing the data in public databases such as NCBI, but it seems that the data has not been released; At least, I haven't been able to obtain available data using the provided accession number (e.g. PRJNA1167451). I'm not sure if I've missed any information, but I believe it would be better if the data could be easily accessible to the public.

      Is the data acquisition clear, complete and methodologically sound?

      No. The authors used PacBio's third-generation sequencing technology for genome sequencing, which has become a "necessary option" for obtaining high-quality genomes in current genomic research. However, they did not further advance on the path of "assembling a chromosome-level genome" based on this version. Providing a chromosome-level genome would likely be more meaningful.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. Regarding the genome assembly and annotation process, the method described by the authors is overly simplistic and lacks detailed information on the parameters and procedures used. This makes it difficult for other researchers to effectively replicate the results described in the article.

      Is there sufficient data validation and statistical analyses of data quality?

      No. The authors have calculated the N50 of contigs and the completeness of BUSCO genes, which are indeed two commonly used indicators for assessing the quality of genome assemblies. However, it is still challenging to gain a clear understanding of the assembly quality based solely on these two indicators. Could other measurements be added, such as comparing the continuity and completeness of the assembly with those of closely related species or other comparable species' genomes? Additionally, there is a point that is difficult to understand: the authors report a BUSCO completeness of approximately 94% for the genome, yet a BUSCO completeness of 97% for the gene set. It is puzzling how BUSCO genes that are not annotated in the genome can still be present in the gene set.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. As I mentioned earlier, the authors did not provide detailed information about the processing procedures and parameters, which makes it difficult for other researchers to replicate their results.

      Additional Comments: It is recommended that the authors provide a detailed description of the methods and easily accessible data retrieval methods. It would be even better if the authors could further provide a chromosome-level genome, as T2T (telomere-to-telomere) level genomes are becoming increasingly popular.

    1. Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: **David Burstein** Version: Revision 1

      Review content: The authors thoroughly answered all my questions and addressed all the raised concerns. I have no further comments, and I congratulate them on a well executed study.

    2. Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: **Nguyen Quoc Khanh Le** Version: Revision 1

      Review content: No further comments to authors.

    3. Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: **David Burstein**

      Review content:

In this paper, the authors introduce "PlasGO," a language model for GO annotation of plasmid proteins. The model takes into account two levels of representation: (1) the amino acid level, producing embeddings of the analyzed proteins based on a foundation protein language model, and (2) the plasmid gene level, where the aa-based embeddings are considered as part of a language model representing each protein in the genetic context in which it is encoded. This approach leverages the modular organization of different functions on plasmid genomes. Benchmarking performed by the authors against other deep-learning GO annotation algorithms demonstrates a considerable improvement of PlasGO over existing methods. The research is timely, well-performed, and clearly explained.

      Main issues:
      1. The authors acknowledge that only a relatively small portion of the proteins in their database have GO term annotations, which may limit the model's ability to learn plasmid patterns effectively. As they correctly point out, an iterative approach could be useful to improve performance. Specifically, high-confidence GO annotations predicted by PlasGO could be used as input for another round of prediction, and this process can be repeated until no new reliable predictions are produced. Given that the authors have all the data and models required to run such an iterative search, I would warmly recommend doing so and reporting if and how the predictions improve.
      2. The gLM model (Hwang et al.) is highly similar to PlasGO in terms of the general approach, combining protein embedding (ESM2 in gLM) with genomic contextual data. Discussing the differences between the approaches and comparing their performances would provide important context and highlight the novelty of PlasGO.
      3. The agreement of the PlasGO predictions with the GO terms retrieved from sequence databases ("ground truth") was determined by calculating the ratio of terms shared between the high-confidence predictions and the ground truth, divided by the number of high-confidence predictions. This measure is asymmetrical and might generate over-optimistic results. At the extreme, if the algorithm produces a very large number of predictions, this value will tend to be very high just because there are many more GO terms predicted than GO terms in the ground truth. I strongly recommend using a symmetrical measure, such as the Jaccard index (see the sketch after this list).
      4. The methodology for calculating average precision and recall is potentially skewed. The authors compute average precision over proteins with at least one annotation, ignoring proteins lacking annotation (instead of counting these as misclassifications). This approach makes sense given that numerous plasmid proteins lack GO annotations. However, the average recall is calculated across all proteins (N). For unannotated proteins, the correct classification is not defined. Since these cases are also considered in the measure of recall, I assume PlasGO high-confidence predictions were considered correct. This seems like a problematic assumption that might lead to skewed results. I would therefore suggest that unannotated proteins be omitted from the recall calculation, as was done in the precision calculation.
      5. The authors identify and filter out "elusive" GO terms that are difficult to predict. This is reasonable in the scope of this paper, but since it is still a central limitation of PlasGO, I would suggest discussing (even if not implementing) approaches to improve the predictions in these challenging cases.
      6. In Figures 8 and 9, a perfect AUPR of 1 is reported in several cases. Such perfect classification performance is highly unusual and warrants an examination to double-check this result and, if it persists, a discussion of the underlying reasons for these perfect results.
      7. The masking approach during training is not entirely clear. If I understand correctly, annotated proteins are being masked during prediction. This is expected to lead to the loss of a lot of contextual information. On the other hand, during training, the unannotated proteins are masked, losing potentially informative sequence data. I would suggest splitting complete plasmids between train/test/validation sets and, if needed, performing cross-validation to cover the entire dataset. This way, for each plasmid, the entire protein sequence and context information will be used.
      8. There seems to be somewhat of a contradiction between the two following statements appearing in the paper: (1) "CaLM, despite being a pre-trained PLM, did not surpass the top three tools using ProtTrans, which is consistent with the results reported in CaLM's paper" and (2) "Experimental results demonstrate that the protein representations derived from CaLM outperform other PLMs in the classification of GO terms." Furthermore, other PLMs, such as ESM, performed better at GO annotation prediction according to the CaLM paper. These might have been more appropriate for this task. CodonBERT, a codon-based PLM also based on ProtTrans, could also have been a suitable alternative.
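      To make point 3 concrete, a minimal illustration (hypothetical GO IDs and function names, not PlasGO code) of the symmetrical measure suggested there:

      ```python
      # Jaccard index between predicted and ground-truth GO term sets.
      def jaccard(predicted: set, truth: set) -> float:
          """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
          if not predicted and not truth:
              return 1.0
          return len(predicted & truth) / len(predicted | truth)

      pred = {"GO:0003677", "GO:0005524", "GO:0016787", "GO:0046872"}
      truth_set = {"GO:0003677", "GO:0005524"}

      # A one-sided overlap ratio looks perfect despite two spurious predictions...
      print(len(pred & truth_set) / len(truth_set))  # 1.0
      # ...while the Jaccard index penalizes the size mismatch symmetrically.
      print(jaccard(pred, truth_set))                # 0.5
      ```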

Minor issues:
      - To improve the reading flow of the paper, consider reordering the ablation section to precede the "Performance on the RefSeq test set" section, since the ablation studies section provides the rationale for the choices of architecture and foundation protein language model.
      - "We initially downloaded all available plasmids from the NCBI RefSeq database": I would suggest specifying the query or approach used to acquire all plasmids from RefSeq.
      - I would recommend using the term "protein embedding" instead of "protein token," which may be misleading. The term "token embeddings" used in Figure 3 is more accurate than "protein token," and yet "protein embeddings" is probably the most accurate term in this case.
      - Figure 1: To provide an accurate depiction of representative plasmids, I suggest including unannotated genes in Figure 1.
      - Figure 4: "Global average pooling" was misspelled.
      - Figure 10: "The prediction precision of PlasGO is determined by calculating the ratio of the number of proteins in set A that are also present in set B to the total number of predicted high-confidence proteins (|A|)". If I understand the figure correctly, it should be "number of GO terms" instead of "number of proteins" in both cases.
      - A figure (or supplementary figure) depicting one of the plasmids with some of the high-confidence predictions in the case study section (along the same lines as Figure 1, but with a distinction between previously known and unknown annotations) could enhance the clarity and impact of the results.

    4. Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: **Nguyen Quoc Khanh Le**

Review content:
      1. The manuscript introduces PlasGO, which leverages a hierarchical architecture for GO term prediction in plasmid-encoded proteins. However, the novelty of the approach could be questioned. While the combination of protein language models and BERT for GO prediction is innovative, similar methods have been applied in other contexts.
      2. The study heavily relies on data from the RefSeq database, yet there is limited discussion of the quality and completeness of these data. The manuscript should address potential biases introduced by incomplete or incorrect GO annotations in the database. Moreover, the study uses protein sequences of up to 1K length, which might exclude relevant longer sequences, potentially limiting the model's applicability to all plasmid-encoded proteins.
      3. The manuscript claims that PlasGO can generalize well to novel proteins, but this claim is based on a specific dataset. The model's generalizability should be tested on more diverse and independent datasets, including plasmids from different bacterial species or environmental contexts.
      4. While the model's performance is quantitatively evaluated, the interpretability of the results remains unclear. The study should include an analysis of how well the model's predictions align with known biological functions and pathways. Additionally, it would be helpful to include examples where PlasGO provides novel insights that other models do not, thereby demonstrating its practical utility.
      5. The manuscript does not provide detailed information on the computational resources required to train and run PlasGO. Given the complexity of the model, there are potential concerns about its scalability, particularly for larger plasmid datasets or in settings with limited computational power.
      6. The manuscript compares PlasGO with several state-of-the-art tools, but the comparison might not be fully exhaustive. Additionally, statistical significance tests for performance differences should be provided to support the comparative analysis.
      7. Language models have been used in previous bioinformatics studies, e.g., PMID: 37381841 and PMID: 38636332. Therefore, the authors are encouraged to refer to more such works in this description to attract a broader readership.
      8. The study should discuss any ethical considerations related to the use of public datasets, particularly regarding data privacy and consent if any sensitive data are involved. Furthermore, the potential commercial implications of the PlasGO tool, especially if it is used for proprietary research, should be addressed.
      9. While the manuscript mentions that PlasGO's code will be made available, it is crucial to ensure that all aspects of the research are fully reproducible.
      10. The hierarchical architecture and the use of extensive training data might lead to overfitting, especially given the high dimensionality of the input features. The manuscript should discuss the measures taken to prevent overfitting, such as regularization techniques, dropout, or cross-validation strategies.
      11. The study could benefit from a more detailed discussion of the practical implications of using PlasGO in real-world plasmid research. How can this tool be integrated into existing workflows for plasmid function prediction? What are the potential limitations in practical applications?

  11. Dec 2024
    1. Editors Assessment:

Coded and written up as part of the African Society for Bioinformatics and Computational Biology (ASBCB) Omics codeathons, NeuroVar is a new tool for visualizing genetic variation (single nucleotide polymorphisms and insertions/deletions) and gene expression data related to neurological diseases. The open-source R tool is available as an online Shiny application and as a desktop application that does not require any computational skills to use. Initial validation and case studies for the tool present analyses of biomarkers in ALS, exemplifying its relevance to personalized medicine and genomic discovery. Being an open-source project, after peer review more detail was added to the paper and GitHub repo on how to contribute, report issues, or seek support, alongside improved installation guidelines. The paper states that future developments will expand its dataset beyond the ClinGen database to encompass new databases and broader genetic inquiries.

This evaluation refers to version 1 of the preprint

    2. AbstractBackground The expanding availability of large-scale genomic data and the growing interest in uncovering gene-disease associations call for efficient tools to visualize and evaluate gene expression and genetic variation data.Methodology Data collection involved filtering biomarkers related to multiple neurological diseases from the ClinGen database. We developed a comprehensive pipeline that was implemented as an interactive Shiny application and a standalone desktop application.Results NeuroVar is a tool for visualizing genetic variation (single nucleotide polymorphisms and insertions/deletions) and gene expression profiles of biomarkers of neurological diseases.Conclusion The tool provides a user-friendly graphical user interface to visualize genomic data and is freely accessible on the project’s GitHub repository (https://github.com/omicscodeathon/neurovar).

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.143). These reviews are as follows.

**Reviewer 1. Joost Wagenaar**

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      Yes. There is a clear statement of need, but the audience is not very targeted. The investigators outline the need for tools to help users identify phenotypic subtypes of disease and describe how the tool would help with this. Although the investigators mention that the tool will allow users to analyze biomarker data, the scope of the types of analysis that can be performed is relatively small. I think that it would benefit the tool to better define the targeted users (clinicians, data scientists, enthusiasts?) and develop specifically towards a single audience.

      The tool leverages several existing R packages to run the analysis over the data and the provided tool can be described as a user-friendly wrapper around these libraries. The interface allows users to submit a file, and plot the results of the analysis within the app.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No. I did not see any guidelines for contributing to the project in the paper, or in the associated GitHub repository.

      Is the documentation provided clear and user friendly?

      Yes, the investigators did a great job providing documentation and installation instructions. [also video demo: https://youtu.be/cYZ8WOvabJs?si=DnxVuL65yr0wYYjq]

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

Yes, the investigators provide a clearly-stated list of dependencies and instructions on how to install them prior to running the application.

      Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?

      Yes. The paper, and GitHub repository point to a public dataset that can be used to test the application.

      Are there (ideally real world) examples demonstrating use of the software?

      Yes. The investigators provide a video highlighting the use of the application and provide a use-case where they use the app to validate some existing knowledge.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

No. The application is sufficiently small that no automated or manual testing would necessarily be required beyond validating that the application works.

      Additional Comments:

The proposed application provides a nice tool that makes visualization of VCF data and analysis easier for users who are not comfortable working within R directly. It provides a nice demonstration of how the scientific community can wrap scientific tools into deployable applications that can be easily understood. A question remains about the target audience for an application like this, as most people who are interested in these types of analyses and visualizations are, in fact, familiar enough with R, or other programming languages, to directly leverage the libraries and plot the results.
      

      That said, as data integration and multi-omics visualization becomes more complex and the app provides more ways to visualize the data in meaningful ways, I do strongly believe that applications like this can provide a meaningful addition to the scientific tools that are available.

      Reviewer 2. Ruslan Rust

Is the language of sufficient quality? Yes. The language of the document is of sufficient quality; I did not notice any major issues.

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

Yes, the authors provide a statement of need: they mention the need for a specialized software tool to identify genes from transcriptomic data and genetic variations such as SNPs, specifically for neurological diseases. Perhaps the authors could expand a bit in the introduction on how they chose the diseases; e.g., stroke is not listed among the neurological diseases.

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?

Yes, the source code is available on GitHub at the following link: https://github.com/omicscodeathon/neurovar. Additionally, the authors deposited the source code and additional supplementary data in a permanent repository on Zenodo under the following DOI: https://zenodo.org/records/13375493. They also provided test data: https://zenodo.org/records/13375591. I was able to download and access the complete set of data.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No. I did not find any way to contribute, report issues or seek support. I would recommend that the authors add this information to the Github README file.

      Is the code executable?

Yes, I could execute the code using RStudio 4.3.3.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

Yes. I could follow the installation process, but perhaps the authors could add a few more details on how to download from GitHub, as some scientists may have trouble with it. An installation video (in addition to the video demonstration of the NeuroVar Shiny App) might also be helpful.

      Is the documentation provided clear and user friendly?

Yes. The documentation is provided and is user friendly. I was able to install, test and run the tool using RStudio. The authors may consider also offering a simple website link for the R Shiny tool if possible; this would enable access for scientists who are not familiar with R. It is especially great that the authors provided a demonstration video, and I was able to reproduce the steps. However, I would recommend adding more information to the YouTube video; e.g., a reference to the preprint/paper and the GitHub link would help connect the data. Perhaps the authors could also expand a bit on the options for exporting data from their software and provide different formats, e.g., PDF/PNG/JPEG. I think this is important for many researchers who want to export their outputs, e.g., from the heatmaps.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

Yes, dependencies are listed in the manuscript and in the repository, and they are installed automatically. It worked for me with RStudio version 4.3.3.

      Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?

Yes, the authors provide test data with this DOI: https://doi.org/10.5281/zenodo.13375590.

      Are there (ideally real world) examples demonstrating use of the software?

Yes, the authors use the example of epilepsy (focal epilepsy) and the gene of interest DEPDC5. I replicated their search and got the same results. However, I find that the labels in Figure 1 on the gene's transcript could be a bit clearer; e.g., it is not clear to me what transcript start and end refer to. It might also be helpful if the authors provided an example dataset for the expression data that is loaded in the software by default. Furthermore, the authors present case study results using RNA-seq in ALS patients with mutations in the FUS, TARDBP, SOD1, and VCP genes.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

      No. Automated testing is not used as far as I can access it.

      Additional Comments: The preprint version of this paper was also reviewed in ResearchHub: https://www.researchhub.com/paper/7381836/neurovar-an-open-source-tool-for-gene-expression-and-variation-data-visualization-for-biomarkers-of-neurological-diseases/reviews

      My expertise: I am assistant professor in neuroscience and physiology at University of Southern California and work on stem cell therapies on stroke. We are particularly interested in working with genomic data and the development of new biomarkers for stroke, AD and other neurological diseases.

      Summary: The authors provide a software tool NeuroVar that helps visualizing genetic variations and gene expression profiles of biomarkers in different neurological diseases.

    1. algorithm are used to train a KNN classifier that predicts the demultiplexing classes of unassigned or uncertain cells. We benchmark demuxSNP against hashing (HTODemux, cellhashR, GMM-demux, demuxmix) and genotype-free SNP (souporcell) methods on simulated and real data from renal cell cancer. Our results demonstrate that demuxSNP outperformed standalone hashing methods on low quality hashing data, improving overall classification accuracy and allowing more high RNA quality cells to be recovered. Through varying simulated doublet rates, we show genotype-free SNP methods are unable to identify biological samples with low cell counts at high doublet rates. When compared to unsupervised SNP demultiplexing methods, demuxSNP’s supervised approach was more robust to doublet rate in experiments with class size imbalance.Conclusions demuxSNP is a performant demultiplexing approach that uses hashing and SNP data to demultiplex datasets with low hashing quality where biological samples are genetically distinct. Unassigned cells (negatives) with high RNA quality can be recovered, making more cells available for analysis, especially when applied to data with low hashing quality or suspected misassigned cells. Pipelines for simulated data and processed benchmarking data for 5-50% doublets are publicly available. demuxSNP is available as an R/Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.demuxSNP).

Reviewer 2: Haynes Heaton
      Reviewer Comments: demuxSNP is a tool for combining the demultiplexing capabilities of hashtagging and SNP-based genotype demultiplexing of scRNA-seq with cells from individuals mixed for cost savings and batch effect reduction. The authors test this method in comparison with other methods for either hashtag demultiplexing or genotype-based demultiplexing individually, and show improvements in recovering cells not confidently assigned via hashtagging, as well as overcoming cases where genotype demultiplexing fails.

      Comments on results: For Figure 2, this is mostly good for recovering low hash quality cells, although, because the low quality hashing has been simulated in order to have a ground truth to compare to, it is unclear whether this simulation method or amount is realistic. Does it compare to the % unassigned from real datasets? For Figure 3, my main issue is: why would souporcell outperform demuxSNP at any % doublets? Souporcell is using strictly less information than demuxSNP because it does not assume hashtags. Ideally this would be fixed, or at the very least an adequate explanation given.

      Comments on methods: "SNPs are filtered to those located within genes expressed across most cells in the dataset" and "SNPs with few reads across cells in the dataset are removed": can I get numbers on this? If you require, say, 50% of cells to express a SNP locus, it will throw out a huge number of still-informative SNPs. I find that utilizing as much of the data as possible is generally better. I assume this is done because of the KNN method, which requires high overlap in SNPs between the cells being compared. "Labels from high confidence singlets along with simulated doublets used to train KNN classifier and predict negative/uncertain cells": why a KNN model here? Genotype data are not Euclidean. Each SNP locus for each cell should be drawn as a binomial with underlying p = 0 + some error (homozygous ref), p = 0.5 +/- some error (heterozygous), or p = 1.0 - some error (homozygous alt). A statistical model would be more appropriate for this. "To leverage classification techniques applicable to binary data, SNP status is recoded to absent/present (1,0) and k-nearest-neighbour classification (KNN) [31] is performed using Jaccard coefficient": ah, so you force the data to be Euclidean, but this does not take full advantage of the data. One problem with this will be when two individuals are related. For SNP loci of a parent/child pair there are many cases where the locus could potentially have disambiguated them but won't, because one individual is heterozygous (so SNP present) and the other is homozygous alt (still SNP present).

      General comments (these are small nitpicks): The primary failure modes of genotype demultiplexing are, in no particular order, (1) a small number of cells in a minority cluster, (2) a large number of individuals multiplexed together, and (3) a large number of doublets causing lots of noise in the statistical models. The authors have adequately addressed improvements in 1 and 3. However, I think the paper would be stronger if it also included experiments with >30 individuals multiplexed together. For 3, I think further discussion is merited on the trade-offs of hyperloading scRNA-seq protocols, including the number of quality singletons vs. loading rate and multiplet rate, and how many multiplets escape detection. Experiment designers want to maximize the number of singletons while minimizing the number of doublets that escape detection and harm downstream analyses. 10x Genomics gives the ballpark doublet % to be expected as 1% per 1,000 cells recovered, but this is a Poisson loading process, so the true effect is not linear (see the sketch below). The authors test up to 50% doublets (which is good to test), and some experimenters do attempt to load enough to recover 50k cells from a single lane, but I doubt that would be a recommended loading level for downstream analysis unless the doublet detection is highly effective.
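      To illustrate the non-linearity mentioned here, a minimal sketch of the Poisson loading arithmetic; the partition count and cell numbers are illustrative assumptions, not 10x specifications:

      ```python
      import math

      def multiplet_rate(loaded_cells: int, n_partitions: int = 100_000) -> tuple[float, float]:
          """Fraction of cell-containing droplets with >= 2 cells under Poisson loading."""
          lam = loaded_cells / n_partitions        # mean cells per droplet
          p_empty = math.exp(-lam)
          p_singlet = lam * math.exp(-lam)
          p_multi = 1.0 - p_empty - p_singlet      # two or more cells
          return lam, p_multi / (1.0 - p_empty)    # condition on non-empty droplets

      for cells in (5_000, 20_000, 50_000):
          lam, rate = multiplet_rate(cells)
          print(f"{cells:>6} cells loaded: lambda = {lam:.2f}, multiplet rate = {rate:.1%}")
      ```

      Under this model the multiplet fraction departs noticeably from the simple 1%-per-1,000-cells rule of thumb as loading increases.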

    2. Background: Multiplexing single-cell RNA sequencing experiments reduces sequencing cost and facilitates larger-scale studies. However, factors such as cell hashing quality and class size imbalance impact demultiplexing algorithm performance, reducing cost effectiveness. Findings: We propose a supervised algorithm, demuxSNP, leveraging both cell hashing and genetic variation between individuals (SNPs). The supervised algorithm addresses fundamental limitations in demultiplexing with only one data modality. The genetic variants (SNPs) of the subset of cells assigned with high confidence using a probabilistic hashing

      Reviewer 1: Lei Li

      Reviewer Comments: Lynch et al. developed demuxSNP, a supervised demultiplexing approach for single-cell cell hashing data in a multi-modal (hashtag expression and SNP profiles) fashion. They utilized a probabilistic method to infer sample identities of cells using the cell hashing modality, and then built a KNN model using the SNPs of cells assigned with high confidence in the previous step. They then use this KNN model to predict cell identities for cells assigned as uncertain or negative by cell hashing. They have demonstrated the performance through a comparison with existing single-modal methods using both real and simulated data, and have published an R package for the research community. It is interesting and encouraging to see another study focusing on multi-modal demultiplexing for cell hashing data. Below are some major and minor points from my side:
      1. I am not surprised that a multi-modal demultiplexing approach beats single-modal methods across both real and simulated datasets. To my knowledge, at least two groups have proposed multi-modal demultiplexing approaches for cell hashing data. Both were uploaded to bioRxiv last year and were published recently. One is called hadge (https://link.springer.com/article/10.1186/s13059-024-03249-z), and the other HTOreader hybrid (https://academic.oup.com/bib/article/25/4/bbae254/7686601), which is discussed by this study. Hadge is a comprehensive framework that integrates popular cell hashing-based and SNP-based methods, allowing for a joint deconvolution by combining the best method from each modality. HTOreader hybrid proposed an improved demultiplexing method for cell hashing signals, and also integrates demultiplexing results from both modalities for better deconvolution in a hybrid fashion. Indeed, this work has implemented a different method for the same purpose. I tried both methods, and there are some major updates between the bioRxiv and published versions. Thus, even though one of them has been discussed, I think it is still necessary to include these two published methods in the comparison, to reveal the pros and cons of the different methods and thereby provide useful information for users selecting a method for their specific experiment configuration.
      2. The demuxSNP method picks the top N commonly expressed genes for SNP calculation. In the tutorial on GitHub, N was set to 100. I am wondering whether, in a more heterozygous dataset, N = 100 is still sufficient. Is there a way for users to determine N for their specific dataset more systematically? Or can the authors show some data to demonstrate that N = 100 is robust across different datasets?
      3. The dataset GSE267835 is private. Please provide a reviewer token in the Data Availability statement during the submission process.
      4. The color of uncertain cells in Fig. 1B is a bit misleading, because in Fig. 1A the same color was used to represent "background staining". Even though A and B are different panels, the big black arrow makes readers think they show the same data. Changing the color of uncertain cells would avoid confusion.
      5. In Fig. 2A and B, what are the units of the x axis? Are they log2-transformed hashtag counts? Please add that information to the figure and legend.
      6. For Fig. 2C and D, please use the formal spelling of the names of existing methods as you did in Fig. 2E.
      7. Please add line numbers to the draft for reviewers' convenience.
      8. Some minor format issues exist. For example, the "Result" section should have a header format instead of normal text.
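
For readers unfamiliar with the KNN step both reviews refer to, below is a minimal sketch of Jaccard-distance KNN on present/absent SNP profiles, with placeholder random data, shapes, and k; it mirrors the approach described in the quoted methods but is not demuxSNP's actual code:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(500, 200)).astype(bool)  # high-confidence singlets
y_train = rng.integers(0, 4, size=500)                      # donor labels 0..3
X_query = rng.integers(0, 2, size=(50, 200)).astype(bool)   # uncertain/negative cells

# SNPs recoded to absent/present and classified with a Jaccard-distance KNN
knn = KNeighborsClassifier(n_neighbors=15, metric="jaccard")
knn.fit(X_train, y_train)
predicted_donor = knn.predict(X_query)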

    1. performance of stMMR in multiple analyses, including spatial domain identification, pseudo-spatiotemporal analysis, and domain-specific gene discovery. In chicken heart development, stMMR reconstructed the spatiotemporal lineage structures, indicating an accurate developmental sequence. In breast cancer and lung cancer, stMMR clearly delineated the tumor microenvironment and identified marker genes associated with diagnosis and prognosis. Overall, stMMR is capable of effectively utilizing the multi-modal information of various SRT data to explore and characterize tissue architectures in homeostasis, development and tumors.

      Reviewer 2: Hongzhi Wen

      Reviewer Comments: The paper introduces stMMR, a multi-modal graph learning method designed to integrate gene expression, spatial location, and histological information for accurate spatial domain identification from spatially resolved transcriptomics (SRT) data. The method employs graph convolutional networks (GCN) and self-attention modules, along with cross-modal contrastive learning, to enhance feature integration and representation.
      Strengths:
      1. Using GCN to capture local spatial dependency is natural and effective. Introducing an attention mechanism for capturing global relations intuitively makes sense; however, it needs more justification. Contrastive learning for cross-modal feature fusion is also a natural choice in multimodal learning. Overall, the methodology is novel and solid.
      2. Extensive benchmark analysis across various types of spatial data and tissues demonstrates superior performance of the method in spatial domain identification, pseudo-spatiotemporal analysis, and domain-specific gene discovery. The empirical evidence is very convincing.
      3. The method's application to chicken heart development, breast cancer, and lung cancer showcases its potential in reconstructing spatiotemporal lineage structures and delineating tumor microenvironments, highlighting its value in clinical research.
      Weaknesses:
      1. In Figure 4, SpaceFlow is the only baseline for the case study. However, the performance of SpaceFlow is not top-ranked in the other experiments. There should be a justification for why SpaceFlow is highlighted here.
      2. The contribution of the global attention mechanism to the whole framework is not very clear. The authors may provide more intuition and empirical justification (e.g., an ablation study) if they would like to highlight this design.
      3. By introducing the hyperparameters $\alpha$, $\beta$ and $\gamma$ in Eq. (11), the method has a significantly larger search space than other methods. It is important to note how these hyperparameters are chosen in practice and, more importantly, whether the test performance is consulted when adjusting these hyperparameters. This might result in an unfair evaluation.

    2. Abstract: Deciphering spatial domains using spatially resolved transcriptomics (SRT) is of great value for characterizing and understanding tissue architecture. However, the inherent heterogeneity and varying spatial resolutions present challenges in the joint analysis of multi-modal SRT data. We introduce a multi-modal geometric deep learning method, named stMMR, to effectively integrate gene expression, spatial location and histological information for accurately identifying spatial domains from SRT data. stMMR uses graph convolutional networks (GCN) and a self-attention module for deep embedding of features within each modality and incorporates similarity contrastive learning for integrating features across modalities. Comprehensive benchmark analysis on various types of spatial data shows superior

      Reviewer 1: Shihua Zhang

      Reviewer Comments: In this paper, the authors developed a multi-modal deep learning method for identifying spatial domains from ST data by integrating gene expression, spatial location and histological information. This method adopts graph convolutional networks and a self-attention module for deep embedding of features within each modality and incorporates similarity contrastive learning for integrating features across modalities. They performed several typical analyses to validate this method. Generally, the writing of this paper is OK. More specific comments:
      1. Spatial domain identification has been overwhelmingly studied recently. The authors need to pay more attention to why a new method needs to be introduced. The novelty of the current method should be carefully clarified. For example, how does the histological information help to improve the performance? Does the "geometric" deep learning really help?
      2. This method has been applied to some stereotypical data. The authors should apply it to data recently generated by new ST techniques.
      3. Figure 3: stMMR enhances spatial gene expression profiles. It is hard to see how the method enhances the spatial gene expression (e.g., LPL).
      4. With the accumulation of multi-slice spatial transcriptome data, the integration and alignment of spatial transcriptome data will be essential. Can this method be extended to this situation like STAGATE (Nat Comput Sci. 2023 Oct; 3(10):894-906)? This would be valuable for ST analysis.
      5. The scalability of this method should be carefully explored.
      6. The authors should provide a detailed tutorial for users.

    1. Conclusions: The chromosome-level genome of piauçu exhibits high quality, establishing a valuable resource for advancing research within the group. Our discoveries offer insights into the evolutionary dynamics of Z and W sex chromosomes in fish, emphasizing ongoing degenerative processes and indicating complex interactions between Z and W sequences in specific genomic regions. Notably, amhr2 and bmp7 are potential candidate genes for sex determination in M. macrocephalus.

      Reviewer 2: Changwei Shao

      Reviewer Comments: The authors reported the M. macrocephalus reference genome with a highly degenerated ZW sex chromosome and analyzed the expression pattern of the sex chromosomes. In a word, this work extends our understanding of the mechanisms of sex chromosome evolution in fish species. The interpretation of the results is sound for the most part and gives enough proof to verify their results. I just have a few concerns, as follows.
      1. On line 54, please confirm this. In the tongue sole, the Z chromosome (21.91 Mb) is larger than the W chromosome (16.45 Mb).
      2. On lines 88, 89 and 116, the numbers mentioned do not correspond with the results in Figure 1A. Please confirm them.
      3. In the section on "Gene Prediction and Annotation", a more comprehensive prediction of gene structure can be achieved by combining three methods: de novo prediction, transcriptome prediction, and homology prediction. The results obtained from these three approaches can be integrated using the EVM software, followed by annotation assessment with BUSCO. The methods section is somewhat vague and lacks clear logic. For protein prediction, it is advisable to utilize multiple databases, such as SwissProt, InterPro, and Nr, to corroborate evidence from various sources.
      4. On line 210, there is an error in the caption of Figure 3. Figure 3B should be a collinearity map of the linkage groups and chromosomes.
      5. The SNP sites identified in females may include those from the Z chromosome; linkage group 23 (LG23) will contain SNP information from both the Z and W chromosomes. This could potentially affect the demarcation of the region of sexual conflict.
      6. On the sex chromosomes, are there candidate genes related to sex differentiation in regions with a high enrichment of specific SNPs? Please provide a detailed explanation.
      7. What is the distribution of genes in the Z and W chromosome-specific regions, and what is the gene loss rate?

    2. Abstract
      Background: Megaleporinus macrocephalus (piauçu) is a Neotropical fish within Characoidei that presents a well-established heteromorphic ZZ/ZW sex-determination system and thus constitutes a good model for studying W and Z chromosomes in fishes. We used PacBio reads and Hi-C to assemble a chromosome-level reference genome for M. macrocephalus. We generated family segregation information to construct a genetic map, pool-seq of males and females to characterize its sex system, and RNA-seq to highlight candidate genes of M. macrocephalus sex determination.
      Results: The M. macrocephalus reference genome is 1,282,030,339 bp in length and has a contig and scaffold N50 of 5.0 Mb and 45.03 Mb, respectively. Based on patterns of recombination suppression, coverage, Fst, and sex-specific SNPs, three major regions were distinguished in the sex chromosome: W-specific (highly differentiated), Z-specific (in degeneration), and PAR. The sex chromosome gene repertoire was composed of genes from the TGF-β family (amhr2, bmp7) and Wnt/β-catenin pathway (wnt4, wnt7a), and some of them were differentially expressed.

      Reviewer 1: Yusuke Takehana

      Reviewer Comments: The authors assembled a chromosome-level genomic sequence and identified the sex chromosomes of the fish Megaleporinus macrocephalus. This manuscript is potentially interesting because the evolution of sex chromosomes and sex-determining genes is one of the most fundamental and popular topics in evolutionary biology. However, the conceptual advance and the novelty of this study are quite limited. It is another paper adding one more species to the list of assembled genomes in this fish family. In addition, there is nothing new in the description of the sex chromosomes, such as their degenerative signature. Such studies have already been conducted many times and similar conclusions have been reported. Furthermore, the experimental evidence presented appears rather preliminary and is not sufficient to support the claims and interpretations presented in the discussion. I am therefore afraid that I have to say that the manuscript does not provide new insights into the evolution of sex chromosomes, and thus will not be of sufficient interest to the readers of GigaScience.
      1. Overall, the paper was very difficult to read due to a lack of logical structure and many errors, such as confusion between males and females, between chromosomes and linkage groups, and so on.
      2. The introduction is not logically written. It is unclear what is known and to what extent, and why the genome of this species is being determined.
      3. I did not understand why the authors concluded that Chr13 is the W chromosome and not the Z chromosome. They should assemble the Z and W chromosomes separately and confirm them from different information. It is also unclear how they rule out the possibility that the sequences are chimeric. If they really want to reveal the evolutionary process of sex chromosomes, they should use all the data (Hi-C, linkage analysis, pool-seq, gene information) to compare the structure of Z and W in detail, including synteny with closely related species.
      4. The analysis of sex chromosome gene candidates is too poor. Basic analyses have not been conducted on whether these genes are W-specific, whether they are on both Z and W, whether they have paralogs on autosomes, how much sequence variation there is, when and in which cells they are expressed, etc.
      5. All of the discussions are superficial and lacking in logic, and it is unclear what the authors want to discuss.
      6. The figure legends are poorly explained and contain incorrect information, so I do not understand the meaning of the data at all.
      7. This manuscript contained many grammatical errors leading to many confusing statements, and some sentences that were grammatically correct but awkward in meaning. I strongly recommend that the authors seek the advice of someone with a good knowledge of English, preferably a native speaker.

    1. Conclusions: We applied CAT Bridge to experimentally obtained Capsicum chinense (chili pepper) and public human and Escherichia coli (E. coli) time-series transcriptome and metabolome datasets. CAT Bridge successfully identified genes involved in the biosynthesis of capsaicin in C. chinense. Furthermore, case study results showed that the convergent cross mapping (CCM) method outperforms traditional approaches in longitudinal multi-omics analyses. CAT Bridge simplifies access to various established methods for longitudinal multi-omics analysis, and enables researchers to swiftly identify associated gene-metabolite pairs for further validation.

      Reviewer 2: Jitendra Kumar Barupal

      Reviewer Comments: To the authors: Thank you for the opportunity to review the manuscript GIGA-D-24-00083. The authors created a tool to predict associations between genes and metabolites using various algorithms, and provide it as a web application and as a Python package. To get at the reciprocal relationship between genes and metabolites, i.e., which metabolites can change which genes or vice versa, this tool can be a toolkit for biologists and bioinformaticians. The tool has applications especially where the relationship between changes in genes and metabolites is not direct; many complex mechanisms exist, e.g., epigenetic regulation or polymorphism, so the tool can be an alternative to other available tools. The manuscript also brings the community's focus to causal relationships instead of just correlation-based relationships; the tool uses temporal causality algorithms for predicting relationships between genes and metabolites. However, I recommend major revisions before publication. Here are my reasons and comments for the revisions:
      General issues with web accessibility and package installation:
      1. There are concerns about web accessibility, as indicated by web browsers flagging the connection as insecure. This may stem from geographical restrictions or the absence of HTTPS certification. Addressing these issues would ensure secure access to the server.
      2. Despite successful initiation of the client application from the git repository as a Python module, no results were generated upon launching. It is suggested that the authors distribute the tool as a Docker image to facilitate seamless usage, eliminating concerns regarding dependencies and version compatibility.
      Other comments:
      1. There are inconsistencies regarding data preprocessing. While the manuscript mentions that the tool will handle preprocessing, it also indicates that users need to provide processed files. Clarification is needed on whether preprocessing is required; it seems the tool requires preprocessed data.
      2. For clarity, use "causality and correlation" instead of "causality/correlation" algorithms.
      3. Can the tool process any new temporal numerical data series, or does it specifically filter for genes? For instance, if I provide a list of proteins along with a list of genes, will I receive the association between them? It is suggested to include this in the discussion section.
      4. Does the tool offer the capability to generate a causal diagram or network from these vectors, thereby providing visual support for the assertion of a causal relationship between metabolites and genes? If the authors are working in this direction, it is suggested that this information be added to the discussion section.
      5. What definition of causal relationship did the authors use, and could they provide a citation for this definition? Was predictability or some other criterion used for causal relationships? Please include the definition or criteria in the introduction and methods sections.
      6. What are the minimum and maximum numbers of time points (intervals) for input files? E.g., will the tool work if I provide only two time points, or if I provide 48 time points? Please include this information in the methods section.
      7. What is the influence of the number of time points on the vector relationship presented in the paper? Have the authors conducted any studies addressing this question? Please include the results and discussion.
      8. Could the authors clarify which heuristic algorithm was employed for ranking the genes? Additionally, can they elaborate on how their approach to gene ranking is heuristic rather than relying on mathematical optimization or algorithmic methods? Clarification of the term "heuristic" would be beneficial.
      9. Could the authors offer an example from studies conducted on yeast, E. coli, or other simple organisms demonstrating how changes in gene sequences have readily been observed to affect metabolite levels? Please include that in the results section.
      10. Does the tool generate a vector indicating many-to-many relationships or one-to-one relationships? In other words, does it reveal whether one gene is associated with many metabolites, and vice versa, or does it establish a single gene-metabolite relationship? Please include this in the results section. Also, in the discussion section, please include examples of applications of these relationships in various fields, e.g., metabolic engineering or cancer metabolism.
      11. Table 1 compares the features of CAT Bridge with other available methods. It should also encompass features provided by other tools that are not available in the authors' tool, such as knowledge-driven integration or integration with a third-party database. Additionally, it should address the limitation posed by the requirement of time-series data, which is not just a strength but also a challenge, particularly for epidemiology studies where multiple time series for gene expression may not be feasible.
      12. Please use alternative phrases to "self-generated data", such as "experimentally obtained data", to clarify that the authors are utilizing data acquired in the lab to validate the tool (e.g., lines 42, 223, and 492).
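
As a purely illustrative aside on the temporal-causality criteria discussed above, here is a minimal sketch of one lag-aware test (Granger causality, via statsmodels); it is one example of such a criterion and not necessarily the method CAT Bridge implements (the paper benchmarks several, including CCM):

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
gene = rng.normal(size=100)
metabolite = np.roll(gene, 2) + 0.3 * rng.normal(size=100)  # lag-2 dependence

# Tests whether the gene series improves prediction of the metabolite series
# beyond the metabolite's own history, for lags 1..4. Column order: [effect, cause].
result = grangercausalitytests(np.column_stack([metabolite, gene]), maxlag=4)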

    2. Abstract
      Background: With advancements in sequencing and mass spectrometry technologies, multi-omics data can now be easily acquired for understanding complex biological systems. Nevertheless, substantial challenges remain in determining the association between gene-metabolite pairs due to the non-linear and multifactorial interactions within cellular networks. The complexity arises from the interplay of multiple genes and metabolites, often involving feedback loops and time-dependent regulatory mechanisms that are not easily captured by traditional analysis methods.
      Findings: Here, we introduce Compounds And Transcripts Bridge (abbreviated as CAT Bridge, available at https://catbridge.work), a free user-friendly platform for longitudinal multi-omics analysis to efficiently identify transcripts associated with metabolites using time-series omics data. To evaluate the association of gene-metabolite pairs, CAT Bridge is a pioneering work benchmarking a set of statistical methods spanning causality estimation and correlation coefficient calculation for multi-omics analysis. Additionally, CAT Bridge features an artificial intelligence (AI) agent to assist users interpreting the association results.

      Reviewer 1: Tara Eicher

      Reviewer Comments: The authors introduce a useful tool (CAT Bridge) for integrating multiple causal and correlative analyses for multi-omics integration, which also includes a visualization and LLM component. The authors further provide two case studies (human and plant) illustrating the utility of CAT Bridge. I believe that this work should be published, as it contributes to the field of multi-omics analysis.
      However, I am very concerned about the lack of description regarding the LLM. As explained by Mittelstadt et al. (https://www.nature.com/articles/s41562-023-01744-0), LLMs do not always provide factual answers. The authors need to justify the use of the LLM to determine the relevance of a gene-metabolite association. In particular, the authors should add to the main text (or at least the supplementary) a detailed description of the prompt construction and should justify why this prompt is expected to result in factual information. Furthermore, the authors should discuss the caveats of using LLMs in this context, starting with the linked article above. I believe that the manuscript will only be publishable once this concern is addressed.
      In addition, the authors are recommended to address the following more minor concerns:
      Implementation:
      1. Your "example file" links at https://catbridge.work are broken. Please fix this.
      Abstract:
      1. Line 32: "Nevertheless, substantial challenges remain in determining the association between gene-metabolite pairs due to the complexity of cellular networks." This is not a clear statement. What about the complexity of cellular networks presents challenges in determining the associations?
      2. Make sure you are using present tense consistently, not past tense (Line 39).
      3. Please use the scientific name with the common name in parentheses as follows: Capsicum chinense (chili pepper). Use only the scientific name throughout the rest of the document (Line 41).
      Background:
      1. Line 56: "Background" should not be plural.
      2. Lines 59-60: More comprehensive than what? Please elaborate here.
      3. In Line 60, please include and familiarize yourself with the following reference: Eicher, T., G. Kinnebrew, A. Patt, K. Spencer, K. Ying, Q. Ma, R. Machiraju and E. A. Mathé (2020). "Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources." Metabolites 10: 202.
      4. Lines 67-68: Citation needed.
      5. Line 72: Please use the scientific name with the common name in parentheses.
      6. Lines 74-77: Citations needed.
      7. Lines 77-78: Give an example of biologically naïve conclusions from purely data-driven strategies.
      8. Line 78: Discuss how the machine learning models could address the drawbacks of the correlation models and vice versa.
      Materials and Methods:
      1. It seems that CAT Bridge needs to be run on one metabolite at a time. In this case, I would not use the term "gene-metabolite pair association" in Line 104, but rather "associations between genes and the target metabolite".
      2. Line 115: Clearly state which of these methods are non-linear and which address the lag issue.
      3. Line 136: Your figures are out of order (Figure 1B comes after Figure 2B).
      4. Please take a look at the Minimum Standards Reporting Checklist (https://academic.oup.com/gigascience/pages/Minimum_Standards_of_Reporting_Checklist). In particular:
      a. In the section starting at Line 153, list the number of seedlings used.
      b. Were all timepoints collected from all seedlings? List the total number of samples.
      c. How many mg were collected per sample (a range can be used here)?
      d. 3 biological replicates per seedling? Give more detail here.
      e. What machine was used for the ultrasonic process? If frequency settings are permitted by the machine, list the settings used.
      f. How many of the 28 younger and 54 older adults had both transcriptome and metabolome data?
      5. Line 209: "Younger" and "older" are better terms.
      Results:
      1. Line 248: How does the AI agent analyze the functional annotations?
      2. Lines 281-282: "This illustrates the advantage of causal relationship modeling methods over traditional methods".
      3. Line 290: Please also include the updated IntLIM paper (IntLIM 2.0): Eicher, T., K. D. Spencer, J. K. Siddiqui, R. Machiraju and E. A. Mathe (2023). "IntLIM 2.0: identifying multi-omic relationships dependent on discrete or continuous phenotypic measurements." Bioinformatics Advances 3(1): vbad009.
      4. Make sure the colors are consistent in Table 1.
      5. Line 156: The scientific name of the pepper species is inconsistent with other areas of the text.
      Figures:
      1. S1 should be provided as a table, not a figure.
      2. Please make S2 larger. It is difficult to read.
      3. S3 needs labels (x axis, y axis, legend).

  12. Nov 2024
    1. Abstract
      Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
      Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.
      Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism's ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
      Key Points:
      • We introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications.
      • We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites.
      • We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
      Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 3. Fabio Cortesi

      In their manuscript, Frazer et al. developed a machine-learning approach to predict the spectral sensitivity of a visual pigment based on the gene/amino acid sequence of the opsin protein. First, they created a visual opsin database based on heterologously expressed genes from the literature. They then used deepBreaks, an ML tool developed to explore genotype-phenotype associations, to run several different models and test how well ML could predict spectral sensitivity. Their main findings are that the larger the dataset for training and the more diverse (both in opsin sequences themselves and phylogenetic breadth they were derived from) the dataset, the better the predictions will become. However, there is a plateau for the number of training sequences that should be used as a minimum (~ 200), with a slight gain afterwards. As such, the suggested ML approach works well for larger datasets but needs refining for smaller datasets. There are also several drawbacks to the approach that need to be carefully considered when interpreting the results, including the fact that ML cannot accurately predict the effect on phenotype if confronted with a new mutation or a new combination of mutations not used during training.

      I found the study to be well-written and easy to follow. The results support the conclusions, and as far as I can tell, the ML and associated analysis were performed accurately. All the code and the database are readily accessible, too. It is great to see that we are at a point now where computational power has reached a level that can be used to predict gene-phenotype relationships accurately. The use of ML to study the function of (visual) opsins, i.e., spectral sensitivity, especially if additional parameters can be included, will undoubtedly be of great help to the field and welcomed by the community. As such, I have no major concerns and only a few minor comments I recommend addressing before publication.

      Minor comments

      Introduction - Please add a sentence to explain that a visual pigment consists of an opsin protein bound to a chromophore/retinal and that the two units together lead to the 'spectral sensitivity' phenotype. You cover it in the discussion, but it would be helpful for the reader to have this information upfront.

      • Please provide a reference for the following statement: '[…], and purification of heterologously expressed opsins followed by spectrophotometry [REF]'.

      • You say, 'Despite opsins being a well-studied system with an extensive backlog of published literature, previous authors expressed doubts that sequence data alone can provide reliable computational predictions of λmax phenotypes [37-40]'.

      I agree that the spectral sensitivity predictions from sequences have been criticised in the past as they were sometimes oversimplified (including some of our work). However, spectral sensitivity predictions based on computational modelling, albeit not using ML, have previously been attempted successfully several times, e.g., by Jagdish Suresh Patel and colleagues, and should be mentioned here.

      • You say that: 'The extensive data on animal opsin genotype-phenotype associations remains disorganized, decentralized, often in non-computer readable formats in older literature, and under-analyzed computationally'.

      Again, I agree that the opsin data can profit from a centralised databank like the one you created. However, there have been several previous attempts at summarizing opsin data in recent years (although not specific for heterologously expressed opsins), for vertebrates at least. For example, work by Schweikert and colleagues on fish visual opsins and recent work on frog opsins by Schott et al. These studies should be mentioned and cited appropriately here, as tremendous work went into collating the datasets in the first place.

      Results

      • The use of MWS opsin is somewhat confusing. I presume this refers to vertebrate lws genes that are mid-wavelength shifted? Why have these as a separate group? Ancestrally, there are five sub-families of visual opsin genes in vertebrates: sws1 & sws2 (SWS), rh1, rh2 and lws (MWS & LWS). The MWS range in Figure 1 should be part of a larger lws-derived grouping.

      • This part reads like a discussion. It also needs a reference for the age of T1 opsins: 'The similar levels of performances between T1 and invertebrate models were unexpected, especially considering it has a training dataset five times larger than the invertebrate model. One possible explanation is that the very old age of T1 opsins [REF] might have led to a higher complexity of genotype-phenotype associations that are not yet well sampled enough to allow good predictions.'

      • These two sentences could also be weaved into the discussion rather than the results section: 'These equations do not account directly for taxonomic, genetic, or phenotypic diversity, as the number of genes is on the x-axis. Therefore, one should be cautious about applying them to predict model performance based on training data size alone.'

      • Table 1: What do MAPE and RMSE stand for, and what do those numbers mean? Maybe also include a short explanation of the acronyms and their meaning in the main body of the text.

      • This should also be mentioned in the discussion: 'Until the models are trained with more invertebrate (r-opsin) data, we do not put high confidence in the estimates of λmax.'

      • Figure 2 legend: Third line, why 'Mutant predictions …'? Aren't the predictions for all sequences?

      • Figure 3 legend: It says 547 mutant sequences here and 546 sequences in Table 1.

      • Provide a reference for the following sentence: 'The WT SWS/UVS model similarly highlighted p113, a site functionally characterized as the counterion in the retinal-opsin Schiff base interaction for all vertebrate opsins.'

      • Figure 4 legend: Please provide references for the following sentence: 'Positions 181, 261 and 308 are highlighted because they are among the highest scoring sites and have all been previously characterized as functionally important to opsin phenotype and function.'

      Discussion

      • Please simplify and do not overstate the first sentence. I suggest: 'To better understand methods to connect genes and their functions, we initiated VPOD, a database of opsin genes and corresponding spectral sensitivity phenotypes.'

      • Section: The important relationship between data availability and predictive power.

      You mention that ML could not accurately predict spectral sensitivity if mutant genes were excluded, especially if smaller datasets are used. This was to be expected since ML is not per se 'smart' but learns from patterns in the underlying dataset. However, it is a significant drawback of the approach, and I encourage you to state this more clearly. My main concern is that future users will take the ML predictions as absolute truth instead of verifying the predictions experimentally.

      • Provide a reference for the following sentence: 'One consequence of leaf-based tree construction is that due to its faster convergence/training time, it can be more prone to overfitting, as it constructs trees on a 'best-first basis' with a fixed number of n-terminal nodes.'

      • You should include some information regarding the assumptions in the Introduction and the Methods section. For example, information about what chromophore interaction was modelled should be in the methods, and the basic information about how visual pigments are formed and what different chromophore types are being used by which species should be in the Introduction: 'We also assume the photopigment uses 11-cis-retinal, as all heterologously expressed opsins in VPOD were reconstituted using this chromophore. However, this assumption is violated in some organisms because they use 13-cis-retinal as the in-vivo chromophore [71-73], which is associated with a red-shift in λmax [32,71].'

      Conclusion

      • I recommend being more cautious about the predictive power for epistatic effects since you tested it only on three samples and the predictions were severely restricted by the training dataset containing the single mutant samples.
    2. Abstract
      Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
      Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.
      Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism's ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
      Key Points:
      • We introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications.
      • We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites.
      • We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.

      Reviewer 2. Nikolai Hecker

      The authors compiled a collection of opsin/rhodopsin proteins and their associated light absorption properties from the literature, using the measured wavelength of maximum absorbance as a proxy. In addition, they include multiple sequence alignments (MSA) of the proteins, including subsets of vertebrate and invertebrate sequences. The data is provided as tab-separated, comma-separated, and FASTA files. This is a valuable resource for studying opsins and vision-related phenotypes. The authors then use gradient boosting, random forests, and Bayesian ridge regression to predict the wavelength of maximum absorbance from the protein sequence MSAs. Furthermore, they investigate whether their models can be used to identify amino acid changes that impact the wavelength of maximum absorbance and epistasis. This is based on a small set of opsin mutants that have been reported in the literature. The manuscript is well structured and written. I have some concerns regarding the analysis, description and presentation of the data.

      1. A traditional cross-validation by random sampling can be inadequate for phylogenetically related sequences. If closely related species are part of the data, training and test sets may contain nearly identical sequences. Excluding entire lineages instead of random sequences during training would circumvent this issue (a minimal sketch of such a scheme follows after these comments).

      2. Based on Fig. 3, Fig. 2, and p. 6, the models do not generalize well, given that they only predict well those mutants which exhibit a wavelength of maximum absorbance similar to the wild type's. Based on the plots (Fig. 2 and Fig. 3), it does not look to me like the model trained on mutants+WT performs substantially better than the WT model for mutants with large wavelength shifts. This would be in contrast to p. 16: "Particularly illustrative of these ideas are our analyses with and without experimentally mutated opsins". The authors should either show statistics regarding the improved performance for mutants with large shifts or change the corresponding parts.

      3. The data set description should be more detailed in parts. It should be shown how the opsins/rhodopsins classes (UVS, SWS, MWS, Rhodopsins, LWS) are distributed across the vertebrate and invertebrate phylogeny, for example by a phylogenetic tree and their number per species. Are the mutated opsins/rhodopsins derived from a small set of species or do they reflect most of the vertebrate phylogeny?

      4. How importance scores are estimated for the different models should be explained.

      5. The "ML often predicts the effects of epistatic mutations" section needs some clarifications. Why were only three sequences investigated? Do none of the other double mutants show epistasis when compared with the corresponding single mutations? In this paragraph, it is not always clear whether wavelengths and additive wavelengths are obtained from predictions or actual measurements.

      6. The description in the git repository (https://github.com/VisualPhysiologyDB/visual-physiology-opsin-db) is very sparse. The content of the different files and how they relate to each other should at least be briefly explained in a README. It would also be helpful to add gene names and the source of each sequence to the meta files.
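
As referenced in point 1 above, here is a minimal sketch of a lineage-holdout cross-validation with scikit-learn's GroupKFold, using random placeholder features, targets, and lineage labels rather than the VPOD data or deepBreaks code:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))            # e.g., encoded MSA columns (placeholder)
y = rng.normal(500, 40, size=300)         # λmax values in nm (placeholder)
lineages = rng.integers(0, 10, size=300)  # one id per clade/lineage (placeholder)

# Entire lineages are held out together, so close relatives never appear
# in both the training and the test fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=lineages):
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    print(f"held-out-lineage R^2: {model.score(X[test_idx], y[test_idx]):.2f}")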

      Minor comments

      1. For Fig. 4A, since MSAs are already computed, it would be interesting to indicate the amino acid conservation per position. Are important amino acids correlated with sequence conservation?

      2. In Tab. 1, R2 is used to compare different models which are based on different subsets, and also on potentially differently sized MSAs. An adjusted R2 might be more suitable to account for different numbers of features (a one-line sketch follows after these comments).

      3. It would be helpful to add a Docker image to the github repository to make it easier to use.
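
For reference on minor comment 2, the adjusted R^2 penalizes the plain R^2 by the number of features p relative to the sample size n; a one-line sketch:

def adjusted_r2(r2: float, n: int, p: int) -> float:
    # R^2 corrected for model size: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g., adjusted_r2(0.90, n=200, p=50) ≈ 0.866, versus 0.90 unadjusted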

      Re-review: The authors have addressed the majority of my concerns and improved the manuscript. However, there are still some remaining issues that the authors should address before I would recommend the publication of the manuscript.

      1. Dependence between data points is not a novel problem for data analysis and machine learning in a broad range of subjects. While I appreciate that the authors added a paragraph discussing the issue of phylogenetic relatedness, the setup of the cross-validation and how the data is presented make it difficult to assess to what extent their models over-fit to the data. Referring to their previous reply, lineage-/group-based cross-validation should not be arbitrarily chosen but based on the structure of the data. This is not a trivial problem and there is no magic solution, I agree. The authors should at least incorporate references to literature discussing the problem and potential solutions for dealing with phylogenetic relatedness at p. 11 "While these performance metrics are impressive, it is important to remember that phylogenetic relatedness between sequences..." or in the discussion. For example, Roberts et al. provide a nice overview of cross-validation strategies in various settings, including phylogenetic data (they call it "block cross-validation"):

      Roberts et al. (2017). Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913-929.

      2. Regarding the comparison between models trained on wild type (WT) and WT + mutant data (WDS), I find the comparison rather difficult to follow concerning repeatedly leaving out 25 mutants. For comparing the models, the test set (or test sets) should be the same. This would mean assessing the predictions for the same 25 left-out mutants with both the WT and the WDS model (for each set of 25 left-out mutants). If this was done already, I would recommend rephrasing the corresponding part in the methods and results to improve the clarity. In addition, a visualization, for instance a violin plot of the WT model RMSEs vs. a violin plot of the WDS model RMSEs, would be useful for the readers.

      3. I would still recommend adding a brief summary of how feature importance scores are computed, so the reader does not have to look up another manuscript. This does not have to be detailed. As I understand it, the feature importance is just the normalized number of feature occurrences or the Gini importance for gradient boosting/random forests, or the coefficient for regression models.
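
To illustrate the summary requested in point 3, a short sketch contrasting the two importance notions mentioned, on random placeholder data rather than the models or data of the manuscript:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(2)
X, y = rng.normal(size=(200, 30)), rng.normal(size=200)

# Impurity-based (Gini) importances from a tree ensemble: normalized, sum to 1.
tree_importance = RandomForestRegressor(n_estimators=100).fit(X, y).feature_importances_

# For a linear model, the coefficient magnitude per (standardized) feature.
linear_importance = np.abs(BayesianRidge().fit(X, y).coef_)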

      Minor details:

      Fig. S10: the text at the leaves is not readable. It could be replaced, for instance, with the name of the gene family if that makes sense, or removed.

      Fig. 4A: the bars at positions 181, 261, and 308 could be indicated, for example, in red or another color, to make it easier to compare A and B.

    3. Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
      Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.
      Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism's ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae073). The peer-reviews are as follows.

      Reviewer 1. Robert Lucas.

      Frazer and colleagues set out to assess the ability of machine learning (ML) approaches to predict spectral sensitivity (λmax) of animal opsins from their amino acid sequence. To this end they first develop a database of phenotyped opsins (opsin sequences with known λmax), which they term Vpod1. They then explore how various factors of the ML process impact its ability to predict λmax. These include the nature of the input training dataset (size, phylogenetic and gene family diversity, inclusion of data from mutagenesis experiments) and the ML method. For comparison they include a phylogenetic imputation approach that predicts λmax based upon overall sequence similarity. They test the validity of their approach according to their ML pipeline's ability to predict: λmax for the training dataset; the outcome of mutagenesis; λmax for a test dataset extracted from the training dataset; known epistatic interactions; and established spectral tuning sites. In all cases, they report various degrees of success and conclude that the ML approach can be used to predict λmax (almost as well as phylogenetic imputation but with reduced computational cost) provided that the training dataset is sufficiently rich (it performs poorly for invertebrate opsins for which data are limited) and, ideally, benefits from mutagenesis datasets.

      I am no expert in machine learning and will leave others to comment on that aspect of methodology, but in general this study represents an interesting addition to the literature. The idea of predicting λmax from amino acid sequence is not new, e.g., as the authors acknowledge, the '5 sites rule' for cone pigments is long established. Applying ML holds the promise of a more efficient process for achieving similar predictability for other branches of the animal opsin family. In that regard, the inherent limitation in the ML approach is highlighted - it is particularly valuable in branches of the family for which information is sparse (invertebrate opsins), but performs poorly in those branches without more starting information about structure:function relationships (which itself replaces the need for ML to some extent). Nonetheless, it certainly has the potential to be a valuable tool and this paper represents a sound exploration of its characteristics, and one important feature of the paper is that it confirms that ML can allow fairly good predictions based solely on data from wildtype opsin sequences.

      I have relatively few suggestions for improvement. The most important is that the authors appear to have omitted one technology for the process of defining λmax (introduction, methods and discussion). We and others have used heterologous action spectroscopy to describe λmax for a growing number of animal opsins. In this technique spectral sensitivity is defined using live cell assays of light response for opsins expressed in immortalised cell lines. Those data could be included in the Vpod1 dataset. It would also be appropriate to mention the approach as a tool for populating the training dataset, as it has the advantage of being applicable to opsins that don't reliably form pigments in vitro (e.g. many invertebrate opsins) and does not rely on access to the animal itself but only to its genome sequence. The authors also may wish to relate Vpod1 to another recently published database of animal spectral sensitivities, albeit collected for a different purpose (https://doi.org/10.1016/j.baae.2023.09.002).

      Some minor points. The authors note with surprise that ML performed poorly for the rod opsin dataset. Could this be because their metric (R2) is sensitive to the degree of variability in λmax in the training dataset, which is constrained in rod opsins? I found the pastel colours in Figs. 2 and 3 hard to discern; more separation on the colour palette would be appreciated.

    1. Abstract: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy. CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. First it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae079). The peer-reviews are as follows.

      Reviewer 1. Xiaoquan Su

      In this work, the authors proposed CoCoPyE to evaluate the quality of genomes reconstructed from metagenomes using a two-stage approach. In general, this work is valuable for research in this field, and some issues should be addressed before further consideration for publication.

      1. In section 2.1, how were the thresholds of 60% and 30% determined?

      2. In section 2.1.6, there are two different prediction methods, linear and non-linear. In practice, how should a user choose the proper one?

      3. For the simulation, I also suggest running some simulations for specific habitat metagenomes, e.g., human-associated habitats (gut, oral, etc.) or natural environments (soil, marine).

      4. For the online demo at https://cocopye.uni-goettingen.de/, a demo FASTA input file would be useful for quick startup.

      Re-review: All my previous comments have been addressed and I have no more questions.

    2. The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy. CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. First it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 2. Robert Finn

      The paper by Birth et al. describes CoCoPyE, a two-stage pipeline for the estimation of completeness and contamination of prokaryotic genomes, especially for the assessment of metagenome-assembled genomes. The paper was well written and clearly outlined the aims of the software, the approach, and the need for a two-stage process. I also appreciate the candid nature of the discussion that CoCoPyE should be considered as complementary to CheckM2. The performance in terms of time is a notable consideration for why this tool should be considered by the field, and the benchmarks of completeness and contamination are encouraging. The main drawback of the tool is the need for a close reference genome for the second-stage quality estimation, which will limit its use for environmental metagenomics.

      Major comments

      While I appreciate the benefits in terms of speed offered by UProC, there are a number of questions that are not adequately addressed in the manuscript. The first is why the version of Pfam is so out of date, with versions 24 and 28 being used in the feature classification. The authors remarked on the improvement between Pfam 24 and 28. Pfam is now on version 36, with a release produced about once a year. During this time, the Pfam entries have been expanded in number, increased in sequence diversity, and optimised in terms of boundaries. This is particularly pertinent now, with the use of AlphaFold models improving domain boundaries. Secondly, Pfam models have per-model thresholds, but there was no discussion of the thresholds used. Finally, Pfam clans were introduced in Pfam version 18.0 as a way of modelling diverse families with multiple profile HMMs. While many of these families are unlikely to represent single-copy marker genes, there is still the case that two families belonging to the same clan could be measured as a dissimilarity when they actually represent the same protein family. This is particularly important in the marker-based estimates and count histogram ratio.
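      To make the reviewer's clan point concrete: a minimal sketch, assuming a `clan_of` mapping built from Pfam's clan membership file (the input names and values are hypothetical), of collapsing per-family hit counts to clan level before any dissimilarity is computed.

      ```python
      from collections import Counter

      def collapse_to_clans(family_counts, clan_of):
          """Aggregate per-family Pfam hit counts to clan level so that two
          families of the same clan (divergent HMMs for one protein family)
          are not scored as a dissimilarity. Families without a clan keep
          their own accession as the key.

          family_counts: e.g. {"PF00005": 3, ...} (hypothetical values)
          clan_of:       e.g. {"PF00005": "CL0023", ...} from Pfam's clan file
          """
          clan_counts = Counter()
          for family, count in family_counts.items():
              clan_counts[clan_of.get(family, family)] += count
          return clan_counts
      ```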

      It would also be beneficial for the reader to see the results from genomes simulated with a fragmentation profile that more closely represents that of MAGs, where there may be a few long contigs in the 100 kbp range, quickly tailing off to contigs in the 1000s bp range. Also, the authors should try to estimate the amount of blind contamination, i.e. contigs that carry no single-copy marker genes. This is an important metric that is typically overlooked by current tools, and it particularly applies to those MAGs that fail to be passed on to the second phase of contamination estimation.
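      A minimal sketch of the kind of MAG-like fragmentation profile the reviewer asks for: a few contigs around 100 kbp with a long tail of kbp-scale fragments. The lognormal parameters are illustrative assumptions, not fitted to real MAG assemblies.

      ```python
      import random

      def simulate_mag_fragmentation(genome_len, mu_log=9.0, sigma=1.2, seed=1):
          """Cut a genome into contigs with a heavy-tailed lognormal length
          distribution: a few contigs around 100 kbp, quickly tailing off to
          fragments in the 1000s bp range. Parameters are illustrative.
          """
          rng = random.Random(seed)
          contigs, remaining = [], genome_len
          while remaining > 0:
              length = min(int(rng.lognormvariate(mu_log, sigma)) + 500, remaining)
              contigs.append(length)
              remaining -= length
          return sorted(contigs, reverse=True)

      print(simulate_mag_fragmentation(4_000_000)[:5])  # longest simulated contigs
      ```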

      The second stage of CoCoPyE should also be benchmarked against tools such as GUNC, which similarly uses features from reference genomes to estimate completeness and contamination. This would help guide the reader to understand whether running CheckM2 with GUNC or with CoCoPyE would be advantageous.

      Minor comments: In the introduction, the authors omit the part of the MIMAG standard requiring that tRNAs and SSU/LSU rRNAs also be present for a genome to be called high quality, rather than relying simply on completeness and contamination.

      In the "Reference database" section it would be informative to know the number of Pfam entries (and their accessions) that are considered single copy marker genes. Also, the best concept of completeness is having a closed, circular genome in RefSeq.

      In the construction of the test data it would be useful to provide a measure of taxonomic distance between the genomes in the training dataset and the test dataset. While this is difficult, even a basic metric such as average branch length to the nearest neighbour, or the number of steps to the nearest neighbour in the GTDB taxonomic tree, would be more informative than simply requiring that genomes do not share the same taxID.

      How sensitive is the second stage to completeness? Conceivably, the use of MAGs to enrich the sequence space could improve the second stage, if strict completeness and contamination rules were applied?

      Re-review: I appreciate the authors' effort in trying to update the version of the Pfam database. It is disappointing that new versions of the resource are not being considered owing to a technical problem in the UProC implementation. While it is highlighted that the older versions of Pfam provide computational advantages, there are many potential solutions that could be found to overcome this, and the memory requirements are not vastly different from those of CheckM1. While I understand the difference between HMMER and UProC, it is as much the improvements in domain boundaries and the increased coverage of sequence space in the more recent versions that I would expect to improve the performance. The advent of AlphaFold has resulted in a large number of improvements to the Pfam boundaries. As the authors have a fix, it is slightly strange that they have not included the results in the response to the reviewers' comments. The fix may be limited in scope and not officially released, but it would be more convincing to show the results of Pfam v36 against v24/28, thus allowing an informed judgement. I look forward to the release of an updated version of UProC in the near future, as promised by the authors.

    1. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 3. Jose Fernandez Navarro

      The authors present a novel computational method to integrate SRT datasets, claiming that the method adjusts for batch effects while retaining the biological differences. The method provides the possibility of adjusting the gene expression counts for use in downstream analysis. The method was benchmarked against other methods available for integration of single-cell and spatial transcriptomics datasets, obtaining positive results. The manuscript is well structured and clear, provides a robust motivation, and the comparisons with other methods are clear and well defined. The method has the potential to make a contribution to the field, especially considering that it has been developed to be compatible with scanpy and that an open-source library has been made available on GitHub.

      Introduction:
      - In the following sentence: "batch effects caused by nonbiological factors such as technology differences and different experimental batches." the authors could have elaborated more and perhaps included some references.
      - In the following sentence: "In contrast, popular MNN-based methods such as Seurat v3[16] efficiently address batch effects in gene expression, but their limitation lies in the ability to align only two batches at a time, and they become impractical when dealing with many batches" I do not think the MNN-based term is correct in that context. Also, I do not entirely agree with the claim: one generally does not have many batches to correct for, and the referred methods can perform batch correction in datasets with more than 2 batches.
      - In the following statement: "However, PRECAST only returns the corrected embedding space, and GraphST requires registering the spatial coordinates of samples first to ensure its integration performance; thus, their applications are limited in certain scenarios." I am not in total agreement; I understand PRECAST provides a module to obtain corrected gene expression counts for downstream analysis.

      Results:
      - I find the introduction to spatiAlign a bit long. It could perhaps be simplified, leaving the implementation details to the Methods section.
      - In the following sentence: "..spatial neighbouring graphs between cells/spots (e.g., cell‒cell adjacent matrix A), where the connective relationships of cells/spots are negatively associated with Euclidean distance." I find it a bit misleading: are the authors building the spatial graph using a fixed radius, or Euclidean distances in a manifold? (See the sketch after this review.)
      - I could not find a detailed description of how the different datasets were processed with the other methods used in the benchmark.
      - I believe that, to measure the power of the methods to retain biological differences, comparing consecutive sections of the same tissue is not enough. I would also include a comparison using sections from different individuals (same region).
      - In the MOB datasets comparison, judging by the UMAP figures, the difference in performance is not so clear in the cases of SCALED and BBKNN.
      - In the Hippocampus dataset, I did not see information on how the clusters were annotated. It would have been nice to include the ABA figures of the same region. I found it difficult to understand the basis and interpretation of the spatial autocorrelation analysis with Moran's I.
      - In the MOB embryo dataset, did the authors consider including a comparison with the other methods?

      Figures:
      - I observed that some of the supplementary figures are out of order or the labels do not match the panels; I encourage the authors to revise this. I also noticed that some of the panels showing expression plots lack a bar with the range of expression. The labels in some of the panels are hard to read, and some labels are missing (e.g., the section/dataset in some of the panels).
      - Some figures make reference to the ABA and/or the tissue morphology. For these, I suggest including the HE images and/or IF images from the ABA.
      - Figure 2a-c: the fonts are hard to read. Figure 2d is hard to read; perhaps the layout would be better with one column per method. Figure 3g would be easier to read if the 3 datasets were arranged side by side. Figure S4: I find the clusters hard to see clearly.
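      Regarding the fixed-radius question above, a minimal sketch of the fixed-radius reading of the spatiAlign description (assuming coordinates in the same units as the radius; this is an illustration, not the authors' implementation):

      ```python
      import numpy as np
      from scipy.spatial import cKDTree

      def spatial_adjacency(coords, radius=50.0):
          """Connect spots within `radius` (same units as coords) and weight
          edges with a simple linear decay, so connectivity falls with
          Euclidean distance. Illustrative only.
          """
          tree = cKDTree(coords)
          pairs = tree.query_pairs(r=radius, output_type="ndarray")
          n = len(coords)
          A = np.zeros((n, n))
          if len(pairs):
              d = np.linalg.norm(coords[pairs[:, 0]] - coords[pairs[:, 1]], axis=1)
              w = 1.0 - d / radius               # weight in (0, 1], 1 = co-located
              A[pairs[:, 0], pairs[:, 1]] = w
              A[pairs[:, 1], pairs[:, 0]] = w
          return A
      ```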

      Datasets and documentation: The authors provide links to the original datasets but they do not provide access to the processed and annotated datasets, this makes it difficult to replicate the results and the examples provided in the documentation. The manuscript would benefit if the authors would provide better documentation and means to reproduce/replicate the analyses.

      Software: I was able to install the package from PyPI in a Conda environment, but I had to manually install some dependencies to make it work.

      Major comments:
      - I would like to suggest the authors revise the figures. The supplementary figure descriptions do not seem to match the content of the figures. Some of the figures are missing labels and color bars.
      - I would like to suggest the authors correct grammar and spelling errors and perform a thorough proofreading of the manuscript for consistency.
      - I would like the authors to provide links to access the processed/annotated datasets.
      - I would like the authors to provide more details on how the datasets were processed with their method and the other methods (hyperparameters, versions, etc.). This could be complemented greatly if the authors could provide notebooks or step-by-step documentation.
      - I would like to suggest the authors include a comparison with true biological differences such as different phenotypes and/or genotypes.
      - I would like to suggest the authors include some of the other methods in the MOB (Stereo-seq) comparison.
      - I would like to suggest the authors check their claim that PRECAST does not provide "corrected" gene counts and that the other methods do not provide means to perform downstream analyses (DEG, trajectory inference, etc.).
      - I would like to suggest the authors include normalized counts as well as raw counts in some of the comparisons (for example, when performing the trajectory analysis or showing the spatial distribution of features).

      Minor comments:
      - I would like to suggest the authors not use the term "expression enhancement"; to me the gene expression is corrected or adjusted, but not enhanced.
      - I would like to suggest the authors improve the documentation of the open-source package to provide more information on the different arguments and options. It would also be nice to provide documentation and/or notebooks to reproduce the analyses (or some of them) presented in the manuscript.
      - I would like to suggest the authors improve the installation of the PyPI package, since some dependencies seem to be missing.
      - I would like to suggest the authors improve the layouts and font sizes of some of the figures for clarity and readability.

      Re-review: I acknowledge the efforts made by the authors to address the comments and provide answers. However, I still find the manuscript not ready for publication. These are my comments:

      Major:
      - The authors have included a new analysis (Sup. Figure 7) using a dataset (tumor liver) that lacks a stereotypical structure. While this is a good addition to the manuscript, I would still like to see the performance of spatiAlign in correcting technical effects while retaining true biological differences (e.g., disease and control). In addition to this, a comparison using an imaging-based technology (e.g., MERFISH or CosMx) would make the manuscript stronger.
      - The authors have made an effort to provide Jupyter notebooks with code to reproduce the analyses. Unfortunately, this is incomplete. None of the notebooks contain code to reproduce the spatiAlign analyses, and only the notebook with the tumor liver dataset (Sup. Figure 7) includes the processing steps. For the other datasets the authors use hard-coded values. Moreover, I was unable to run some of the notebooks due to errors and missing files and/or dependencies. The authors should provide one notebook per dataset, including the processing and analysis, and provide means to run the notebooks (environment files and/or Docker files) in an easy way that enables reproducibility. Ideally, these notebooks should also include the spatiAlign analysis.
      - I observed a strange effect in Figure 2, where the UMAP manifolds of BBKNN, Harmony, and ComBat are similar. I could identify the error causing this in one of the notebooks. I strongly suggest the authors revise all the analyses and figures and provide notebooks to reproduce them in an easy way, as I mentioned before.
      - I find the MNN performance surprisingly bad. I wonder if this could be due to how the data was processed with this method. Did the authors try disabling cosine normalization for the output?

      Minor:
      - I think the manuscript would be stronger if the authors included the normalized counts in the figures where they show the raw counts.
      - I still find inconsistencies in the text (typos, grammatical and syntactical errors). The authors are still using the term "enhanced" (especially in figure legends).
      - In the MOB dataset, the authors claim that the Visium spots are 100 µm, but that cannot be true; Visium spots are 50 µm.
      - In Figure 3 (panel f), use the same layout as Figure 2 for consistency.
      - In Figure 4 (panel g), the color bar and labels are missing.
      - In Sup. Figure 3 (panel c), the color bar is out of place and the legend is missing.

      Re-review: The authors have made a great effort to improve the manuscript. The improvements to the documentation and open-source package will be appreciated by the community. I only have minor comments:
      - The grammar has improved, but I could still see some errors (to cite a few): line 96 "dimensional reduction"; line 346 "structure and MERFISH".
      - I still think that the authors have not been able to fully demonstrate the performance of their method in integrating datasets with true biological/phenotypical differences (e.g., disease and healthy). Supplementary Figures 7 and 8 add value to the manuscript by integrating tumor cells from different patients, but this is not exactly what reviewer 1 and I suggested. I acknowledge the explanations that the authors provide in their response, but I am not in total agreement with the statements. There are publicly available datasets that could suit this analysis. I will not request that such an analysis be added to the manuscript, but I would at least suggest mentioning this in the manuscript as a limitation or future work.

    1. With the emergence of Spatial Transcriptomics (ST) technology, a powerful algorithmic framework to quantitatively evaluate the active cell-cell interactions in the bio-function associated iTME unit will pave the way to understanding the mechanisms underlying tumor biology. This study provides StereoSiTE, incorporating open source bioinformatics tools with the self-developed algorithm, SCII, to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions, and to accurately infer the functional cell-cell communications with quantitatively defined interaction intensity in ST data. We applied StereoSiTE to deeply decode ST data of the xenograft models receiving immunoagonist. Results demonstrated that the neutrophil-dominated CN5 might contribute to iTME remodeling after treatment. Of note, SCII analyzed the spatially resolved interaction intensity, inferring a neutrophil-led communication network that was shown to actively function by analysis of Transcription Factor Regulons and Protein-Protein Interactions. Altogether, StereoSiTE is a promising framework for ST data to spatially reveal tumor biology mechanisms. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 3. Rosalba Giugno

      The authors introduce StereoSiTE, which integrates open-source bioinformatics tools with the self-developed algorithm SCII. The aim is to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions and to accurately infer functional cell-cell communications with quantitatively defined interaction intensity in ST data.

      The paper's objective is commendable, and the overall organization of the content, along with the obtained results, holds great promise. Nevertheless, certain aspects need to be addressed. The proposed approach's novelty is significantly anchored in the SCII software. However, the paper has notable drawbacks. It falls short in providing a theoretical and scientific comparison with other similar tools. Moreover, the comparison includes systems that do not incorporate spatial considerations, posing a limitation in assessing the method's uniqueness in a broader context.

      Give more details on which systems you are referring to here: "To improve accuracy, we recommended using spatially resolved data at single cell resolution". Please provide your insights on the rationale for employing or abstaining from downstream analysis to comprehend the spatial distribution of gene expression in tissue, as in https://doi.org/10.1093/gigascience/giac075 and https://doi.org/10.1038/s41467-023-36796-3. Additionally, consider discussing how this is associated with the prediction and validation of the functional enrichment, or with the step of clustering bins into different cellular neighborhoods based on their cellular composition.

      Re-review: The authors have solved my issues.

    2. Reviewer 2. Chenfei Wang

      In this manuscript, Xin et al. provided a framework called StereoSiTE that incorporates established methodologies with their own algorithm to define a cellular neighborhood (CN) organized immune tumor microenvironment (iTME) based on cellular compositions, and to dissect the spatial cell interaction intensity (SCII) in spatial transcriptomics (ST). StereoSiTE has the following improvements compared to existing methods. First, SCII detects cell-cell communication using both the cell spatial nearest-neighbor graph and targeted L-R expression. Second, SCII takes the interaction distance into account for different interaction classifications such as secreted signaling, ECM-receptor, and cell-cell contact. Finally, StereoSiTE avoids detecting false-positive interactions that fall beyond the reachable interaction distance.

      Although the authors performed comprehensive work to demonstrate the potential applications of StereoSiTE, this reviewer has strong concerns about the novelty and effectiveness of StereoSiTE over existing methods. The CN results were not mindful of the spatial information, and the labeled cellular neighborhood (CN) may mislead users. Additionally, although the L-R pairs can be categorized into three classifications based on interaction distance, SCII only uses different radii to infer cell communication, without employing a different strategy for predicting interactions in distinct L-R classes. I have the following detailed comments.

      Comments: 1. The authors fail to show the novelty and advantages of CN compared to other methods, such as DeepST, which integrates gene expression, spatial location, and image information. The authors should provide a comparison with several recent strategies that consider the effect of local niches, including BANKSY, stLearn, Giotto, and DeepST. 2. The authors should compare SCII with additional methods such as CellPhoneDB v3 and CellChat v2, demonstrating its superior performance. 3. The method used for cell segmentation should be described in more detail rather than solely citing "Li, M. et al. (2023)". 4. Format of the paper: the alignment inconsistency within the manuscript, with some paragraphs centered and others justified, should be corrected for uniformity. 5. The figures and manuscript containing 'Teff' and 'M2-like' cell types should provide a legend explaining the abbreviations for clarity. 6. The font size of the labels in Figures 5E-F is insufficient for easy reading and should be enlarged. Re-review: In the response letter, the author emphasizes the novelties of the StereoSiTE framework and demonstrates how the StereoSiTE software was specifically designed to address the question of "how iTME responds and functions under stimulation" using Stereo-seq data. The author highlights notable enhancements to the self-developed algorithms, including CN and SCII. The CN algorithm focuses on evaluating the cell composition in the iTME, while SCII is designed to infer the intensity of spatial cell interactions. These advancements have been incorporated into the updated version of the manuscript. Notably, the SCII component of the framework combines spatial information and expression patterns to restrict inferred cell-cell communication to physically reachable interactions, thereby reducing false positives. The authors have also employed distinct strategies to predict different types of L-R pairs with varying interaction distances, encompassing secreted signaling, ECM-receptor, and cell-cell contact. In the case of secreted-type L-R pairs, SCII enables the specification of varying radius thresholds to infer spatial cell communication. However, it is recommended that the authors consider the exponential decay of expression values, particularly when the radius exceeds 100 μm (see the sketch below).

      The response also outlines the authors' claim that CN exhibits good performance compared to other tissue domain division methods (BANKSY and Giotto HMRF). However, upon reviewing the performance comparison results, it becomes apparent that BANKSY outperforms the other methods, although the CN method shows nearly consistent performance with BANKSY on the benchmark dataset STARmap. To substantiate the preference for CN over BANKSY, the authors are encouraged to provide evidence of its user-friendly interface, shorter run time, or lower memory usage. Overall, the revisions and enhancements made to the StereoSiTE framework significantly improve its functionality and analytical capabilities. The StereoSiTE software holds great promise in providing invaluable insights and support for potential users and researchers in the field.
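      A minimal sketch of the re-review's decay suggestion, combining a radius-thresholded sum of sender-ligand and receiver-receptor expression (the summation SCII is described as using) with an exponential decay beyond ~100 µm. All names and parameter values here are illustrative assumptions, not the authors' code.

      ```python
      import numpy as np

      def scii_intensity(dist, ligand_expr, receptor_expr, radius=200.0, scale=100.0):
          """Radius-thresholded sum of sender ligand and receiver receptor
          expression, with exponential decay applied beyond `scale` (~100 um)
          for secreted signaling.

          dist: (n_senders, n_receivers) pairwise distances
          ligand_expr: (n_senders,); receptor_expr: (n_receivers,)
          """
          reachable = dist <= radius                   # hard reachability cutoff
          decay = np.exp(-np.maximum(dist - scale, 0.0) / scale)
          pair_score = (ligand_expr[:, None] + receptor_expr[None, :]) * decay
          return float((pair_score * reachable).sum())
      ```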

    3. This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae078), and published as part of our Spatial Omics Methods series. The peer-reviews are as follows.

      Reviewer 1. Lihong Peng

      In this manuscript, the authors developed a computational framework named StereoSiTE to spatially and quantitatively profile the cellular neighborhood organized iTME by incorporating open-source bioinformatics tools with their self-proposed algorithm named SCII. This study is very meaningful. However, several problems remain.

      Major comments: 1. The authors incorporated several open-source bioinformatics tools. However, how do they ensure that their combination is optimal for spatially resolved cell-cell communication inference? For example, cell2location was used to deconvolute cellular composition and construct cellular neighborhoods. Why use cell2location for deconvoluting spatial transcriptomics data? Why not use the newest deconvolution algorithms, for example, SpaDecon, Celloscope, POLARIS, GraphST, SPASCER, and EnDecon? No model can adapt to all data. The authors should first verify that cell2location is the most appropriate cell type annotation tool for the iTME. If not, the subsequent analyses will not be appropriate.

      2. The authors claimed that they computed the decomposition losses of different combinations of the numbers of CN modules and CT modules. Which combinations? The authors should list them.

      3. When measuring spatial cell interaction intensity, the authors only simply summed up the ligand and receptor gene expression of the sender and receiver cells. Why not consider existing classical intercellular communication intensity methods? The authors should compare other intercellular communication intensity measurement methods. Please refer to the following citations: Cell-cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: data resources and computational strategies, Briefings in Bioinformatics. CellDialog: A Computational Framework for Ligand-receptor-mediated Cell-cell Communication Analysis, IEEE Journal of Biomedical and Health Informatics. Deciphering ligand-receptor-mediated intercellular communication based on ensemble deep learning and the joint scoring strategy from single-cell transcriptomic data, Computers in Biology and Medicine.

      4. For the protein-protein interaction analysis, the authors queried 628 significantly up-regulated genes in the CN5 area of treatment samples from STRING. Can all obtained proteins be ligands or receptors? In addition, they labeled hub genes and key protein-protein interaction networks; what were these hub genes and key networks used for?

      5. Which ligand-receptor pairs could mediate intercellular communication within the immune tumor microenvironment? Among these L-R pairs, which are known in existing databases and which are newly predicted?

      5. "The enrichment analysis of individual CN showed that each CN had a dominant cell type with a spatial aggregation (Fig 2F), which was increasingly obvious than that in whole slide (Fig 2E)." What's a dominant cell type? How to define it?

      6. "To reduce the variance among open-sourced L-R databases, we unified L-R database in SCII by choosing L-R dataset in CellChatDB, which assigned each L-R with an interaction distance associated classification as secreted signaling, ECM receptor and cell-cell contact." How to unify L-R database? Did it allow for user-specified LR databases and/or add user-specified LR databases?

      8. In Figure 3, how is it confirmed which L-R pairs mediate intercellular communication?

      9. StereoSiTE is composed of multiple modules; is it scalable? Can some of these modules (such as clustering and cell type annotation) be replaced with other, more powerful modules?

      10. The authors claimed that "CellPhoneDB detected many false positive interactions". How were these false positive LRIs found? How were the LRIs validated as false positives? Please list the false positive LRIs found.

      11. In Figure 3, the authors should add comparison experiments between StereoSiTE and classical intercellular communication analysis tools.

      Minor comments: 1. The text in subfigures A, B, and C in Supplementary Figure 2 is obscure. The authors should revise Supplementary Figure 2. 2. In the Abstract, iTME should be spelled out in full when it first appears. 3. Which citation in the reference list does "13 Li, M. et al. (2023)." correspond to?

      Re-review:

      In the revised manuscript, the authors made many revisions. However, many problems remain to be solved:

      1. The authors have compared the performance of cell2location with other cell type identification methods (Celloscope [10], GraphST [11], and POLARIS [12]) on both the STARmap and Stereo-seq datasets of liver cancer. How about its performance on other unlabeled datasets? Please compare it with "STGNNks: Identifying cell types in spatial transcriptomics data based on graph neural network, denoising auto-encoder, and 𝑘-sums clustering".

      2. Cell-cell communication is usually mediated by LRIs. The construction of high-quality LRI databases is very important to cell-cell communication. The authors should introduce these LRI data resources and potential LRI prediction methods and cite them, for example, PMID: 37976192, 37364528, 38367445.

      3. In Figure 4B, 4C, 4D, and 4F, Figure 5A and 5B, Figure 6B and 6C, the fonts are too small. Please enlarge the fonts.

      4. The organization and structure of this manuscript must be carefully revised. For example, the structure of the Discussion is obscure. In the first paragraph of this section, the authors introduce their proposed method; next, they describe it in detail. But the third paragraph elucidates the reason for developing the method. In addition, regarding "Figure 3 highlights that the analysis without distance threshold may lead to false positive results, and SCII showed more superior performance than other methods": why only Figure 3? Do the other results not support this conclusion? The final paragraph of the Discussion introduces the method again. It lacks logical flow.

      5. Where is the conclusion of this manuscript?

      6. The authors should analyze the limitations of this work to inform future work.

      7. English is VERY POOR. This manuscript must be carefully revised. For example,

      "prove that spatial proximity is a must to guarantee an effective investigation.", is a must to do?

      Re-re-review: The authors have solved my issues.

    1. DNA/RNA-stable isotope probing (SIP) is a powerful tool to link in situ microbial activity to sequencing data. Every SIP dataset captures distinct information about microbial community metabolism, kinetics, and population dynamics, offering novel insights according to diverse research questions. Data re-use maximizes the information available from the time and resource intensive SIP experimental approach. Yet, a review of publicly available SIP sequencing metadata reveals that critical information necessary for reproducibility and reuse is often missing. Here, we outline the Minimum Information for any Stable Isotope Probing Sequence (MISIP) according to the Minimum Information for any (x) Sequence (MIxS) data standard framework and include examples of MISIP reporting for common SIP approaches. Our objectives are to expand the capacity of MIxS to accommodate SIP-specific metadata and guide SIP users in metadata collection when planning and reporting an experiment. The MISIP standard requires five metadata fields: isotope, isotopolog, isotopolog label and approach, and gradient position, and recommends several fields that represent best practices in acquiring and reporting SIP sequencing data (e.g., gradient density and nucleic acid amount). The standard is intended to be used in concert with other MIxS checklists to comprehensively describe the origin of sequence data, such as for marker genes (MISIP-MIMARKS) or metagenomes (MISIP-MIMS), in combination with metadata required by an environmental extension (e.g., soil). The adoption of the proposed data standard will assure the reproducibility and reuse of any sequence derived from a SIP experiment and, by extension, deepen understanding of in situ biogeochemical processes and microbial ecology. Competing Interest Statement: The authors have declared no competing interest.
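      To illustrate what a MISIP-annotated sample record might look like when combined with another checklist, a hedged sketch follows; the field names are paraphrased from the abstract, and the official MIxS/MISIP checklist tokens may differ.

      ```python
      # Hypothetical MISIP-style record; all values are invented examples.
      misip_record = {
          # required MISIP fields
          "isotope": "13C",
          "isotopolog": "13C-glucose",
          "isotopolog_label": "isotopically labeled",  # vs. unlabeled control
          "sip_approach": "DNA-SIP",
          "gradient_position": 7,
          # recommended best-practice fields
          "gradient_density": 1.732,        # g/mL CsCl in the recovered fraction
          "nucleic_acid_amount": 18.5,      # ng of DNA recovered from the fraction
          # used in concert with other MIxS checklists, e.g. MISIP-MIMARKS + soil
          "env_package": "soil",
      }
      ```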

      Reviewer 3. Xiaoxu Sun

      The paper titled "MISIP: A Data Standard for the Reuse and Reproducibility of Stable Isotope Probing Derived Nucleic Acid Sequence and Experiment" presents a compelling argument for establishing a minimum information standard for stable isotope probing (SIP) experiments. The proposed MISIP standard aims to facilitate data reuse and ensure the reproducibility of results within the scientific community. The authors have meticulously considered the essential information required for MISIP, resulting in a well-articulated manuscript. However, I have a few suggestions that could further refine the proposed standard.

      To me, one critical aspect of MISIP is to ensure it provides necessary details of the SIP incubations. Although the authors have integrated some of this information, which can overlap with other existing standards like MIMS/MIMARKS (e.g., sample origin), there are additional elements that should be included in MISIP, either as mandatory or recommended information.

      Suggestion 1: Inclusion of Additional Substrates in Incubations

      The paper rightly identifies the isotopologue as a requisite detail for MISIP. However, I recommend expanding this requirement to include a mention of other substrates added during incubations, at least as a recommended piece of information. While specifying the primary substrate (e.g., 13C-labeled glucose) is often sufficient for studies targeting heterotrophic processes, the identification of autotrophic populations using substrates like 13C-bicarbonate necessitates the disclosure of electron donors/acceptors to clarify the targeted metabolic processes.

      Suggestion 2: Detailed Reporting of Incubation Progress

      Although incubation time is suggested as a recommended field, I propose that details regarding the progress of the specified reactions should also be documented, such as the incorporated dose. This is particularly relevant when different substrate doses are used, as these can yield varied outcomes. For instance, the rate of substrate utilization can significantly differ across inoculums at identical time points; coastal sediment might consume 1 mM of glucose in a day, whereas deep-sea samples might take longer. Therefore, merely reporting incubation time without context may not provide sufficient insight for readers to gauge the dynamics of potential cross-feeding or other relevant processes.

      In conclusion, integrating these suggestions into the MISIP standard could enhance its comprehensiveness and utility. By providing a more detailed framework, researchers can better interpret experimental setups and results, fostering a more robust foundation for data reuse and reproducibility in the field of stable isotope probing.

      Re-review: Nice work on addressing all the comments. All my concerns have been addressed.

    2. Reviewer 2. Jibing Li

      In this study, the authors meticulously delineated the Minimum Information about Stable Isotope Probing (MISIP) data standard within the broader framework of the Minimum Information about any (x) Sequence (MIxS) data standard. By extending the scope of MIxS to incorporate SIP-specific metadata, the authors have provided invaluable guidance to SIP practitioners regarding the collection and reporting of essential metadata for SIP experiments. Adoption of the proposed MISIP data standards is poised to significantly augment the reusability of sequence data derived from SIP experiments, thereby fostering a deeper understanding of in situ biogeochemical processes and microbial ecology. While the manuscript presents novel insights, further refinement is necessary to optimize its impact.

      The MISIP data standard holds paramount importance in the realm of stable isotope probe (SIP) technology as it standardizes the collection and reporting of metadata essential for SIP experiments. This significance will be elucidated in the introduction to underscore the necessity and relevance of the MISIP framework.

      The "Excess Atom Fraction" (EAF) serves as a pivotal metric for evaluating the isotopic enrichment of specific taxa, genomes, or genes in SIP experiments. It plays a crucial role in quantifying the incorporation of isotopically labeled substrates into microbial biomass, thereby providing valuable insights into microbial community dynamics and functional gene expression.

      The introduction section will be expanded to provide a comprehensive background on DNA/RNA-stable isotope probing (SIP) technology, emphasizing the need for standardized data reporting through the MISIP framework. This contextualization will elucidate the motivation behind the development of MISIP and underscore its significance in promoting data reuse and reproducibility in SIP research.

      To enhance transparency and credibility, a detailed account of the development process of the MISIP data standard, including the methodologies employed and potential challenges encountered, will be incorporated. This supplementary information will provide readers with insights into the rigor and practicality of the standard.

      Specific application cases showcasing the efficacy of the MISIP data standard in actual research scenarios will be integrated into the manuscript. These case studies will serve to illustrate the practical utility of MISIP and bolster the persuasiveness of the article.

      A comparative analysis of the MISIP data standard with existing similar standards will be conducted to highlight its advantages and uniqueness. This comparative approach will furnish readers with a comprehensive understanding of the distinctive features and benefits of MISIP.

      The article will delve into the limitations of the MISIP data standard, explore potential avenues for future improvement, and delineate its application prospects in fields such as microbial ecology. This discussion will offer critical insights into the current state and future trajectory of MISIP.

      The manuscript will be supplemented with a thorough examination of the limitations of MISIP data standards, potential avenues for future enhancement, and its implications for microbial ecology and other relevant fields. This holistic approach will ensure that the article comprehensively addresses all facets of the MISIP framework.

      Re-review: Overall, the authors addressed the questions I raised; however, the review of existing SIP research overlooked some representative authors I consider important, such as Thomas, F. (ISME J. 2019;13:1814-30) and Luo, CL (Environ Int. 2023;180:108215). The authors should include a more thorough review of the relevant literature to provide a well-rounded context for the study. Additionally, I identified several formatting errors in the manuscript, such as the incorrect citation in reference 15. These errors should be rectified to meet the journal's standards.

    3. This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae071). These reviews are as follows.

      Reviewer 1. Dayi Zhang

      The topic is quite interesting and important for microbiologists doing SIP work. However, there are some concerns about its quality and novelty. 1. 15N is widely used in SIP, but the authors did not mention it in this work. As an important labelling isotope, it is not acceptable to exclude 15N work. 2. The authors have well designed and explained the catalog of MISIP, but how to standardize data from different sources is not mentioned. In other words, there is only a method to put information together but no protocol to compare data from different studies or extract useful information from others' work. I think this is the most important expectation of this work. 3. As different protocols were used by different researchers to achieve SIP results, the authors should give criteria for their quality and ways to improve the quality for comparison. However, I cannot find such information. 4. For the reasons above, I think this is only a very preliminary concept, and the datasets and methods should be further developed for practical purposes.

      ---Editors Comments--- This work was then rejected to allow more work and revision, and then resubmitted.

    1. Editors Assessment:

      Being found only in the landlocked, isolated habitat of Lake Baikal makes the Baikal seal (Pusa sibirica) unique among all pinnipeds as the only freshwater seal. This paper presents reference-based assemblies of six newly sequenced Baikal seal individuals and one individual of the ringed seal, as well as the first short-read data for the harbor seal and the Caspian seal. These data aid the study of the genomic diversity of the Baikal seal and contribute baseline data to the limited genomic resources available for seals. Peer review extended the description of the tools and parameters used in the revised manuscript and provided more information on the methods. This newly generated sequencing data hopefully now helps to extend the phylogeny of the Phoca/Pusa group with genome-wide data and can also broaden the view of the genetic structure and diversity of the Baikal seal.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Background: The iconic Baikal seal (Pusa sibirica), the smallest true seal, is a freshwater seal endemic to Lake Baikal, where it became landlocked a few million years ago. It is a rather abundant species of least concern, despite its limited habitat. Until recently, before its reference genome was published, research on its genetic diversity had only been done with mitochondrial genes, restriction fragment analyses, and microsatellites. Findings: Here we report the genome sequences of six Baikal seals, and one individual each of the Caspian seal, ringed seal, and harbor seal, re-sequenced from Illumina paired-end short-read data. Heterozygosity estimates for the six newly sequenced individuals are similar to those of the previously reported genomes. In addition, the novel genome data for the other species contributed to a more complete phocid seal phylogeny based on whole-genome data. Conclusions: Despite the long isolation of the landlocked Baikal seal population, the genetic diversity of this species is in the same range as in other seal species. The Baikal seal appears to form a single, diverse population. However, targeted genome studies are needed to explore the genomic diversity throughout their distribution.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.142). These reviews are as follows.

      Reviewer 1. Yaolei Zhang

      Overall, the newly generated data from this study are valuable, but the authors have not effectively analyzed and interpreted the data. The entire paper appears to be more like an undergraduate bioinformatics homework exercise, with the results resembling a middle school student's description of a picture. Additionally, there are several major issues: 1. Background investigation is not sufficient: genomic data on the Baikal seal have been publicly available for five years, including a chromosome-level genome assembly of much higher quality, with, for instance, a contig N50 nearly ten times higher than that of the reference genome used by the authors in this study. 2. Methodology is unclear: the description of the software and parameters used is incomplete. A proper methodological description should allow a basic bioinformatics analyst to quickly reproduce the results of the paper; with the current description, there are too many missing details in the methodology section. 3. Data issues: • a. For publicly available data, the authors did not provide detailed descriptions of the accession numbers. • b. For the newly generated data in this study, the authors did not sufficiently describe the data quality to support their conclusions. • c. In the supplementary table, the authors show 100% mapping rates of sequencing reads for all samples. Having worked on numerous resequencing projects, I have rarely encountered 100% mapping rates, especially when aligning to different species. The authors should check this. 4. Basic analytical skill/experience is lacking: for example, in the PSMC analysis, sequencing depth can directly affect the results, but the authors did not consider this issue and proceeded to compare curves generated from different sequencing depths directly. Additionally, how was the mutation rate (μ) derived? The generation time is only mentioned as coming from the IUCN, but values are not provided in the paper. Moreover, in the genetic diversity section, is calculating heterozygosity alone sufficient to be considered a measure of genetic diversity? I hope the authors will read some re-sequencing papers thoroughly. Re-review: The authors carefully addressed most of my concerns. Although I still have doubts about the mapping rate (I did not find the mapping report attached), I am happy to accept this manuscript.

      Reviewer 2. Stephen Gaughran

      Are all data available and do they match the descriptions in the paper? Yes. NCBI numbers should be added when available.

      Comments: I would recommend using a lower mutation rate for seals: de novo mutation rates around 7e-9 have been measured for a few pinniped species. Line 129: I think you mean kya here (not Ma). Line 160: I think this should be "an average value of 0.066%"
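      For context on why the mutation rate and generation time the reviewers ask about matter, a minimal sketch of the standard PSMC rescaling (as performed by psmc_plot) is shown below. The function name and default values are illustrative, with the reviewer's suggested pinniped rate as an example.

      ```python
      def rescale_psmc(theta0, t_scaled, lambda_k, mu=7e-9, gen_time=10.0, s=100):
          """Standard PSMC rescaling for one epoch.

          theta0:   PSMC's estimated theta per bin
          t_scaled: epoch time in units of 2*N0 generations
          lambda_k: relative population size of the epoch
          mu:       per-site, per-generation mutation rate (7e-9 is the
                    pinniped value suggested by the reviewer)
          gen_time: generation time in years; s: bases per PSMC bin (default 100)
          """
          N0 = theta0 / (4.0 * mu * s)       # effective size of the reference epoch
          years = 2.0 * N0 * t_scaled * gen_time
          N_k = N0 * lambda_k
          return years, N_k
      ```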

    1. Editors Assessment:

      The article presents strategies for accelerating sequence alignment using multithreading and SIMD (Single Instruction, Multiple Data) techniques, and introduces a new algorithm called TSTA (Thread and SIMD-Based Trapezoidal Pairwise/Multiple Sequence-Alignment). The Technical Release write-up presents a detailed description of TSTA's performance in pairwise sequence alignment (PSA) and multiple sequence alignment (MSA) and compares it with various existing alignment algorithms, demonstrating the performance gains achieved by vectorized SIMD technology and the application of threading. Peer review involved testing, debugging a few errors, and adding more background detail. The result demonstrates TSTA's efficacy in pairwise and multiple sequence alignment, particularly with long reads, showcasing considerable speed enhancements compared to existing tools.

      This evaluation refers to version 1 of the preprint

    2. Abstract: The rapid advancements in sequencing length necessitate the adoption of increasingly efficient sequence alignment algorithms. The Needleman-Wunsch method introduces the foundational dynamic programming (DP) matrix calculation for global alignment, which evaluates the overall alignment of sequences. However, this method is known to be highly time-consuming. The proposed TSTA algorithm leverages both vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignments.
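      To make the quadratic cost concrete, a textbook Needleman-Wunsch global alignment is sketched below (score-only, linear gap penalty, illustrative scoring values, no traceback); this is the baseline DP that TSTA's vector- and thread-level parallelism accelerates, not TSTA's own striped implementation.

      ```python
      def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
          """Textbook global alignment: fills an (len(a)+1) x (len(b)+1) DP
          matrix row by row, hence O(len(a)*len(b)) time -- the cost that
          vector-level (SIMD) and thread-level parallelism attack.
          """
          cols = len(b) + 1
          prev = [j * gap for j in range(cols)]          # DP row 0
          for i in range(1, len(a) + 1):
              curr = [i * gap] + [0] * (cols - 1)
              for j in range(1, cols):
                  diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                  curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
              prev = curr
          return prev[-1]

      print(needleman_wunsch("GATTACA", "GCATGCU"))
      ```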

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.141). These reviews are as follows.

      Reviewer 1. Xingbao Song and Baoxing Song

      Zong et al. implemented a TSTA package that integrates the difference method, the stripe method, SIMD, and multi-threading approaches to perform efficient sequence alignments. The TSTA toolkit can conduct pairwise and multiple sequence alignments. The memory cost of TSTA is comparable with the most efficient existing tool. Overall, TSTA is a good package, and the manuscript is well written. I have a few suggestions: 1) minimap2 should be mentioned in the section on the "difference recurrence relation". It has a much broader user base and implements an algorithm slightly different from the one by Suzuki et al. 2) Striped SIMD is also implemented in read mappers, such as BWA. 3) Page 14: line 215 "1k bps", line 227 "1000 kbps", line 230 and Table 1 "100k"; these should be consistent. 4) In Table 4, I am not sure I understood the second and third lines correctly. Please clarify. 5) I tried to compile TSTA from the source code; to compile the package, I had to copy 'seqio.h' into the 'msa' and 'psa' folders. Please fix this.

      Reviewer 2. Yuansheng Liu

      The article explores strategies for accelerating sequence alignment using multithreading and SIMD (Single Instruction, Multiple Data) techniques, and introduces a new algorithm called TSTA. The paper provides a detailed description of TSTA's performance in pairwise sequence alignment (PSA) and multiple sequence alignment (MSA), and compares it with various existing alignment algorithms. Experimental results indicate that TSTA demonstrates significant speed advantages, particularly when handling long sequences and in the no-backtracking mode. However, the experiments on MSA are limited by the experimental environment, which does not fully address the needs of current sequencing technologies concerning long reads and depth. Specifically, the low number of sequences in MSA does not meet the requirements for downstream genomic analysis applications. While the algorithm is highly innovative, its performance on short sequences and during the backtracking phase still requires optimization.

      1. In line 7, the TSTA algorithm utilizes vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignment. Why are there no experiments designed specifically to evaluate the global alignment performance of TSTA with vector-level parallelism? Or are there any other experimental designs that demonstrate the improved performance of TSTA when vector-level parallelism is employed?

      2. In line 149, is the Active-F method used by the TSTA algorithm contributing to the excessive memory usage and access-time overhead observed during the iterative process of PSA? Are there better optimization strategies from this perspective? If not, why does TSTA incur higher time costs in traceback, as shown in Table 1? Why does bsalign have lower time consumption?

      3. Can you provide the time breakdown for each part of the parallel computation in TSTA for PSA (including at least CPU computation overhead, communication overhead, and I/O overhead) to clarify whether there will be significant communication overhead issues with larger datasets and more threads?

      4. Table 2 shows that both the real and simulated datasets have issues with insufficient depth and short reads. In real MSA processes, it is common to encounter comparisons with depth over 60X and long reads exceeding 100 kbp. The results under the current experimental conditions seem to perform poorly for such data scenarios. Can you address this?

      5. Gene data often include repetitive regions that affect the accuracy of alignment algorithms. Can you design experiments to verify how TSTA performs in aligning long repetitive regions? Specifically, how accurately does TSTA align sequences in such regions compared to other methods?

      6. Besides repetitive regions, sequencing errors produced by ONT R10 chips can also impact alignment accuracy. Alignment algorithms used in genome correction often struggle to detect such errors. How does TSTA handle such issues during MSA? Can the algorithm be designed to address these sequencing errors more effectively?

      Re-review: After thoroughly reviewing the revised manuscript and testing the TSTA tool, I cannot endorse the manuscript for publication in its current form. I encourage the authors to address the following issues thoroughly and consider re-submitting after significant improvements. Efficiency concerns: in the context of multiple sequence alignment (MSA), I find that TSTA does not demonstrate a significant advantage in terms of efficiency. I conducted a test with approximately 2 GB of homologous diploid reads (not too large a dataset), and the tool has been running for around 29 hours. Despite this extensive runtime, the process remains incomplete. This is far from the efficiency one would expect from a tool designed for large-scale sequence alignment. Functionality issues: there are still unresolved issues with the tool's functionality. The -f parameter does not appear to work as intended, and there are also problems with the -o parameter. Such issues need to be addressed to ensure the tool's reliability and usability.

    1. Editors Assessment:

      The crested gecko (Correlophus ciliatus) is a lizard species endemic to New Caledonia, and a potentially interesting model organism due to its unusual (for a gecko) inability to regenerate amputated tails. With that in mind, a new reference genome for the species is presented here, assembled using the PacBio Sequel II platform and Dovetail Omni-C libraries, producing a genome with a total size of 1.65 Gb, 152 scaffolds, an L50 of 6, and an N50 of 109 Mb. Peer review made sure more detail was added on data acquisition and processing to enhance reproducibility, in the end producing potentially useful data for studying the genetic mechanisms involved in the loss of tail regeneration.

      This evaluation refers to version 1 of the preprint

    2. Abstract: The vast majority of gecko species are capable of tail regeneration, but singular geckos of the Correlophus, Uroplatus, and Nephrurus genera are unable to regrow lost tails. Of these non-regenerative geckos, the crested gecko (Correlophus ciliatus) is distinguished by ready availability, ease of care, high productivity, and hybridization potential. These features make C. ciliatus particularly suited as a model for studying the genetic, molecular, and cellular mechanisms underlying the loss of tail regeneration capabilities. We report a contiguous genome of C. ciliatus with a total size of 1.65 Gb, a total of 152 scaffolds, an L50 of 6, and an N50 of 109 Mb. Repetitive content makes up 40.41% of the genome, and a total of 30,780 genes were annotated. Assembly of the crested gecko genome provides a valuable resource for future comparative genomic studies between non-regenerative and regenerative geckos and other squamate reptiles. Findings: We report genome sequencing, assembly, and annotation for the crested gecko, Correlophus ciliatus.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.140). These reviews are as follows.

      Reviewer 1. Anthony Geneva and Cleo Falvey

      In their revised manuscript Gumangan and colleagues have addressed each of the comments we made on the original manuscript via substantial revisions. We appreciate the improvements the authors have made but feel there are a few remaining issues that require attention, detailed below. Those issues notwithstanding, this new assembly and annotation represent a very nice contribution to the field and will certainly be widely used.

      Specific comments: Pages 2 and 6: Each time L50 or L90 statistics are reported, they are listed with the units “bp”. These values are counts of scaffolds and are typically reported simply as integers without units (a short sketch of the computation follows these comments). Page 3: “Furthermore, C. ciliatus is the only non-regenerative lizard species capable of hybridizing with regenerative relatives, specifically C. sarasinorum, Mniarogekko chahoua, and Rhacodactylus auriculatus.” This statement is very interesting but requires a reference or at least attribution of some kind (e.g., personal observation by one of the co-authors). Page 3: “Genomic DNA was sequenced using the Illumina Novaseq 6000 platform. 185.8 gigabase-pairs of PacBio CCS reads were used as inputs to Hifiasm v0.15.4-r347 [8] with default parameters.” The sequencer listed here for generating long reads seems to be an error and should be some PacBio platform (Sequel, Sequel IIe, etc). Page 6: “The contig/scaffold N50 is 109 Mb, and the largest scaffold had a length 1169 Mbp (Table 1)”. 1169 should be 169.
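      For illustration, a minimal sketch of how N50 and L50 are derived from scaffold lengths, which shows why L50 is a unitless count of scaffolds while N50 is a length in bp:

        def n50_l50(lengths):
            """Return (N50 in bp, L50 as a scaffold count) for a list of lengths."""
            ls = sorted(lengths, reverse=True)
            half = sum(ls) / 2
            acc = 0
            for count, l in enumerate(ls, start=1):
                acc += l
                if acc >= half:          # smallest set of scaffolds covering half the assembly
                    return l, count

        print(n50_l50([100, 80, 60, 40, 20]))  # (80, 2): N50 = 80 bp, L50 = 2 scaffolds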

      Reviewer 2. Zexian Zhu

      Review comments are in the following link: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTU3L3Jldmlldy5kb2N4

      Reviewer 3. Chaochao Yan

      Are all data available and do they match the descriptions in the paper? No. In the section "Availability of supporting data," it is stated that "supporting datasets, including annotation, are available at GigaDB." However, I was unable to locate these datasets during my search. Could you please provide a direct link or the accession number to access these resources?

      Is the data acquisition clear, complete and methodologically sound? No. The manuscript currently lacks detailed information regarding the samples and data used to assemble and annotate the reference genome. For instance, it does not specify how many samples or libraries were used for RNA-Seq or whole-genome sequencing. I suggest including a table that provides comprehensive information on the samples and sequencing data. Additionally, while the manuscript mentions that "Genomic DNA was sequenced using the Illumina Novaseq 6000 platform," the corresponding Illumina data are not described. I am unclear about how the PacBio CCS reads were produced. Could you please provide more details or clarify the methodology used to generate these reads?

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Some methods described in the manuscript lack sufficient detail, particularly for tools such as BLAST, BlobTools, HiRise, and BWA. To ensure reproducibility, I recommend providing the specific parameters used for these analyses.

    1. Editors Assessment:

      This paper presents the SMARTER database, a collection of tools and scripts to gather, standardize, and share with the scientific community a comprehensive dataset of genomic data and metadata on worldwide small ruminant populations. It has come out of the EU multi-actor (12-country) H2020 project SMARTER: SMAll RuminanTs breeding for Efficiency and Resilience, and brings together genotypes for about 12,000 sheep and 6,000 goats, alongside phenotypic and geographic information. The paper provides insight into how the database was put together, presenting the code for the SMARTER frontend, backend and API, alongside instructions for users. Peer review tested the platform and provided suggestions on improving the metadata, demonstrating that the project provides valuable information on sheep and goat populations around the world and can be an essential tool for ruminant researchers, enabling them to generate new insights, offering the possibility to store new genotypes, and driving progress in the field.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Underutilized sheep and goat breeds have the ability to adapt to challenging environments due to their genetic composition. Integrating publicly available genomic datasets with new data will facilitate genetic diversity analyses; however, this process is complicated by important data discrepancies, such as outdated assembly versions or different data formats. Here we present the SMARTER-database, a collection of tools and scripts to standardize genomic data and metadata, mainly from SNP chip arrays, on global small ruminant populations with a focus on reproducibility. SMARTER-database harmonizes genotypes for about 12,000 sheep and 6,000 goats to a uniform coding and assembly version. Users can access the genotype data via FTP and interact with the metadata through a web interface or programmatically using their custom scripts, enabling efficient filtering and selection of samples. These tools will empower researchers to focus on the crucial aspects of adaptation and contribute to livestock sustainability, leveraging the rich dataset provided by the SMARTER-database.
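      To illustrate the programmatic metadata access mentioned above, here is a hedged Python sketch; the base URL, route, query parameters, and response fields are assumptions for illustration and should be checked against the SMARTER-database documentation:

        import requests

        BASE = "https://webserver.ibba.cnr.it/smarter-api"   # assumed base URL

        def fetch_samples(species="Sheep", country=None, page_size=100):
            """Page through sample metadata, yielding one record at a time."""
            params = {"species": species, "size": page_size, "page": 1}
            if country:
                params["country"] = country
            while True:
                resp = requests.get(f"{BASE}/samples", params=params)
                resp.raise_for_status()
                payload = resp.json()
                yield from payload.get("items", [])          # assumed response shape
                if params["page"] >= payload.get("pages", 1):
                    break
                params["page"] += 1

        # e.g. collect all sheep samples recorded for Italy
        italian_sheep = list(fetch_samples(species="Sheep", country="Italy"))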

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.139). These reviews are as follows.

      Reviewer 1. Ran Li

      The authors presented an online SMARTER-database, which collected a large number of genotype data for sheep and goats. The resources are of great importance for the community.

      My major concerns:
      1) The link webserver.ibba.cnr.it is not accessible.
      2) For sheep, the database supports the reference genome assemblies Oar3 and Oar4, but Oar3 is actually rarely used; meanwhile, the current ovine reference genome assembly (ARS-UI_Ramb_v3.0) is not supported.
      3) For the presentation of metadata (https://webserver.ibba.cnr.it/smarter/breeds?species=Sheep), I suggest providing additional columns indicating the region and country.
      4) For the datasets (https://webserver.ibba.cnr.it/smarter/datasets), references are needed to know where the data come from.

      Re-review:

      My comments have been properly addressed. The manuscript is acceptable for publication.

      Reviewer 2. Hans Lenstra and Johannes A. Lenstra

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. This is implicitly clear and does not need to be elaborated upon.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? No. This does not seem necessary.

      Is the code executable? Unable to test.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test.

      Is the documentation provided clear and user friendly? Yes. I did not test this.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? No. I did not see such a list, but I would not be able to assess this.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Not applicable.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No. I did not find any of this but it does not seem to be essential.

      Additional Comments: This manuscript describes a highly useful database of sheep and goat genome-wide SNP genotypes from several sources, supplemented with phenotypes and geographic locations. I recommend this manuscript for publication in Gigascience after a revision. There is some missing information, and the presentation should become less cryptic to readers who are less familiar with the bioinformatic terminology.

      Missing info:
      1. The title and abstract do not mention that SMARTER focuses on SNPs that are genotyped on bead arrays or related technologies. The focus on genome-wide (GW) SNP genotypes, which only partially represent the total genomic diversity, should already be clear from the Title and the Abstract.
      2. Nowadays there are more publications on WGS data, T2T sequences and pangenomes than on GW SNP genotypes, so people may wonder if GW SNP genotypes are still useful. It may be emphasized that bead arrays allow an affordable analysis of many animals, and that genotypes derived from WGS data contain many false homozygous scores if not sequenced at a very high coverage.
      3. Figures 2 and 3 give an idea of the geographic coverage, but what is the unit of the numbers visualized in the heat map (0 to 2300 for sheep, 0 to 1100 for goats)?
      4. It is not clear which published data have been used or not. We recommend presenting a supplemental table describing the current contents: country, breed, number of animals, number of SNPs (at least 50K or HD), reference.
      5. Is there an organized effort to update the database, which ideally should contain all published GW SNP datasets?
      6. In my experience, for most GW SNP datasets only the filtered data after quality control (typically 45 to 49K, less than 42K if sheep 50K and HD genotypes are combined) are available. How is this handled?
      7. It may be mentioned that after omission of A/T and G/C SNPs the TOP strand consists only of A/C and A/G SNPs (a small illustrative sketch follows this list).
      8. The problematic SNPs are mentioned twice within the last paragraph of the section Data Composition.
      9. Does SMARTER allow storing phased datasets and showing the variant haplotypes? These can now be generated by long-read sequencing and are needed for several downstream analysis options.
      10. Table 1: OAR3 = Oar_v3.1 and OAR4 = Oar_v4.0? Please use the official codes.
      11. Are there options to convert the data to newer assemblies? For instance, the sheep ARS-UI_Ramb_v3.0 is superior to Oar_v4.0. I have used an NCBI tool for conversion of Oar_v1.0 (most popular for 50K datasets) and Oar_v3.1 (used often for sheep HD datasets) to Oar_v4.0, but this tool has probably been discontinued and was not available for goat assemblies.
      12. I have repeatedly found that most published or unpublished databases contain several errors, such as duplicates and outliers caused by mislabeling or crossbreeding. Because these are better removed prior to downstream analysis, data curation would be desirable, for instance by inspection of an NJ tree of individuals. This also shows the degree of breed-level differentiation, for instance the relationships of different populations of a transboundary breed. These caveats should at least be mentioned.
      13. Another caveat: is there a systematic check on the validity of the merging of datasets, by testing whether breeds sampled independently by different institutes cluster closely together?

      Presentation:
      14. Abbreviations should not be used in the abstract. What is "REST API"? These abbreviations are of course in the list, but what is "Representational State Transfer"? And "JSON Web Token"?
      15. Figure 1 needs more guidance via the legend. Do the boxes show alternative formats? What are "str" and "dict"?
      16. Figure 5 is useful and seems to retrieve data for the goat Alpine and Bionda dell'Adamello breeds. It would also be useful to show other "API-URLs" (is this user input?) while describing in plain language what is being accomplished.
      17. Figure 6: does bold indicate the user input? What exactly is an "array [string]" (give an example)? A few other examples would be most instructive and familiarize the reader with the logic of SMARTER.
      18. In the section "The SMARTER-database project": what is a mongoengine?
      19. In the same section: "Finally the VariantSpecie abstract class is inherited by . . .": this sentence is difficult to understand.
      20. In the section Reproducibility: please give a short description of the use of the Conda and Docker programs.
      21. Same section: "Raw data undergoes initial exploration", "structure and potential issues": can you be more specific? The last part of this section is also difficult to follow.
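      A small illustrative sketch of the point in item 7, assuming biallelic SNPs written as "REF/ALT" strings (this is not SMARTER-database code): A/T and G/C SNPs are strand-ambiguous and are dropped; every remaining SNP can then be written on the TOP strand as A/C or A/G.

        AMBIGUOUS = {frozenset("AT"), frozenset("GC")}
        COMPLEMENT = str.maketrans("ACGT", "TGCA")

        def to_top(alleles):
            """Return a TOP-strand 'A/x' representation, or None if strand-ambiguous."""
            a, b = alleles.split("/")
            if frozenset((a, b)) in AMBIGUOUS:
                return None                      # cannot be oriented without strand info
            if "A" not in (a, b):                # BOT strand: complement to reach TOP
                a, b = a.translate(COMPLEMENT), b.translate(COMPLEMENT)
            return "A/" + (b if a == "A" else a)

        print([to_top(s) for s in ["A/G", "T/C", "A/T", "G/C", "C/A"]])
        # ['A/G', 'A/G', None, None, 'A/C']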


    1. Editors Assessment:

      This paper presents NucBalancer, an R pipeline and Shiny app designed for the optimal selection of barcode sequences for sample multiplexing in sequencing. It provides a user-friendly interface aiming to make this process accessible to both bioinformaticians and experimental researchers, enhancing its utility in adapting libraries prepared for one sequencing platform to be compatible with others. This is important now that the introduction of additional sequencing platforms by Element Biosciences (AVITI System) and Ultima Genomics (UG100) has increased the diversity and capability of genomic research tools available. NucBalancer's incorporation of dynamic parameters, including customizable red-flag thresholds, allows for precise and practical barcode sequencing strategies. This adaptability is key in ensuring uniform nucleotide distribution, particularly in MGI sequencing and single-cell genomic studies, leading to more reliable and cost-effective sequencing outcomes across various experimental conditions. All the code is available under an open source license, and upon review the authors have also shared the code for the Shiny app.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Recent advancements in next-generation sequencing (NGS) technologies have brought to the forefront the necessity for versatile, cost-effective tools capable of adapting to a rapidly evolving landscape. The emergence of numerous new sequencing platforms, each with unique sample preparation and sequencing requirements, underscores the importance of efficient barcode balancing for successful pooling and accurate demultiplexing of samples. The recent launch of new sequencing systems claiming affordability comparable to that of more established platforms further exemplifies these challenges, especially when libraries originally prepared for one platform need conversion to another. In response to this dynamic environment, we introduce NucBalancer, a Shiny app developed for the optimal selection of barcode sequences. While initially tailored to meet the nucleotide composition challenges specific to G400 and T7 series sequencers, NucBalancer's utility broadens significantly to accommodate the varied demands of these new sequencing technologies. Its application is particularly crucial in single-cell genomics, enabling the adaptation of libraries, such as those prepared for 10x technology, to various sequencers including G400 and T7 series sequencers. By facilitating the efficient balancing of nucleotide composition and the accommodation of differing sample concentrations, NucBalancer plays a pivotal role in reducing biases in nucleotide composition. This enhances the fidelity and reliability of NGS data across multiple platforms. As the NGS field continues to expand with the introduction of new sequencing technologies, the adaptability and wide-ranging applicability of NucBalancer render it an invaluable asset in genomic research. This tool addresses current sequencing challenges, ensuring that researchers can effectively balance barcodes for sample pooling regardless of the sequencing platform used.
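      As a rough illustration of the balancing check described above (a minimal sketch under simplifying assumptions, not NucBalancer's actual algorithm): compute the concentration-weighted base fraction at every sequencing cycle of a candidate barcode pool, and flag cycles where any base leaves a tolerated range, analogous to a red-flag threshold.

        import numpy as np

        def flag_unbalanced_cycles(barcodes, weights=None, low=0.10, high=0.60):
            """Return indices of sequencing cycles whose base composition is skewed."""
            L = len(barcodes[0])
            w = np.ones(len(barcodes)) if weights is None else np.asarray(weights, float)
            w = w / w.sum()                       # normalize sample concentrations
            flagged = []
            for cycle in range(L):
                frac = {base: w[[bc[cycle] == base for bc in barcodes]].sum()
                        for base in "ACGT"}
                if min(frac.values()) < low or max(frac.values()) > high:
                    flagged.append(cycle)
            return flagged

        pool = ["ACGTACGT", "CGTACGTA", "GTACGTAC", "TACGTACG"]  # perfectly balanced
        print(flag_unbalanced_cycles(pool))  # []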

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.138). These reviews are as follows.

      Reviewer 1. Aamir Khan

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      Yes. The tool has novel features not reported in previous tools for barcoding.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      Yes. The tool is available as an R script as well as a shiny app.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. I would suggest mentioning a few features that are novel or superior to other tools. Perhaps adding a table specifying these novel features that are not part of existing tools would add value to the MS.

      Is the documentation provided clear and user friendly?

      Yes. The documentation is provided in a clear and user-friendly way. The input file formats are given on the GitHub page. It would be better to add an example to the Shiny app page.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes. Dependencies are mentioned on the tool documentation page and can be installed if R is already installed.

      Additional Comments: The authors have a well-written MS describing the NucBalancer tool. The tool adds value for sequencing by pooling samples and will be useful as we make technological advancements in the sequencing space.

      Reviewer 2. Hugo Varet

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      Yes. The manuscript explains the constraints to be satisfied when looking for barcodes but more details about the context (Illumina chemistry for instance) would be appreciated. Moreover, is the software compatible with dual-indexing?

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      Yes. The source code of the program is available on GitHub as an R script, but the source code of the Shiny application is not available.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      Yes. Support can be asked by email to the authors as stated at the end of the README on GitHub.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Yes. The example command line works well. However, the R script needs the shiny and xtable packages to be loaded, even though none of their functions is actually called in the script.

      Is the documentation provided clear and user friendly?

      No. More detailed documentation would improve the proposed application, in particular more details about the different chemistries used by Illumina, MGI, etc., and the constraints for finding compatible barcodes.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      No. The strategy used to find barcodes seems very simple, but more details would improve the manuscript.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      No. The manuscript cites several packages developed to find compatible sequencing barcodes, but their performance is not compared. Moreover, we do not know whether NucBalancer still works with a high number of samples/barcodes.

      Are there (ideally real world) examples demonstrating use of the software?

      No. A real-world example would be appreciated to illustrate the software, especially a scenario in which the other cited solutions were not able to find compatible barcodes.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

      No.

      Additional Comments: I would suggest the authors improve the design of the Shiny app, as (at the moment) it only runs an R script and prints the result. Moreover, I think the quality of the R code could easily be improved (e.g. loops with strange counters or comparisons with booleans).

      Re-review: I thank the authors for the improvements they made in this new version of the manuscript. At this stage, I'm not totally satisfied, for the following reasons:
      - The authors state that the source code of the Shiny app is now available on GitHub, but I have not been able to find it.
      - In the manuscript, the sentence "The tool does not have any dependency other than the utilities from the base R package" is no longer true, as the tool now uses optparse.
      - In Table 1, checkMyIndex is referenced with no web interface available, while one actually exists (https://checkmyindex.pasteur.fr/).
      Moreover, the proposed web interface could still be improved. For instance:
      - It would be great to add something to show that the algorithm is currently looking for a solution.
      - Check that the input files have a valid structure before they are used.
      - Display the input files when they are loaded, to make sure the user uploaded the correct file.

      Reviewer 3. Wen Yao

      The authors report a new tool for barcode sequence design. The tool is developed using R/Shiny and is available for use online. Below are my comments for further improvement of the manuscript and the tool.
      1. Please provide a "load example data" button in the Shiny app. With this button, the example data can be easily loaded by users for testing NucBalancer.
      2. The URL for NucBalancer (http://146.118.68.98:8888/) should also be given in the manuscript.
      3. The "Download Table" button is not working.
      4. The format of the input data should be checked, as input data in the wrong format caused NucBalancer to crash.
      5. The authors should compare NucBalancer with similar published tools in this field. More details are required.

      Re-review: The authors have addressed all my concerns.

    1. The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, adaptation to higher sample coverage, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting two large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science. Competing Interest Statement: The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae065), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license.

      Reviewer: Shilpa Garg

      V-pipe 3.0 is introduced as an advanced computational pipeline tailored for the analysis of next-generation sequencing data from short viral genomes. Designed to meet the challenges posed by the vast and diverse datasets generated by these technologies, V-pipe 3.0 emphasizes reproducibility, scalability, adaptability, and transparency. It achieves this by adhering to Snakemake's best practices, allowing easy swapping of virus-specific configuration files, and providing thoroughly tested examples online.
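      As a generic illustration of those Snakemake conventions (a virus-specific reference drawn from a swappable configuration file, plus a pinned per-rule Conda environment), consider the sketch below; it is a hypothetical rule illustrating the pattern, not taken from V-pipe's actual workflow:

        # hypothetical Snakemake rule, illustrating the pattern rather than V-pipe itself
        rule align:
            input:
                reads="data/{sample}.fastq.gz",
                ref=config["reference"]        # virus-specific, swapped via the config file
            output:
                bam="results/{sample}/aligned.bam"
            conda:
                "envs/align.yaml"              # pinned per-rule software environment
            threads: 4
            shell:
                "minimap2 -ax sr -t {threads} {input.ref} {input.reads} "
                "| samtools sort -o {output.bam}"

      Because each rule declares its own environment, rerunning the workflow on another machine resolves the same pinned software versions, which is what makes the configuration-swapping approach portable.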

      The utility of V-pipe 3.0 is showcased through its application in two extensive data analysis projects, proving its efficacy in sustainable viral genomic data science. Central to V-pipe 3.0 is its capacity for estimating viral diversity from sequencing data. A versatile benchmarking module has been developed to continuously assess various diversity estimation methods, accommodating the rapid advancements within this field. The pipeline simplifies the inclusion of new tools and datasets, supporting both synthetic and real experimental data. However, challenges in global haplotype reconstruction highlight the need for scalable methods that can accurately reflect the complex population structures of viruses and manage the uncertainties in the results.

      Some additional clarification in the manuscript would be appreciated.
      1) I'm curious about how the efficiency is attained.
      2) Is it possible to utilize V-pipe for analyzing other microorganisms?
      3) The authors might consider directing readers to the following review article for reference: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02328-9
      4) Identifying specific genes or genome regions with high polymorphism across different populations would be fascinating. How does V-pipe handle analysis in these highly variable regions?

    2. Abstract: The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, adaptation to higher sample coverage, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting two large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science. Competing Interest Statement: The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae065), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license.

      Reviewer: Fotis Psomopoulos

      The manuscript showcases a computational pipeline designed for analyzing next generation sequencing data of short viral genomes, namely V-pipe 3.0. After an overview of the challenge the tool is addressing, i.e. the necessity of continuous benchmarking of various methods due to their diverse performance across different scenarios,the paper continues with a detailed listing of the results, highlighting the key elements of Reproducibility, Scalability, Adaptability and Transparency.

      The next section provides some details on the three applications / demonstrations of V-Pipe 3.0, i.e. the Swiss SARS-CoV-2 Sequencing Consortium, the Swiss surveillance of SARS-CoV-2 genomic variants in wastewater, and the global haplotype reconstruction benchmark. This is followed by a comprehensive comparison of V-Pipe 3.0 to other relevant viral bioinformatics pipelines for within-sample diversity estimation, focusing on functionalities and sustainability, specifically nf-core/viralrecon, HAPHPIPE and ViralFlow, as well as a section discussing the main advantages of V-Pipe 3.0 and the rationale for some of the identified drawbacks.

      The paper concludes with a thorough description of the underlying methods of V-Pipe 3.0 as well as of the data used. Overall the paper gives a very good presentation of V-Pipe, and makes a strong case about its use and value in a real-world challenge. An overall comment is that there is some confusion on the role of V-Pipe 3.0 as a workflow - i.e. whether it's a dynamic system that uses different tools per step based on user input, or an automated system that benchmarks the analysis using (e.g.) synthetic data as the baseline. In either case, there are also a few unclear points in the manuscript itself that could be further improved.

      Specifically:
      -- It is not clear how V-pipe 3.0 differs from V-pipe. Although there is an indication of significant differences, an overview of the new features implemented in this version and/or a small introductory paragraph would be useful.
      -- In the "Results" section, lines 130-225 appear to refer to the implemented methodology and might be better served as part of the "Methods" section.
      -- In the "Results" section, lines 135-138 imply that GitHub Actions are used to ensure reproducibility of the workflow. Some more elaboration on this would be very useful, as GitHub Actions are commonly used to automate processes (such as testing, conflict resolution, etc.). In particular, a reproducibility issue that might not be resolvable by GitHub Actions is dependency conflicts that are specific to the particular system being tested.
      -- In the "Results" section, lines 139-146, it's not clear how the benchmark study contributes to the overall reproducibility of V-pipe 3.0. Some further explanation of the rationale would be very useful here.
      -- In the "Results" section, lines 179-183, it is not clear how Git and GitHub ensure adaptability of any new features that are implemented. Usually a version control/automation system can facilitate the integration of new features, but it's not readily evident how it supports/ensures/facilitates adaptability. Maybe a definition of "adaptability" in this particular context could also help.
      -- In the "Applications" section, it is not clear which version of V-Pipe was used for the overall analysis (V-pipe or V-pipe 3.0), especially in the wastewater use case.
      -- In the section "Comparison to other workflows" it is not very clear which tools are implemented within V-pipe 3.0, which differences there are from the previous version (V-pipe), and how these differ from other pipelines. A table summarizing these details and highlighting the differences would be very useful here.
      Moreover, there are a few minor points that would enhance the readers' understanding:
      -- (minor) In the section "2.1 Reproducibility", it's mentioned that all software dependencies are defined in Conda environments, making V-pipe 3.0 portable between different computing platforms. Is there a particular reason why V-Pipe itself isn't implemented as a Conda package directly?
      -- (minor) More often than not, the pandemic is named COVID19, in contrast to the virus, which is named "SARS-CoV-2". It may be useful to amend/update the references to the "SARS-CoV-2 pandemic" accordingly.

  13. Oct 2024
    1. Editors Assessment:

      This paper reports the establishment of the International Cannabis Genomics Research Consortium (ICGRC) web portal, which leverages the open source Tripal platform to enhance data accessibility and integration for Cannabis sativa (Cannabis) multi-omics research. The aim is to bring together the wealth of publicly available genomic, transcriptomic, proteomic, and metabolomic datasets to improve cannabis for food, fiber and medicinal traits. Tripal is a content management system for genomics data, presenting ready-to-use specialized 'omics modules for loading, visualization, and analysis, and is GMOD (Generic Model Organism Database) standards-compliant. The paper explains how the portal was put together and what data and features are available, and provides a case study for other communities wanting to create their own Tripal platform, covering the setup and customization of the Tripal platform, the re-engineering of modules for multi-omics data integration, and the addition of many other custom features that can be reused. Peer review fixed a few minor bugs and added clarifications on how the platform will be updated.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Global changes in Cannabis legislation after decades of stringent regulation, and heightened demand for its industrial and medicinal applications, have spurred recent genetic and genomics research. An international research community emerged and identified the need for a web portal to host Cannabis-specific datasets that seamlessly integrates multiple data sources and serves omics-type analyses, fostering information sharing. The Tripal platform was used to host public genome assemblies, gene annotations, QTL and genetic maps, gene and protein expression, metabolic profiles and their sample attributes. SNPs were called using public resequencing datasets on three genomes. Additional applications, such as SNP-Seek and MapManJS, were embedded into Tripal. A multi-omics data integration web-service API, developed on top of existing Tripal modules, returns generic tables of sample, property, and values. Use-cases demonstrate the API's utility for various -omics analyses, enabling researchers to perform multi-omics analyses efficiently.
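      To illustrate what consuming such generic (sample, property, value) tables might look like downstream, here is a small hypothetical sketch; the records below are invented for illustration and are not real ICGRC data:

        import pandas as pd

        # long-format rows, shaped like the generic tables the API is described as returning
        long = pd.DataFrame([
            {"sample": "S1", "property": "THC_percent", "value": 12.1},
            {"sample": "S1", "property": "CBD_percent", "value": 0.3},
            {"sample": "S2", "property": "THC_percent", "value": 0.4},
            {"sample": "S2", "property": "CBD_percent", "value": 9.8},
        ])

        # pivot to one row per sample for multi-omics style analyses
        wide = long.pivot(index="sample", columns="property", values="value")
        print(wide)

      Keeping the API response generic in this way lets the same client code handle metabolic profiles, expression values, or phenotypes without per-dataset schemas.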

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.137). These reviews are as follows.

      Reviewer 1. Weiwen Wang

      Is the code executable?

      Unable to test.

      This manuscript is about an online platform, and I am not sure how to test the code.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Same as above.

      Additional Comments:

      With the increasing legalization of cannabis in many countries today, exploring this crop has become a hot research topic. This manuscript by Mansueto et al. introduces a platform built on the Tripal framework, designed to facilitate multi-omics research in Cannabis sativa. The platform integrates genomic, transcriptomic, proteomic, and metabolomic data, providing researchers with a comprehensive resource for data analysis and sharing. Additionally, APIs have been developed, enabling rapid querying. The manuscript provides detailed information on how to customize Tripal modules and the Chado schema for managing biological entities. Finally, the manuscript highlights the importance of standardization in data storage and analysis, proposing community-wide adoption of standardized nomenclature to ensure consistency and traceability of data. Overall, the platform is poised to become a valuable resource for cannabis research and to advance scientific progress in related fields.

      While this manuscript was engaging, particularly in the sections on Tripal "re-engineering" and controlled vocabulary, I do have several concerns.

      1. Because my registration (using a business email) has not been approved, I cannot test the functions requiring ICGRC registration.

      2. The authors noted that the Cannabis Genome Browser has not been updated. Do the authors have a plan for updating ICGRC? If so, what is the proposed update frequency?

      3. ICGRC currently includes only a few cannabis cultivars, especially when compared to other platforms like Kannapedia and CannabisGDB. Do the authors have plans to add additional cultivars, such as First Light and Jamaican Lions mentioned in this manuscript, in the near future?

      4. When I tried to register using Gmail, an error popped up: ‘Domain is not allowed to register for this site’. Perhaps it would be clearer to instruct users directly to use a business email for registration.

      5. There is a data submission function in ICGRC, but the exact workings of this feature remain unclear to me. If a user submits a cannabis genome to ICGRC, will the data be visualized within specific modules, such as the synteny search or genetic mapping tools, on the platform?

      Reviewer 2. Hongyun Shang

      Is the code executable?

      Unable to test.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Unable to test.

      This is a comprehensive database with many features, which addresses the past lack of a genome database for cannabis. It is good work. Here are some minor suggestions:

      1. I did not find a function for searching gene and protein sequences directly by gene ID without providing a chromosome location, which is a common feature of many omics databases.
      2. In the chapter "The need for cannabis multi-omics databases and analysis platforms", the phrases "There are no analysis tools or results available on this website" and "No results available" seem inappropriate.
      3. In the chapter "Cannabis - Omics, Genetic and Phenotypic Datasets in the Public Domain", is "Crop Ontology" repeated ("Crop Ontology" Crop Ontology)?
    1. Editors Assessment:

      PhysiCell is an open source multicellular systems simulator for studying many interacting cells in dynamic tissue microenvironments. As part of the PhysiCell ecosystem of tools and modules, this paper presents a PhysiCell addon, PhysiMeSS (MicroEnvironment Structures Simulation), which allows the user to accurately represent the extracellular matrix (ECM) as a network of fibres. It can specify rod-shaped microenvironment elements such as the matrix fibres (e.g. collagen) of the ECM, giving the PhysiCell user the ability to investigate physical interactions with cells and other fibres. Reviewers asked for additional clarification on a number of features, and the paper now makes clear that future releases will provide full 3D compatibility and will include work on fibrogenesis, i.e. the creation of new ECM fibres by cells.

      This evaluation refers to version 1 of the preprint

    2. Abstract: The extracellular matrix is a complex assembly of macro-molecules, such as collagen fibres, which provides structural support for surrounding cells. In the context of cancer metastasis, it represents a barrier that migrating cells need to degrade in order to leave the primary tumor and invade further tissues. Agent-based frameworks, such as PhysiCell, are often used to represent the spatial dynamics of tumor evolution. However, they typically only implement cells as agents, which are represented by either a circle (2D) or a sphere (3D). In order to accurately represent the extracellular matrix as a network of fibres, we require a new type of agent represented by a segment (2D) or a cylinder (3D). In this article, we present PhysiMeSS, an addon of PhysiCell, which introduces a new type of agent to describe fibres and their physical interactions with cells and other fibres. The PhysiMeSS implementation is publicly available at https://github.com/PhysiMeSS/PhysiMeSS, as well as in the official PhysiCell repository. We also provide simple examples to describe the extended possibilities of this new framework. We hope that this tool will serve to tackle important biological questions such as diseases linked to dysregulation of the extracellular matrix, or the processes leading to cancer metastasis.
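      To make the new agent type concrete, here is a minimal conceptual sketch in Python of a rod-shaped fibre defined by a centre, a unit direction, and a length (PhysiCell itself is written in C++; this illustrates the data structure only and is not PhysiMeSS code):

        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class Fibre:
            """A rod-shaped agent: a segment in 2D, or the axis of a cylinder in 3D."""
            center: np.ndarray      # e.g. np.array([x, y]) or np.array([x, y, z])
            direction: np.ndarray   # unit vector along the fibre axis
            length: float
            radius: float = 0.5     # cross-section, relevant for the 3D cylinder

            def endpoints(self):
                half = 0.5 * self.length * self.direction
                return self.center - half, self.center + half

        f = Fibre(np.array([0.0, 0.0]), np.array([1.0, 0.0]), 10.0)
        print(f.endpoints())        # endpoints at (-5, 0) and (5, 0)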

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.136), and has published the reviews under the same license. It is also part of GigaByte’s PhysiCell Ecosystem series for tools that utilise or build upon the PhysiCell platform: https://doi.org/10.46471/GIGABYTE_SERIES_0003 These reviews are as follows.

      Reviewer 1. Erika Tsingos

      One important aspect that the authors need to be aware of and mention explicitly is that their algorithm for fibre set-up leads to differences in fibre concentration and orientation at the boundary, because fibres that are not wholly contained in the simulation box are discarded. The effect of this choice can be seen upon close inspection of Figure 2: in the left panel, fibres align tangentially to the boundary, so locally the orientation is not isotropic. Similarly, in the middle and right panels of Figure 2, the left and right boundaries have lower local fibre concentration. This issue could potentially affect the outcome of a simulation, so it is important that readers are made aware, so that if necessary they can address this with a modified algorithm.

      Minor comments:
      - In the abstract, the phrasing implies agent-based frameworks are only used for tumour evolution. I would rephrase such that it is clear that tumour evolution is one example among many possible applications.
      - I suggest adding a dash to improve readability in the following sentence in the introduction: "However, we note that the applications of PhysiMeSS stretch beyond those wanting to model the ECM -- as the new cylindrical/rod-shaped agents could be used to model blood vessel segments or indeed create obstacles within the domain."
      - In the implementation section, add a short sentence to clarify whether PhysiMeSS is "backwards compatible" with older PhysiCell models that do not use the fibre agent.
      - Notation in equations: a single vertical line is absolute value, and two vertical lines is the Euclidean norm? The explanation of Equation 1 implies that the threshold v_{max} should limit the parallel force, but the text does not explicitly say whether ||v|| is restricted to be less than or equal to v_{max}. Is that the case?
      - In Equation 2, I don't see the need to square the terms in parentheses. If |v*l_f| is an absolute value, it is always positive. Since l_f is normalized, the value of the dot product is only between 0 and the magnitude of v. Am I missing something? (See the worked decomposition after these comments.)
      - Are p_x and p_y in the moment arm magnitude coordinates with respect to the fibre center?
      - Table 2: it would be helpful to have a separate column with the corresponding symbols used throughout the text and equations.
      - Figures 5/6: missing crosslinker colour legend.

      Typos/grammar:
      - "As an aside, an not surprisingly," --> As an aside, and not surprisingly,
      - "This may either be because as a cell tries to migrate through the domain fibres which act as obstacles in the cell's path," --> remove the word "which"
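      For concreteness, the decomposition at issue in the Equation 2 comment, assuming v is the cell's velocity and l_f a unit vector along the fibre (an illustrative reading of the notation, not the paper's own derivation), is:

      \[
      \mathbf{v}_{\parallel} = (\mathbf{v} \cdot \mathbf{l}_f)\,\mathbf{l}_f,
      \qquad
      \mathbf{v}_{\perp} = \mathbf{v} - \mathbf{v}_{\parallel},
      \qquad
      0 \le |\mathbf{v} \cdot \mathbf{l}_f| \le \lVert \mathbf{v} \rVert,
      \]

      so the magnitude of the parallel component is already non-negative and bounded by the speed, without any squaring.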

      Reviewer 2. Jinseok Park

      Noel et al. introduce PhysiMeSS, a new PhysiCell addon for ECM remodeling. This new addon is a powerful tool to simulate ECM remodeling and has the potential to be applied to mechanobiology research, which makes my enthusiasm high. I would like to give a few suggestions.
      1) Basically, it is an addon of PhysiCell, so I suggest describing PhysiCell and how to add the addon for readers who are not familiar with these tools. Also, screen captures of tool manipulation would be very helpful.
      2) Figures 2 and 3 exhibit the outcome of the addon showing ECM remodeling. I would suggest showing actual ECM images modeled by the addon.
      3) The equations reflect four interactions, and in my understanding the authors describe cell-fibre, fibre-cell, and fibre-fibre interactions. I suggest generating an example corresponding to the modulation of each interaction and explaining how the addon results explain the physiological phenomena. For instance, focal adhesion may be a key modulator of cell-fibre or fibre-cell interaction, presumably alpha or beta fibre. I would demonstrate how the different parameters generate different results and explain the physiological situation modeled by the results.
      4) Similarly, Figures 5 and 6 only show one example and no comparison with other conditions. For example, it would be better to exhibit no-pressure/pressure conditions. This may help readers estimate how pressure impacts cell proliferation.

      Reviewer 3. Simon Syga

      The presented paper "PhysiMeSS - A New PhysiCell Addon for Extracellular Matrix Modelling" is a useful extension to the popular simulation framework PhysiCell. It enables the simulation of cell populations interacting with the extracellular matrix, which is represented by a set of line segments (2D) or cylinders (3D). These represent a new kind of agent in the simulation framework. The paper outlines the basic implementation, properties and interactions of these agents. I recommend publication after a small set of minor issues have been addressed. Please refer to the attached marked-up PDF file for these minor issues and suggestions. https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTUwL2d4LVRSLTE3MTk5NDYwNjlfU1kucGRm

  14. Sep 2024
    1. Editors Assessment:

      This Data Release paper presents the genome of the whippet breed of dog, demonstrating streamlined laboratory and bioinformatics workflows with PacBio HiFi long-read whole-genome sequencing that enable the generation of a high-quality reference genome within one week. The genome study was a collaboration between an academic biodiversity institute and a medical diagnostic company, and the presented way of working and workflow provide examples that can be used for a wide range of future human and non-human genome projects. The final 2.47 Gbp assembly is of high quality, with a contig N50 of 55 Mbp and a scaffold N50 of 65.7 Mbp. The reference was scaffolded into 39 chromosome-length scaffolds, and the annotation resulted in 28,383 transcripts. The results also looked at the Myostatin gene, which can be used for breeding purposes, as heterozygous animals can have an advantage in dog races; the reviewers made the authors clarify this part with additional results. Overall this study demonstrates how rapidly animal genome research can be carried out through close and streamlined time management and collaboration.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Background: The time required for sequencing and de novo assembly of genomes is highly dependent on the interaction between laboratory work, sequencing capacity, and the bioinformatics workflow. As a result, genome projects are often not only limited by financial, computational and sequencing platform resources, but also delayed by second party sequencing service providers. By bringing together academic biodiversity institutes and a medical diagnostics company with extensive sequencing capabilities and know-how, we aimed at generating a high-quality mammalian de novo genome in the shortest possible time period. Therefore, we streamlined all processes involved and chose a very fast dog as a model: the Whippet. Findings: We present the first chromosome-level genome assembly of the Whippet. We used PacBio long-read HiFi sequencing and reference-guided scaffolding to generate a high-quality genome assembly. The final assembly has a contig N50 of 55 Mbp and a scaffold N50 of 65.7 Mbp. The total assembly length is 2.47 Gbp, of which 2.43 Gbp were scaffolded into 39 chromosome-length scaffolds. In addition, we used available mammalian genomes and transcriptome data to annotate the genome assembly. The annotation resulted in 28,383 transcripts resembling a total of 90.9% complete BUSCO genes and identified a repeat content of 36.5%. Conclusions: Sequencing, assembling, and scaffolding the chromosome-level genome of the Whippet took less than a week and adds a high-quality reference genome to the list of domestic dog breeds sequenced to date.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.134), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Tianming Lan

      The authors provided an example of High-speed strategy for whole-genome sequencing, genome assembly and annotation for species and take an example with the Whippet dog. This is a very novel idea under the genomic era with plummeting sequencing cost, fast accumulated sequencing data but shortage of computing resources. The authors also provide a very high-quality reference genome for the Whippet dog species with very good contiguity, accuracy and completeness. However, I have several concerns need the authors to further consider before it could be published at the journal of GigaByte.

      Q1. There are too many keywords. Can the authors reduce a few? Biodiversity conservation, comparative genomics, and evolutionary biology do not make sense for this manuscript.
      Q2. The authors performed reference-guided scaffolding analysis with the German Shepherd dog genome (GCA_011100685.1) as reference. It would be better if the authors explained why they selected this genome as the reference, as there are several published dog genomes.
      Q3. The part on heterozygosity makes no sense in this manuscript unless there is a reasonable connection with other parts, because the dog is not a threatened species and also not a very special breed facing extensive inbreeding and accumulation of deleterious mutations.
      Q4. The part on Myostatin doesn't make sense to me, as I have read the paper the authors cited and found that not all Whippets have this mutation. They sequenced 22 individuals, and 4 individuals are homozygous (-/-), 5 are heterozygous (mh/+), and the rest are homozygous (+/+). So you can always get a result by checking this mutation, but it makes no sense. Furthermore, one individual can hardly represent a species or a population. At the beginning of this paragraph, please change "Since" to "Since".
      Q5. I think the most important finding in this manuscript is how the authors finished a high-quality genome within a very short period of work. I suggest the authors remove the descriptions of heterozygosity and Myostatin, but add a paragraph to tell readers the basic needs or standards for such short-term genome assembly work for a genome of something like a dog. Just a suggestion, but I think it would improve the manuscript.

      Reviewer 2. Xiaobo Wang

      This study outlines an approach to expedite the sequencing and de novo assembly of genomes by leveraging collaboration between academic biodiversity institutes and a medical diagnostics company with advanced sequencing capabilities. The primary focus was on generating a high-quality de novo genome of the Whippet, a fast dog breed, within an accelerated timeframe. Below are some specific comments I would like to highlight.

      1. The authors mentioned the use of QUAST and QualiMap software tools to assess the genome of the Whippet; however, the corresponding results were not presented in the manuscript.
      2. The authors' reliance solely on mammalian protein sequences for homology annotation means that unique genes specific to the Whippet remain unannotated. The discrepancy of approximately 7% between the completeness assessments of the gene set and the genome via BUSCO further underscores the incomplete nature of the gene set. To address this, I recommend integrating transcriptome data, at the very least, to incorporate de novo annotation results. This addition should enhance the comprehensiveness and accuracy of gene annotations for the Whippet genome.
      3. The authors claim the absence of reported mutations in the Mstn gene but have not provided corroborating evidence, such as read alignment results from the genomic region, to verify that this is not due to assembly errors.
      4. If feasible, I propose integrating second-generation sequencing to further polish the genome and elevate its quality.
    1. Editors Assessment:

      This new software paper presents RiboSnake, a validated, automated, reproducible analysis pipeline implemented in the popular Snakemake workflow management system for microbiome analysis. Analysing 16S rRNA gene amplicon sequencing data, it uses the widely used QIIME2 tool as the basis of the workflow, as it offers a wide range of functionality. Users of QIIME2 can be overwhelmed by the number of options at their disposal, and this workflow provides a fully automated and fully reproducible pipeline that can be easily installed and maintained. It provides easy-to-navigate output accessible to non-bioinformatics experts, alongside sets of already validated parameters for different types of samples. Reviewers requested some clarification on testing, worked examples and documentation, and this was improved to produce a convincingly easy-to-use workflow, hopefully opening up an already very established technique to a new group of users and assisting them with reproducible science.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Background: Next-generation sequencing for assaying microbial communities has become a standard technique in recent years. However, the initial investment required into in-silico analytics is still quite significant, especially for facilities not focused on bioinformatics. With the rapid decline in costs and growing adoption of sequencing-based methods in a number of fields, validated, fully automated, reproducible and yet flexible pipelines will play a greater role in various scientific fields in the future. Results: We present RiboSnake, a validated, automated, reproducible QIIME2-based analysis pipeline implemented in Snakemake for the computational analysis of 16S rRNA gene amplicon sequencing data. The pipeline comes with pre-packaged validated parameter sets, optimized for different sample types. The sets range from complex environmental samples to patient data. The configuration packages can be easily adapted and shared, requiring minimal user input. Conclusion: RiboSnake is a new alternative for researchers employing 16S rRNA gene amplicon sequencing and looking for a customizable and yet user-friendly pipeline for microbiome analysis with in-vitro validated settings. The complete analysis generated with a fully automated pipeline based on validated parameter sets for different sample types is a significant improvement to existing methods. The workflow repository can be found on GitHub (https://github.com/IKIM-Essen/RiboSnake).

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.132), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Michael Hall

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Unable to test. The README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.

      Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?

      The Usage instructions say to create a new repository using ribosnake as a template, but ribosnake is not a template repository (see https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template). The README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      Not applicable.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

      Yes, though as mentioned above, the README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.

      Additional Comments:

      The Introduction could be made far more concise; there's a lot of repetition.

      The installation command in Figure 1 is three commands, not two as stated in the text (third-last paragraph of the Introduction), and is slightly misleading from an installation point of view, as it assumes conda and snakemake are installed. Though it is mentioned later in the text (p5) that snakemake and conda require manual installation.

      The in-text citation for Greengenes2 is just [?] - maybe a LaTeX issue?

      The last paragraph of the 'Features and Implementations' section was mostly already stated earlier in the manuscript.

      Make the colouring consistent between fig 2a-c and 2d as well as the vertical ordering to make for easier comparison. For example, in figures 2a-c Enterococcus (grey) is on the bottom, whereas in fig 2d it is red and in the middle. Colour legends should also be added to Figures 3-5 to match Fig 2.

      A small table should be added showing the comparison of RiboSnake and the original publication for the top 10 most abundant phyla for the Atacama soil dataset and their abundances (see the last paragraph of 'Usage and Findings').

      Reviewer 2. Yong-Xin Liu and Salsabeel Yousuf

The manuscript presented by the authors describes a comprehensive study of the "RiboSnake pipeline" for 16S rRNA gene microbiome analysis, which is user-friendly, robust, and multipurpose. RiboSnake, a validated, automated, reproducible QIIME2-based analysis pipeline implemented in Snakemake, offers parallel processing for efficient analysis of large datasets in both environmental and medical research contexts. Demonstrating its effectiveness, the pipeline analyzes human-associated microbiomes as well as environmental samples like wastewater and soil, thus expanding the scope of analysis for 16S rRNA data. The overall computational pipeline is useful and the results are sound, validated through rigorous testing on MOCK communities and real-world datasets. However, there are some issues in the manuscript that should be improved.

Major comments:

1. In the clinical data section the author mentions rectal swabs were used from a published study [31]. While the source is referenced, it would be helpful to know if any information was provided in the referenced study regarding the collection methods or storage conditions for the rectal swabs.
2. The text mentions using cotton swabs pre-moistened with TE buffer + 0.5% Tween 20. While cotton swabs are common, are there any considerations for using different swab materials depending on the target analytes or sampling surface (e.g., flocked swabs for better epithelial cell collection)?
3. Does RiboSnake require user intervention during any steps, or is it fully automated?
4. The author mentions that contamination filtering parameters should be adjusted based on the sample type. How can users determine the appropriate filtering parameters for their specific samples? Are there guidelines for users to know how much adjustment is needed for specific scenarios?
5. The default abundance threshold for filtering low-frequency reads is chosen based on Nearing et al. [44]. Please discuss the rationale behind using a single threshold for all sample types (see the sketch after this list). Would it be beneficial to allow users to define this threshold based on their data characteristics?
6. Could you explain the limitations of RiboSnake, such as specific types of samples it may not be suitable for, or potential biases introduced by certain functionalities?
7. The manuscript mentions various visualization tools used throughout the pipeline (QIIME2, qurro). Please clarify which types of data are visualized with each tool, and how users can access or customize these visualizations.
8. To strengthen the manuscript's impact, consider discussing the specific novelty of RiboSnake compared to existing 16S rRNA gene microbiome analysis pipelines. Would you be able to elaborate on the unique features or functionalities of RiboSnake that address limitations of current methods?
9. EasyAmplicon is a recently published pipeline that is easy to use on Windows, macOS, and Linux systems.
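On major comment 5, the kind of filter at issue can be sketched in a few lines of pandas; the table, feature names, and the 0.01 cutoff below are hypothetical illustrations of the general technique, not RiboSnake's actual defaults or code:

```python
import pandas as pd

# Hypothetical feature-by-sample count table.
counts = pd.DataFrame(
    {"sample_A": [950, 40, 10], "sample_B": [900, 85, 15]},
    index=["ASV_1", "ASV_2", "ASV_3"],
)

# Convert counts to per-sample relative abundances.
rel_abundance = counts / counts.sum(axis=0)

# A single global cutoff applied to every sample type; a user-defined,
# per-sample-type threshold would replace this constant.
cutoff = 0.01
kept = counts[(rel_abundance > cutoff).any(axis=1)]
print(kept.index.tolist())  # features retained for downstream analysis
```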

Minor comments:

1. A reference is missing in this sentence: "The default is the SILVA database [47]. Greengenes2 [?] can be used alternatively".
2. The author should be careful about lowercase and uppercase throughout the manuscript. Please check the following, for reference: "...the 2017 published Atacama Soil data set with samples taken from the Atacama desert was used [32] as well as samples collected from soil under switchgrass published in [33]."; "...based on an Euclidean beta diversity metric, shows that the positive controls, as well as the samples taken from subjects 1 and 3 (S1 and S3), cluster together."; "A wide range of diversity analysis parameters are available in QIIME2 and its associated tools. These include the Shannon diversity index to measure richness, the Pielou index to measure evenness, or perform standard correlation analysis using Pearson or Spearman indices, among others."
3. In the introduction, the sentence "However, while these methods enable 16S rRNA analysis with minimal user interaction…" needs attention for clarity. Consider separating it into two sentences to emphasize the limitations of existing pipelines compared to the described methods. Alternatively, using contrasting words like "in contrast" could highlight these differences.
4. More detail in the attached PDF.

https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTM5L2d4LVRSLTE3MTY5Nzk4MTktcmV2aXNlZC5wZGY=

Re-review: The authors' response has fully addressed my concerns. The quality of the paper has noticeably improved. I agree with the publication of this article.

1. Abstract: As single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 3: Georgios Fotakis

1) General comments In this manuscript the authors present TooManyCellsInteractive (TMCI), a browser-based TypeScript graphical user interface for the visualization and interactive exploration of single-cell data. TMCI facilitates the visualization of single-cell data by representing it as a radial tree of nested cell clusters. It relies on TooManyCells, a suite of tools designed for multi-resolution and multifaceted exploration of single-cell clades based on a matrix-free divisive hierarchical spectral clustering method. A key advantage of TMCI lies in its capability to provide a quantitative depiction of relationships among clusters, allowing for the delineation of context-dependent rare and abundant cell populations, as showcased in the original publication [1] and in the present manuscript. TMCI extends the capabilities of TMC significantly, notably enhancing computational performance, particularly in scenarios where multiple features are overlaid (an improvement attributed to the persistence provided by the PostgreSQL database).

      A notable aspect of this manuscript is the fact that the authors performed a benchmark using publicly available scRNAseq datasets. This benchmark highlights TMCI's superior performance over TMC and its comparable performance to two other commonly utilized tools (Cirrocumulus and CELLxGENE). Moreover, the authors showcase TMCI's applicability through aggregating publicly available scRNAseq data. Here, they successfully delineate sub-populations of cancer drug-tolerant persister cells by employing minimum distance search pruning, enhancing the visibility of small sub-populations. Additionally, the authors note an increase in ID2 gene expression among persister-cell populations, as well as the enrichment of unique biological programs between short- and long-term persister-cell populations. Furthermore, they observe an upregulation of the diapause gene signature across all treated sub-populations. The biological insights the authors glean are novel and highly intriguing. In general, this manuscript is well written, with the authors offering comprehensive documentation that covers the essential steps for installing and running TMCI through their GitHub repository. Additionally, they provide a minimal dataset as an example for users. However, there are a few minor adjustments that, once implemented, would enhance the manuscript's value by improving clarity and providing valuable insights to the field.

2) Specific comments for revision a) Major - As stated in the manuscript's abstract, visualising large cell populations from single-cell atlases poses greater challenges and demands compute-intensive processes. One of my major concerns revolves around TMCI's scalability when handling large datasets. The authors conducted benchmarking on relatively modest datasets (ranging from 18,859 to 54,220 cells). Based on the data provided in Supplementary Table S3, while TMCI demonstrates comparable performance to CELLxGENE on the Tabula Muris dataset and its subset (with mean memory consumption differences ranging from 870 MB to 1.8 GB), the disparity significantly increases when loading and rendering visualizations of the larger dataset, reaching 8.5 GB of RAM. It would be of great interest if the authors conducted a similar benchmark using a larger dataset to elucidate how TMCI scales with increased cell numbers, especially considering the trend in the field towards single-cell atlases and the availability of datasets consisting of up to millions of cells (like the Tabula Sapiens [2] dataset or similar [3, 4]).

      • In the "Results" section, under the title "TMCI identifies sub-populations with highly expressed diapause programs," the authors assert that "the significantly different sub-populations were more easily seen in TMCI's tree". Since perception can be subjective (for instance, a user more accustomed to UMAP plots may find it challenging to interpret a tree representation), it would be beneficial for the authors to allocate a section of the supplementary material to demonstrate the clarity advantages of TMCI's tree visualization. One approach could involve a side-by-side comparison of visualizations generated by TMCI and CELLxGENE using the same color scheme. For instance, Figure 4b could be compared with Supplementary Figure S1g, Figure 4j with Supplementary Figure S1h, and so forth.

      • The "Discussion" section overlooks the future prospects of TMCI. As demonstrated in the case study, TMCI exhibits potential beyond serving as a visualization tool for identifying tree-based relationships in single-cell data. Are there any plans for integrating analytical functionalities to provide insights into cellular compositions and underlying biology, such as marker gene identification, differential gene expression analysis, and gene set enrichment analysis? In the future, could TMCI support the visualization of such results using methods like violin plots, heatmaps, and others?

      • In the "Materials and Methods'' section, the authors outline the process of aggregating the scRNAseq datasets used for the case study, including filtering and normalization steps. However, scRNAseq technologies are prone to significant noise resulting from amplification and dropout events. Additionally, when integrating different scRNAseq datasets, users need to consider potential batch effects. Did the authors employ any de-noising or batch correction methods? If not, what was the rationale behind this decision? It would be intriguing to observe any potential differences in the results following the application of such methods.

      • Remaining within the "Materials and Methods" section, providing a brief description of the methods and tools utilized for the differential gene expression analysis, the GSEA (if not solely conducted through Metascape), and the packages utilized to generate the plots in Figures 3 and 4 would enhance clarity and facilitate reproducibility.

      • Figure 4 - b: Distinguishing between the various cell lines on the partitioned nodes based on the current color coding—particularly for the MDA-MB-231 and PC9 cell lines, as well as between the treated and untreated populations of the SK-MEL-28 cell line—is quite challenging. Employing a different color scheme would significantly enhance clarity, making the different cell populations more distinguishable.

      • Figure 4 - d and k: The authors should add statistics as relying solely on the box and whisker plots makes it challenging to ascertain whether there is a significant difference between the conditions. For instance, it appears that ID2 is over-expressed between the control and treated population only in the SK-MEL-28 cell line.

      b) Minor - In the "Results" section, under the title "TMCI reduces time to display trees," the authors state: "these benchmarks indicate not only the superior performance of TMCI to generate static and interactive tree of single-cell data compared to other tools…". However, based on the results presented in the manuscript and the supplementary material, it seems that TMCI may not be outperforming alternative interactive visualization methods. This phrase should be revised to accurately reflect the benchmark results.

      References 1. Schwartz GW, Zhou Y, Petrovic J, Fasolino M, et al. TooManyCells identifies and visualizes relationships of single-cell clades. Nat Methods 2020;17(4):405-413. PMID: 32123397 2. The Tabula Sapiens Consortium, The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376, eabl4896. DOI:10.1126/science.abl4896 3. Sikkema L, Ramírez-Suástegui C, Strobl DC, et al. An integrated cell atlas of the lung in health and disease. Nat Med 2023;29, 1563-1577. DOI:10.1038/s41591-023-02327-2 4. Salcher S, Sturm G, Horvath L, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer cell 2022;40(12):1503-1520.E8. DOI:10.1016/j.ccell.2022.10.008

2. Abstract: As single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 2: Mehmet Tekman

      PAPER: TOOMANYCELLSINTERACTIVE REVIEW


      Table of Contents


1. Using the Application
   1. Positive Notes
      1. General UI and Execution
   2. Negative Notes
      1. Controls
      2. Documentation
      3. Feature Overlays
2. Docker / PostgreSQL
3. Ethos of the Introduction

      The manuscript reads very well, and the quality of the language is good.

This review tests the application itself, and makes some comments about ambiguous wording in the introduction.

      1 Using the Application

      I tested the Interactive Display at https://tmci.schwartzlab.ca/

1.1 Positive Notes

      1.1.1 General UI and Execution

      The general interactivity of the UI was very impressive and expressive. I liked that every aspect including the pies and the lines themselves could be coloured and scaled.

      I found the feature overlays and pruning history stack very intuitive, as well as rolling back the history on each state change.

The choice of D3 was a good one, enabling very pleasing enter/exit/update state animations, as well as ease of SVG export.

The inclusion of a command-line script (`generate-svg.sh') for rendering without a browser is very useful.

1.2 Negative Notes

      1.2.1 Controls

At first I wasn't able to find the controls, despite having the page open at 1330px wide, but then I realised I had to scroll down outside of the SVG container to find them.

As mentioned in a recently opened PR, there's a CSS media rule `@media only screen and (min-width:1238px)' in effect that looks strange on my Firefox 122 on Linux. Maybe better media rules for screens in the 700-900px wide range might be useful, as well as making separate rules for smartphones.

      1.2.2 Documentation

TypeScript is a good language to develop in, and lends itself naturally to documentation, though I did notice a distinct lack of documentation above many functions in the code base.

      Perhaps write a bit more documentation to make the code base accessible to new collaborators?

      Otherwise, the quality of code looked good, and the license was GPLv3 which is always welcome.

1.2.3 Feature Overlays

      I found the feature overlays super useful, though limited by the number of colours. These appear to be limited to one colour for all genes.

      Very useful for showing multiple genes, but it would be nice to have the ability to colour the expression of different genes with different colours, at least for < 3 genes of interest (due to the difficult colour mixing constraints).
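To illustrate the kind of multi-colour overlay being suggested, here is a minimal NumPy sketch (a hypothetical approach for up to three genes, not TMCI's code) that maps each gene of interest onto its own RGB channel:

```python
import numpy as np

def expression_to_rgb(expr_matrix):
    """Map up to three per-cell expression vectors onto separate R, G, B
    channels, so each gene of interest gets its own colour axis.
    expr_matrix: hypothetical (n_genes <= 3, n_cells) array."""
    expr = np.asarray(expr_matrix, dtype=float)[:3]
    span = np.ptp(expr, axis=1, keepdims=True)
    span[span == 0] = 1.0              # guard against flat expression vectors
    norm = (expr - expr.min(axis=1, keepdims=True)) / span
    rgb = np.zeros((expr.shape[1], 3))
    rgb[:, : expr.shape[0]] = norm.T   # gene 0 -> red, 1 -> green, 2 -> blue
    return rgb

# Two genes across four cells: each cell ends up as a red/green blend.
print(expression_to_rgb([[0.0, 1.0, 2.0, 4.0], [3.0, 0.0, 3.0, 1.5]]))
```

Beyond three genes this scheme breaks down, which is exactly the colour-mixing constraint noted above.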

2 Docker / PostgreSQL

It is not clear to me what the Node server and PostgreSQL database run in the Docker container are actually doing, other than fetching cell metadata and marking user subsets from pruning actions.

Could this not have been implemented in JavaScript (e.g. IndexedDB)? Why does the data need to be hosted, if it's the user loading it from their own machine anyway? Is the idea that the visualization should be shared by multiple users who will be accessing the same dataset?

      If this is a single-user analysis, then why not keep all the computation and retrieval on the client-side?

The reason I'm asking this is because I believe that by keeping the database operations within JavaScript, you could run the system within a single Conda environment, or even with a pure Node lockfile.

I can understand needing Docker for development purposes, but requiring it to actually run the software itself seems excessive. Is it not possible to separate the client and server into Conda? That way, one could then include the visualisation (as the end stage) in bioinformatic pipelines.

      3 Ethos of the Introduction

      This is a small wording complaint in the Introduction section.

TooManyCellsInteractive (TMCI) presents itself as a solution to the conventional scRNA-seq workflows that prepare the data via the usual data → PCA → UMAP → kNN → clustering stages.

TMCI hints that it is an alternative solution to this workflow, but from what I can see in the documentation, it appears to require a `cluster_tree.json' file, one that is produced only by the TooManyCells (TMC) pipeline.

      Unless I've misunderstood, it's not accurate to say that TMCI is an alternative to these conventional workflows, but that TMC is.

      TMCI simply consumes the files output by TMC and renders them. If what I'm saying is true, then the introduction should reflect that.

3. Abstract: As single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 1: Qingnan Liang

Klamann et al. report a tool for single-cell data visualization, TMCI, which builds on the previous TMC method. It is good to see such continued work on and maintenance of the method, and I agree TMCI has the potential to promote the application of TMC. The manuscript is generally well written, and it fits well within the scope of GigaScience. TMCI is publicly available with reasonably detailed tutorials. At several points in this manuscript, however, the elaboration does not provide sufficient details or rationales. I suggest revision/clarification as below before recommending publication.

1. Does TMCI provide an interface with one or more popular single-cell frameworks, such as SingleCellExperiment, Seurat, or Scanpy? A TMCI user would probably use one of these frameworks to do other parts of the analysis.
2. Is batch effect considered in the drug-treated data example? More generally, if a user wants to use TMCI with multiple datasets, what would be the recommended approach for batch effect? Also, we know cell cycle is a factor that is usually 'regressed out' for single-cell analysis. Does TMC/TMCI consider this?
      3. "To normalize cells between data sets, we used term frequency-inverse document frequency to weigh genes such that more frequent genes across cells had less impact on downstream clustering analyses" We know TF-IDF is becoming a common practice in scATAC-seq analysis. Is this TF-IDF approach common for tree construction (or hierarchical clustering) with high dimensional data? Is this recommended for all users with scRNA-seq data?
      4. Figure 4C is not very easy to read. It may be helpful to label/highlight the comparison pairs to make the point.
5. Also, it is not sufficiently emphasized how TMCI helped find this ID2 target, or how such visualization would trigger interesting downstream approaches. I guess the power of this tree approach is somehow similar to the increasingly popular 'metacell' approach, which combines similar cells into 'cell states'. Thus it makes an interesting midpoint between 'single-cell' and 'pseudo-bulk'. It would really be helpful to see that some states (nodes), although similarly treated, behave differently than others, if there are such cases (not sure if cell lines have such heterogeneity). Similar comments apply to the pathway analysis part.
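To make the TF-IDF question in comment 3 concrete, here is a minimal NumPy sketch of one common TF-IDF variant applied to a cell-by-gene count matrix; it illustrates the general technique only and is not the authors' implementation:

```python
import numpy as np

def tfidf(counts):
    """counts: hypothetical (n_cells, n_genes) raw count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Term frequency: gene counts normalised within each cell.
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Inverse document frequency: genes detected in many cells are
    # down-weighted, so ubiquitous genes drive the clustering less.
    n_cells = counts.shape[0]
    idf = np.log1p(n_cells / (1 + (counts > 0).sum(axis=0)))
    return tf * idf

print(tfidf([[5, 0, 3], [2, 2, 0], [4, 1, 1]]).round(3))
```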
  15. Aug 2024
1. Abstract: Background: MOB typing is a classification scheme that classifies plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB typing is crucial for investigating the mobilization of plasmids, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristic of metagenomic contigs. Results: We developed MOBFinder, an 11-class classifier to classify the plasmid fragments into 10 MOB categories and a non-mobilizable category. We first performed the MOB typing for classifying complete plasmid genomes using the relaxase information, and constructed the artificial benchmark plasmid metagenomic fragments from these complete plasmid genomes whose MOB types are well annotated. Based on natural language models, we used the word vector to characterize the plasmid fragments. Several random forest classification models were trained and integrated for predicting plasmid fragments with different lengths. Evaluating the tool over the benchmark dataset, MOBFinder demonstrates higher performance compared to the existing tool, with an overall accuracy approximately 59% higher than that of MOB-suite. Moreover, the balanced accuracy, harmonic mean and F1-score could reach 99% in some MOB types. In an application focused on a T2D cohort, MOBFinder offered insights suggesting that the MOBF type might accelerate the antibiotic resistance transmission in patients suffering from T2D. Conclusions: To the best of our knowledge, MOBFinder is the first tool for MOB typing of plasmid metagenomic fragments. MOBFinder is freely available at https://github.com/FengTaoSMU/MOBFinder.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae047), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

Reviewer 1: Haruo Suzuki

      I recommend that the authors consider revising based on the following points.

      1. the unpaired Wilcoxon signed-rank two-sided test. -> should be corrected to either "Wilcoxon rank-sum test" or "Mann-Whitney U test"

      https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test "Wilcoxon rank-sum test" redirects here. For Wilcoxon signed-rank test, see Wilcoxon signed-rank test. https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test Not to be confused with Wilcoxon rank-sum test.
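To illustrate the distinction the reviewer is drawing, a minimal SciPy sketch (illustrative random data, not from the manuscript) contrasts the unpaired rank-sum test with the paired signed-rank test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 30)  # e.g. values from group 1 (independent samples)
y = rng.normal(0.5, 1.0, 30)  # e.g. values from group 2

# Unpaired groups: Wilcoxon rank-sum / Mann-Whitney U test.
print(stats.mannwhitneyu(x, y, alternative="two-sided"))

# Paired measurements only: Wilcoxon signed-rank test on the differences.
print(stats.wilcoxon(x, y))
```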

2. "Since MOBscan can only predict the MOB type with plasmid proteins, we annotated the plasmids in the test set with Prokka, then manually submitted them to the MOBscan website for MOB type annotation. Given that MOBScan operates as an online tool and cannot be executed locally, the calculation of MOBScan's run time was confined to the duration spent on preprocessing with Prokka locally." (Please refer to Line 313-319 in the revised manuscript.)

      -> Actually, it can be executed locally using the scripts included in https://github.com/santirdnd/COPLA/. It may not be necessary to run MOBscan locally (it may be okay that they manually submitted them to the MOBscan website), but I'll inform you regardless.

3. In the comparison, it was observed that MOBscan did not perform well, achieving low accuracy and kappa values across sequences of varying lengths, while MOB-suite exhibited marginally better performance than MOBscan when handling sequences of greater length (Figure 3A, 3B). (Please refer to Line 418-421 in the revised manuscript.)

      -> Do the authors' results contradict the following general expectation? MOB-typer utilizes BLAST, whereas MOBscan utilizes hmmscan, and therefore, MOBscan is expected to retrieve more distantly related proteins than MOB-typer.

4. "MOB-suit and MOBscan are represented by blue lines, orange lines and gray lines respectively." -> should be "MOB-suite"

5. I suggest receiving English language editing before publishing the paper. "For the MOB typing, MOBscan [18] uses the HMMER model to annotated the relaxases and further perform MOB typing." -> should be "For the MOB typing, MOBscan [18] uses the HMMER model to annotate the relaxases and further perform MOB typing."

2. Abstract: Background: MOB typing is a classification scheme that classifies plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB typing is crucial for investigating the mobilization of plasmids, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristic of metagenomic contigs. Results: We developed MOBFinder, an 11-class classifier to classify the plasmid fragments into 10 MOB categories and a non-mobilizable category. We first performed the MOB typing for classifying complete plasmid genomes using the relaxase information, and constructed the artificial benchmark plasmid metagenomic fragments from these complete plasmid genomes whose MOB types are well annotated. Based on natural language models, we used the word vector to characterize the plasmid fragments. Several random forest classification models were trained and integrated for predicting plasmid fragments with different lengths. Evaluating the tool over the benchmark dataset, MOBFinder demonstrates higher performance compared to the existing tool, with an overall accuracy approximately 59% higher than that of MOB-suite. Moreover, the balanced accuracy, harmonic mean and F1-score could reach 99% in some MOB types. In an application focused on a T2D cohort, MOBFinder offered insights suggesting that the MOBF type might accelerate the antibiotic resistance transmission in patients suffering from T2D. Conclusions: To the best of our knowledge, MOBFinder is the first tool for MOB typing of plasmid metagenomic fragments. MOBFinder is freely available at https://github.com/FengTaoSMU/MOBFinder.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae047), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 2: Dan Wang

The manuscript provides a comprehensive background on the necessity and challenges of MOB typing in the context of plasmid genomics and its significance in tracking the transmission of resistance genes and virulence factors. The innovation introduced by MOBFinder, which incorporates an 11-class classification system, addresses a critical gap in current research methodologies by enhancing the precision of plasmid fragment classification.

Key Strengths:

- Innovation: MOBFinder represents a novel approach in the typing of metagenomic plasmid fragments using word vector characterization combined with machine learning techniques.
- Methodological Rigor: The methodological approach, including the use of random forest models and the construction of a benchmark dataset from annotated complete plasmid genomes, is robust and well-executed.
- Performance: The tool demonstrates superior performance compared to existing tools like MOBscan and MOB-suite, providing a significant improvement in accuracy.
- Impact on Field: The application of MOBFinder in a T2D cohort illustrates the tool's practical utility and its potential to influence antibiotic resistance studies.

Recommendation: Given the thorough revisions and the contributions this manuscript offers to the field of microbial genomics and antibiotic resistance, I recommend that the manuscript be accepted for publication in GigaScience.

1. Abstract: A large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 3: Cyril Pernet

CAT has been around for a long time and is a well-maintained toolbox - the paper describes all the features and additionally provides tests/validations of those features. I have left a few comments on the PDF (uploaded), which I don't see as mandatory, and thus have 'accepted' the paper (and leave the authors to decide what to do with those comments). It provides a nice reference for the toolbox.

2. Abstract: A large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 2: Chris Foulon

      Overall, I think the CAT software provides valuable tools to analyse morphometric differences in the brain and promotes open science. The study shows the software's capabilities rather well. However, I think some clarifications would help the readers understand and evaluate the quality of the methods.

      Comments: Figure 2: Looking at the chart, I have a question regarding the pipeline. Is it required to run the whole pipeline using CAT? Or is it possible to input already registered data to start directly with the VBM analysis or further?

      Voxel-based Processing: The above question is quite important, seeing that the preprocessing uses rather old registration methods. The users might want to use more recent registration methods, especially with clinical populations.

      Spatial Registration and Figure 3: For the registration, how is the registration performing with clinical populations (e.g. stroke patients)? It can be significant for the applicability of the methods with specific disorders.

      Surface Registration and Figure 3: What type of noise is used to evaluate the accuracy? This can be important as not every noise can be modelled easily, and some noises are more or less pronounced depending on the modality.
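To make concrete why the noise model matters here: magnitude MR images exhibit Rician rather than additive Gaussian noise, and the two are easy to contrast in a short NumPy sketch (an illustration only; the paper's actual simulation settings are what the reviewer is asking about):

```python
import numpy as np

def add_rician_noise(image, sigma, seed=None):
    """Rician noise as it arises in magnitude MR images: independent
    Gaussian noise on the real and imaginary channels, followed by the
    magnitude operation (contrast with Gaussian noise added directly
    to the intensities)."""
    rng = np.random.default_rng(seed)
    real = image + rng.normal(0.0, sigma, image.shape)
    imag = rng.normal(0.0, sigma, image.shape)
    return np.hypot(real, imag)

# At low signal levels the Rician distribution is markedly non-Gaussian.
noisy = add_rician_noise(np.full((4, 4), 100.0), sigma=10.0, seed=1)
```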

      Maybe having the letters of the figure panels referred to in the text would help the reader.

      Performance of CAT: Although I see the advantage of using simulated data, I think it would require more explanation. First, what tells the reader the quality of this simulated data, and how does it compare to real data? Second, is it only healthy data? In that case, the accuracy evaluation might not be relevant for the majority of the clinical studies using CAT.

      Longitudinal Processing: Are VBM analyses sensitive enough to capture changes over days? I would be surprised, but I would be interested to see studies doing it (and the readers would also benefit from it, I reckon).

      Mapping onto the Cortical Surface: I am a bit confused about the interest in mapping functional or diffusion parameters to the surface. Do you have examples of articles doing that? It sounds like it would waste a lot of information from these parameters, but I am not familiar with this type of analysis. "Optionally, CAT also allows mapping of voxel values at multiple positions along the surface normal at each node". I do not understand this sentence; I think it should be clarified.

      Example application: Is there a way to come back from the surface space to the volume space to compare the results? For example, VBM and SBM should provide fairly similar results, but comparing them is difficult when they are not in the same space. Additionally, in the end, the surface representation is just that, a representation; most other analyses are still done on the volume space, so it could be helpful to translate the result on the surface back to the volume (if it is not already available).

      Evaluation of CAT12: I was confused with Supplemental Figure 1 as it is not mentioned in the caption that it is the AD data and not the simulated one. Maybe it would help the reader to mention it.

      Regarding the reliability of CAT12, it seems to capture more things, but I struggle to see how we can be sure that this is "better" than other methods; couldn't it be false positives?

      "those achieved based on manual tracing and demonstrated that both approaches produced comparable hippocampal volume." comparable volumes do not really mean the same accuracy; this sentence could be misleading.

      I think the multiple studies show that CAT12 is as valid as any other tool but I am not sure the argument that it is better is as solid. Of course, I understand that there is no ground truth for what a relevant morphological change is for a given disease.

      Methods: Statistical Analysis: Why is the FWER correction used for the voxel-wise statistics (which perform many comparisons) and FDR used on ROI-wise statistics (which perform much fewer comparisons)? I would expect the opposite.
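For readers weighing the reviewer's question, the two correction families can be contrasted with statsmodels in a few lines (made-up p-values; a sketch of the general distinction, not the paper's analysis):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.30])  # hypothetical p-values

# FWER control (Bonferroni): bounds the probability of even one false
# positive; very conservative when the number of tests is large
# (the voxel-wise setting).
reject_fwer, p_fwer, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# FDR control (Benjamini-Hochberg): tolerates a controlled *fraction* of
# false discoveries; the usual compromise for massive multiple testing.
reject_fdr, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject_fwer, reject_fdr)
```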

      "The outcomes of the VBM and voxel-based ROI analyses were overlaid onto orthogonal sections of the mean brain created from the entire study sample (n=50); " I don't understand what this refers to.

3. Abstract: A large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 1: Chris Armit

      This Technical Note describes the Computational Anatomy Toolbox (CAT) software tool, which includes a Graphical User Interface that can be used for morphometric analysis of Structural MRI data. The CAT software tool is impressive, and enables voxel-based and surface-based morphometric analysis to be accomplished on Structural MRI data, and also voxel-based tissue segmentation and surface mesh generation to be applied to these 3D imaging datasets. The authors helpfully illustrate the utility of the Computational Anatomy Toolbox (CAT) using T1-weighted structural brain images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.

      This is an excellent, freely available tool for the Neuroimaging community and the authors are to be commended for developing this impressive software tool.

      Minor comments

      I first attempted to launch the CAT software tool on macOS 14.0 (Sonoma) with Apple M1 chip, and on the command line I received the following message: "spm12" is damaged and can't be opened. You should move it to the Bin.

      I additionally tested the CAT software tool on macOS 12.6 (Monterey) with Intel chip, and I was able to run the CAT software tool on this platform.

A minor criticism is that the installation instructions in the supporting Readme file for the archive [CAT12.9_R2023b_MCR_Mac_arm64.zip], which runs on macOS with Intel chip, only detail how to install the SPM (Statistical Parametric Mapping) software tool. The CAT software tool needs to be downloaded separately and then moved into the directory of the SPM toolbox, and these installation instructions are included in the supporting CAT software documentation (https://neuro-jena.github.io/cat12-help/#get_started).

Given the issues I encountered during installation, I invite the authors to list the System Requirements - specifically the Operating Systems that are needed to run the CAT software tool - in the GigaScience manuscript and also in the supporting CAT software documentation.

      In addition, it would be particularly helpful if the instructions on how to install CAT in the context of SPM were included in the supporting Readme files for the Computational Anatomy Toolbox (CAT) zip archives.

1. Abstract: Background: Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research. Results: We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations. Conclusions: GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 3: Luca Beltrame

      Lavikka and coworkers present an interesting visualization framework and associated application for genomics visualization. The challenges outlined by the authors in finding appropriate visualization tools for large-scale genomics data were also experienced by this reviewer, and thus better and improved tools are always welcome.

The manuscript is well laid out, presenting the key facts in a proper manner. The use of GPU rendering for graphs is an excellent move, and I expect it to be extremely useful even for machines with lower-end GPUs. The code looks reasonably written and commented (being an application, this too is important for a review). I have also tested the examples, and indeed the software is very useful (the documentation should, however, point out that some issues regarding saving the canvas still exist). One may argue that the use of JSON for the graph grammar can be awkward, but at the same time other file formats may be more problematic and/or require specialized parsers (which open yet another can of worms).

      Documentation is also logically organized. As a minor suggestion, the authors may want to add some form of search to their documentation page.

There is an open question that the authors may want to answer: they explicitly mention GISTIC 1.0 for the G-score plots. Is there a specific reason why they chose 1.0? The 2.0 algorithm is far more robust and produces more reliable results.

2. Abstract: Background: Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research. Results: We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations. Conclusions: GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

Reviewer 2: Alessandro Romanel

In this article, the authors introduce GenomeSpy, a grammar-based toolkit for creating customized, interactive visualizations for genomic data analysis. I find the article extremely interesting, and I believe the framework introduced by the authors has broad utility. The website is well-maintained and documented, and I particularly found the examples mentioned in the paper to be useful and informative. The authors chose to present their toolkit by narrating the navigation of a dataset generated in the DECIDER study. While the narrative makes the utility of the visualizations clear in data interpretation, what is not clear at all is how easy it is to use GenomeSpy to create those same visualizations. I believe that the success of a toolkit like this is strongly tied to its ease of use, and this aspect is not clear or prominently highlighted in the manuscript. Additionally, it would be interesting to more clearly highlight GenomeSpy's strengths compared to other approaches. By combining R Shiny and ggplot, it is indeed possible to create complex interactive data visualizations. Therefore, it would be necessary to more strongly emphasize what the other innovative aspects of GenomeSpy are, beyond GPU acceleration, compared to other approaches available today.

3. Abstract: Background: Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research. Results: We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations. Conclusions: GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 1: Andrea Sboner

      In this manuscript, the authors present Genome Spy, a visualization toolkit geared toward the rapid and interactive exploration of genomic features. They demonstrate how this tool can help investigators explore a large cohort of 753 ovarian cancers sequenced by whole-genome sequencing (WGS). By using the tool, they were able to identify outliers in the dataset and refine their diagnosis. The tool is inspired by Vega-lite, a high-level grammar for interactive graphics, and extends it for genomic applications.

The manuscript is clearly written, and the authors provide links to the application itself, tutorials and examples. I want to commend them for doing this. This is a tool that would nicely complement others, and it has the specific advantage of using the high-performance GPUs that are now common in modern computers.

The only concern that I have is about a couple of claims that may not be fully supported by the data provided:

1. Claim: users can implement new visualization designs easily. While the grammar certainly enables the users to define new designs, I do not think that this is necessarily easy, as the authors themselves recognize in the discussion section when they suggest providing templates to reduce the learning curve. Indeed, the example in Figure 2 is still quite verbose and would need some time for anyone to understand the syntax and the style. The playground web application facilitates testing it, though.
2. Claim: the grammar-based approach allows building blocks to be mixed and matched. I did not find any specific example of how to do this. It would have been quite interesting to see the intersection between the DNA representation of structural variants and RNA-seq data (if this is what is meant by "mix and match").

1. Abstract: Background: Sequencing of SARS-CoV-2 RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, public health officials can gain early insights into the spread of the virus and inform timely intervention measures. The construction of reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, the selection of reference sequences or mutations directly affects the predictive power. Results: Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark three data sets: 1) synthetic “spike-in” mixtures, 2) German samples from early 2021, mainly comprising Alpha, and 3) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The two approaches differ in sub-lineage detection, with the marker-mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test data sets, and illustrate the effects of virus lineage composition of wastewater samples and references. Conclusions: Here, we compare a mutation- and sequence-based reference construction and assignment for SARS-CoV-2 abundance estimation from wastewater samples. Our study highlights current computational challenges, focusing on the general reference design, which significantly and directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of higher standardization.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae051), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

**Reviewer 2: Liuyang Zhao**

In this study, the authors employ parameter escalation experiments to assess the impact of reference size and alternative allele frequency cutoffs on virus lineage abundance estimates from wastewater samples. The research provides valuable insights into how different parameter settings influence outcomes in the test data sets, particularly highlighting the role of virus lineage composition in wastewater samples and the corresponding references. Detailed parameters for these analyses are made available in several bash files at osf.io/upbqj (a schematic sketch of such a parameter sweep follows the comments below). Despite these significant contributions, certain areas could benefit from further enhancement:

1. The current methodology utilizes Ion Torrent for testing mock samples. However, this approach may not fully capture the variability in alignment and sub-lineage analysis. Incorporating additional sequencing data from PacBio, Nanopore, and Illumina would offer a more comprehensive examination of these aspects, potentially leading to more robust findings.

2. While the study showcases a variety of pipelines based on mutation-based and sequence-based tools in Table 1, the evaluation of the three data sets was limited to only using MAMUSS (as a mutation-based reference) and VLQ-nf (as a sequence-based reference). For more conclusive guidance in pipeline selection, it is advisable for the authors to expand their analysis to include at least two or three more pipelines. This recommendation aligns with observations noted by the authors at line 619, suggesting a comprehensive benchmark comparison would significantly enhance the study's utility and appeal to readers seeking optimal pipeline strategies.
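For readers who want to picture the kind of parameter escalation discussed above, a minimal bash sketch of such a sweep is shown below. All file names and the `estimate_abundance` command are hypothetical placeholders; the authors' actual scripts are the bash files at osf.io/upbqj.

```bash
#!/usr/bin/env bash
# Schematic parameter-escalation sweep over reference size and alternative
# allele frequency (AF) cutoffs. "estimate_abundance" is a hypothetical
# stand-in for the mutation- or sequence-based estimator under test.
REF_SIZES="5 10 50 100"          # number of reference sequences per lineage
AF_CUTOFFS="0.05 0.10 0.25 0.50" # alternative allele frequency cutoffs

for n in $REF_SIZES; do
  for af in $AF_CUTOFFS; do
    estimate_abundance \
      --reference "refs_n${n}.fasta" \
      --min-af "$af" \
      --reads wastewater_sample.fastq.gz \
      --out "abundance_n${n}_af${af}.tsv"
  done
done
```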

2. AbstractBackground Sequencing of SARS-CoV-2 RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, public health officials can gain early insights into the spread of the virus and inform timely intervention measures. The construction of reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, the selection of reference sequences or mutations directly affects the predictive power.Results Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark three data sets: 1) synthetic “spike-in” mixtures, 2) German samples from early 2021, mainly comprising Alpha, and 3) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The two approaches differ in sub-lineage detection, with the marker-mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test data sets, and illustrate the effects of virus lineage composition of wastewater samples and references.Conclusions Here, we compare a mutation- and sequence-based reference construction and assignment for SARS-CoV-2 abundance estimation from wastewater samples. Our study highlights current computational challenges, focusing on the general reference design, which significantly and directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of higher standardization.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae051), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

**Reviewer 1: Irene Bassano**

In the manuscript "Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data", Aßmann et al. compare two methods, a sequence- and a mutation-based one, respectively, to better understand the circulating lineages and sub-lineages in wastewater samples. Since the advent of wastewater-based epidemiology (WBE) as a tool to complement results from clinical data, there has been a search for novel tools that can give robustness to the results and, more importantly, confidence in the data analysis. In this context, this manuscript is very important as it contributes towards achieving that goal. This is clear in the fact that they have designed a new tool, namely MAMUSS.

1. One aspect, however, that the manuscript fails to mention is the difficulty in reconstructing full genome sequences from wastewater data. This has been one of the biggest problems, since it is widely accepted that viral particles in water do degrade, and consequently what is being sequenced is a partial genome. Consensus sequences are therefore very difficult to obtain.
2. Another aspect that the authors fail to mention in the introduction or as a point of discussion is how a variant is defined and how we take this information from clinical samples to then adopt it to define variants in environmental samples, although some relevant tools are mentioned, such as COJAC and MMMVI. Yet how these are used is not explained.
3. The manuscript is well written; there are some repetitive sentences that need to be removed (see comments on PDF) as well as a couple of sentences which are not grammatically correct (see comments on PDF).
4. It is worth mentioning that the words "variants" and "lineages" are used interchangeably. I do suggest they choose one term only.
5. The manuscript mentions several times the presence of false and true positives; however, it does not mention how these were calculated. These need to be supported by a small statistical test.
6. There are minor corrections throughout the manuscript that need to be addressed. All these are highlighted as comments in the original manuscript.

    1. Editors Assessment:

RAD-Seq (Restriction-site-associated DNA sequencing) is a cost-effective method for single nucleotide polymorphism (SNP) discovery and genotyping. In this study the authors performed a kinship analysis and pedigree reconstruction for two different cattle breeds (Angus and Xiangxi yellow cattle). A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping using RAD-Seq. This produced a panel of 7305 SNPs capturing the maximum difference between paternal and maternal genome information, able to distinguish between the F1 and F2 generations with 90% accuracy. Peer review helped highlight better the practical applications of this work. The combination of the efficiency of RAD-Seq and the advances in kinship analysis presented here can help improve breed management, local resource utilization, and conservation of livestock.

      This evaluation refers to version 1 of the preprint

2. AbstractKinship and pedigree information, used for estimating inbreeding, heritability, selection, and gene flow, is useful for breeding and animal conservation. However, as the size of the crossbred population increases, inaccurate generation and parentage recording in livestock farms increases. Restriction-site-associated DNA sequencing (RAD-Seq) is a cost-effective platform for single nucleotide polymorphism (SNP) discovery and genotyping. Here, we performed a kinship analysis and pedigree reconstruction for Angus and Xiangxi yellow cattle, which benefit from good meat quality and yields, providing a basis for livestock management. A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping. The identified SNP panel included 7305 SNPs capturing the maximum difference between paternal and maternal genome information, allowing us to distinguish between the F1 and F2 generations with 90% accuracy. In addition, parentage assignment software based on different strategies verified the cross-assignments. In conclusion, we provided a low-cost and efficient SNP panel for kinship analyses and the improvement of local genetic resources, which are valuable for breed improvement, local resource utilization, and conservation.

This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.131), and the reviews are published under the same license. These are as follows.

      Reviewer 1. Liyun wan

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      The detailed parameters for the SNP and InDel calling should be described to allow reproduction.

      Additional Comments:

This research provides valuable insights into the use of RAD-Seq for kinship analysis and pedigree reconstruction, which is useful for breeding and animal conservation purposes. Overall, the study is well-conducted and the findings are relevant. However, there are a few aspects that require attention before the manuscript can be considered for publication. Please address the following points:

1. Provide practical applications: Highlight the practical applications of your research in livestock management, breed improvement, local resource utilization, and conservation. Discuss how the low-cost and efficient SNP panel can contribute to these areas and provide suggestions for further research or implementation.
2. Language and clarity: Review the manuscript for clarity, grammar, and sentence structure. Ensure that all key terms and concepts are defined and explained to facilitate understanding for a broad readership.

Once these revisions have been made, I believe the manuscript will be much stronger and suitable for publication.

      Reviewer 2. Mohammad Bagher Zandi

      Is the language of sufficient quality?

      Yes. It was great.

      Are all data available and do they match the descriptions in the paper?

Yes. The raw sequencing reads were deposited, but it would be better to share the SNP data as well.

      Is the data acquisition clear, complete and methodologically sound?

No. SNP detection and SNP selection for the assignment test are not clear.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

No. In some cases, the materials and methods section is vague; it would be better to correct this. Details are marked in the attached manuscript text.

Additional Comments: Well done research, but the manuscript needs some corrections, as commented in the attached file. See: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTA1L2dpZ2EtY29tZW50cy5kb2N4

1. Editors Assessment: This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong (see https://doi.org/10.46471/GIGABYTE_SERIES_0006). This example assembles the genome of the black-faced spoonbill (Platalea minor), an emblematic wading bird from East Asia that is classified as globally endangered by the IUCN. This Data Release reports a 1.24 Gb chromosomal-level genome assembly produced using a combination of PacBio SMRT and Omni-C scaffolding technologies. BUSCO and Merqury validation were carried out, gene models created, and peer reviewers also requested MCscan synteny analysis. This showed the genome assembly had high sequence continuity with a scaffold N50 of 53 Mb. Presenting data from 14 individuals, this will hopefully be a useful and valuable resource for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.

This evaluation refers to version 1 of the preprint

2. AbstractPlatalea minor, the black-faced spoonbill (Threskiornithidae), is a wading bird that is confined to coastal areas in East Asia. Due to habitat destruction, it has been classified by the International Union for Conservation of Nature (IUCN) as a globally endangered species. Nevertheless, the lack of genomic resources hinders our understanding of its biology and diversity, as well as the implementation of conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of P. minor using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (1.24 Gb) contains 95.33% of the sequences anchored to 31 pseudomolecules. The genome assembly also has high sequence continuity with scaffold length N50 = 53 Mb. A total of 18,780 protein-coding genes were predicted, and high BUSCO score completeness (93.7% of BUSCO metazoa_odb10 genes) was also revealed. A total of 6,155,417 bi-allelic SNPs were also identified from 13 P. minor individuals, accounting for ∼5% of the genome. The resource generated in this study offers new opportunities for studying the black-faced spoonbill, as well as carrying out conservation measures for this ecologically important spoonbill species.

      This work is part of a series of papers presenting outputs of the Hong Kong Biodiversity Genomics https://doi.org/10.46471/GIGABYTE_SERIES_0006 This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.130), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Richard Flamio Jr.

      Is the language of sufficient quality?

      No. There are some grammatical errors and spelling mistakes throughout the text.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. The authors did a phenomenal job at detailing the methods and data-processing steps.

      Additional Comments:

Very nice job on the paper. The methods are sound and the statistics regarding the genome assembly are thorough. My only two comments are: 1) I think the paper could be improved by the correction of grammatical errors, and 2) I am interested in a discussion about the number of chromosomes expected for this species (or an estimate) based on related species, and whether the authors believe all of the chromosomes were identified. For example, is the karyotype known, or can the researchers make any inferences about the number of microchromosomes in the assembly? Please see a recent paper I wrote on microchromosomes in the wood stork assembly (https://doi.org/10.1093/jhered/esad077) for some ideas in defining the chromosome architecture of the spoonbill and/or comparing this architecture to related species.

      Re-review:

      The authors incorporated the revisions nicely and have produced a quality manuscript. Well done.

Minor revisions:

Line 46: A comma is needed after (Threskiornithidae).
Line 47: “The” should not be capitalized.
Line 48: This should read “as a globally endangered species.”
Line 49: “However, the lack of genomic resources for the species hinders the understanding of its biology…”
Line 56: Consider changing “also revealed” to “identified” to avoid repetition from the previous sentence.
Line 65: Insert “the” before “bird’s.”
Lines 69-70: Move “locally” higher in the sentence – “and it is protected locally…”
Line 72: Replace “as of to date” with “prior to this study”.
Lines 78-79: Pluralize “part.”
Line 86: Replace “proceeded” with “processed.”
Line 133: “…are listed in Table 1.”
Line 158: “accounted”
Line 159: “Variant calling was performed using…”
Line 161: “Hard filtering was employed…”
Lines 200-201: “The heterozygosity levels… from five individuals were comparable to previous reports on spoonbills – black-faced spoonbill … and royal spoonbill … (Li et al. 2022).”
Line 202: New sentence. “The remaining heterozygosity levels observed…”
Line 206: “…genetic bottleneck in the black-faced spoonbill…”
Lines 208-209: “These results highlight the need…”
Lines 213-214: “…which are useful and precious resources for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.”
Line 226: Missing a period after “heterozygosity.”

For references, consider adding DOIs. Some citations have them but most citations would benefit from this addition.

      Reviewer 2. Phred Benham

      Is the language of sufficient quality?

      Generally yes, the language is sufficiently clear. However, a number of places could be refined and extra words removed.

      Are all data available and do they match the descriptions in the paper?

      Additional data is available on figshare.

I do not see any of the tables that are cited in the manuscript, nor their legends. Am I missing something? Also, there is no legend for the GenomeScope profile in Figure 3.

The assembly appears to be on GenBank as a scaffold-level assembly; can you list this accession info in the data availability section in addition to the project number?

      Is there sufficient data validation and statistical analyses of data quality?

      Overall fine, but some additional analyses would aid the paper. Comparison of the spoonbill genome to other close relatives using a synteny plot would be helpful.

      It would also be useful to put heterozygosity and inbreeding coefficients into context by comparing to results from other species.

      Additional Comments:

Hui et al. report a chromosome-level genome for the black-faced spoonbill, an endangered species of coastal wetlands in East Asia. This genome will serve as an important resource for understanding the biology of, and conserving, this species.

      Generally, the methods are sound and appropriate for the generation of genomic sequence.

      Major comments: This is a highly contiguous genome in line with metrics for Vertebrate Genomics Project genomes and other consortia. The authors argue that they have assembled 31 Pseudo-molecules or chromosomes. It would be nice to see a plot showing synteny of these 31 chromosomes and a closely related species with a chromosome level assembly (e.g. Theristicus caerulescens; GCA_020745775.1)

      The tables appear to be missing from the submitted manuscript?

      Minor comments: Line 49: delete its

      Line 49-51: This sentence is a little awkward, please revise.

      Line 64: delete 'the'

      Line 67: replace 'with' with 'the spoonbil as a'

      Line 68: delete 'Interestingly'

      Line 70: can you be more specific about what kind of genetic methods had previously been performed?

      Line 79: can you provide any additional details on the necessary permits and/or institutional approval

      Line 78: what kind of tissue? or were these blood samples?

      Line 110: do you mean movies?

      Line 143: replace data with dataset

      Line 163: it may be worth applying some additional filters in vcftools, e.g. minor allele freq., min depth, max depth, what level of missing data was allowed?, etc.
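To make the suggestion above concrete, a minimal sketch of such vcftools filters is shown below; the thresholds and the input file name are illustrative assumptions, not values from the manuscript.

```bash
# Additional site-level filters in vcftools, as suggested above.
# --maf: minor allele frequency; --min/--max-meanDP: mean depth bounds;
# --max-missing 0.9: keep sites genotyped in >=90% of individuals.
# pmin_snps.vcf.gz and the thresholds are hypothetical.
vcftools --gzvcf pmin_snps.vcf.gz \
  --maf 0.05 \
  --min-meanDP 10 \
  --max-meanDP 100 \
  --max-missing 0.9 \
  --recode --recode-INFO-all \
  --out pmin_snps_filtered
```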

      Line 171: delete 'resulted in'

      Line 172: do you mean scaffold L50 was 8? Line 191-195: some context would be useful here, how does this level of heterozygosity and inbreeding compare to other waterbirds?

      Line 217: why did you use the Metazoan database and not the Aves_odb10 database for Busco?
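For reference, switching BUSCO to the bird-specific lineage is a one-flag change; a minimal sketch, assuming BUSCO v5 and a hypothetical assembly file name:

```bash
# Rerun BUSCO completeness assessment against the Aves lineage instead of
# metazoa_odb10 (pmin_assembly.fasta is a hypothetical file name).
busco -i pmin_assembly.fasta -l aves_odb10 -m genome -o busco_pmin_aves
```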

Figure 1b: Number refers to what, scaffolds? Be consistent with capitalization for Mb. It seems like the order of scaffold N50 and L50 was reversed.

Figure 3 is missing a legend.

Re-review:

      I previously reviewed this manuscript and overall the authors have done a nice job addressing all of my comments.

      I appreciate that the authors include the MCscan analysis that I suggested. However, the alignment of the P. minor assembly and annotations to other genomes suggests rampant mis-assembly or translocations. Birds have fairly high synteny and I would expect Pmin to look more similar to the comparison between T. caerulescens and M. americana in the MCscan plot. For instance, parts of the largest scaffold in the Pmin assembly map to multiple different chromosomes in the Tcae assembly. Similarly, the Z in Tcae maps to 11 different scaffolds in the Pmin assembly and there does not appear to be a single large scaffold in the Pmin assembly that corresponds to the Z chromosome.

      The genome seems to be otherwise of strong quality, so I urge the authors to double-check their MCscan synteny analysis. If this pattern remains, can you please add some comments about it to the end of the Data Validation and Quality Control section? I think other readers will also be surprised at the low levels of synteny apparent between the spoonbill and ibis assemblies.
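For authors revisiting this analysis, the MCscan (JCVI) comparison can be re-derived in two commands; a minimal sketch, assuming CDS FASTA and BED files for each assembly have been prepared as `pmin`/`tcae` following the jcvi workflow (file names are hypothetical):

```bash
# Compute pairwise synteny anchors between the two assemblies.
python -m jcvi.compara.catalog ortholog pmin tcae --cscore=.99 --no_strip_names
# Draw the macro-synteny karyotype plot; "seqids" lists the chromosomes
# to show per genome and "layout" positions the tracks.
python -m jcvi.graphics.karyotype seqids layout
```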

  16. Jul 2024
    1. Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Virgilio Gail Ponferrada (R1)

      Thanks to the authors for accommodating the reviewers' suggestions. The manuscript continues to be well constructed and easy to read. I appreciate the addition of micro-CT analysis of Xenopus gut development and the inclusion of scans of additional samples for statistical analysis bolstering their findings. Should the manuscript be accepted for publication, perhaps the authors will contact Xenbase (www.xenbase.org), the Xenopus research database, as an additional means of featuring their micro-CT datasets. I suggest this manuscript be accepted for publication.

    2. Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: John Wallingford (Original submission)

      Laznovsky et al. present a nice compendium of micro-CT-based digital volumes of several stages of Xenopus development. Given the prominence of this important model animal in studies of developmental biology and physiology, this dataset is quite useful and will be of interest to the community. That said, the study has some key limitations that will limit its utility for the research community, though these do not reduce the dataset's impact in the education and popular science realms, which is also a stated goal for the paper. Overall, I recommend publication after an effort has been made to address the following concerns.

      1. The atlas adequately samples developmental stages from late tadpole through metamorphosis. However, as far as I can tell only a single sample has been imaged at each stage. Thus, the quantifications of inter-stage differences shown here (Fig. 2, 4, 5) are at best very rough estimates and also provide no information about intra-stage variability in these metrics. This is not a fatal weakness, but it is an important caveat that I believe should be very explicitly stated in the text and in the figure legend of relevant figures.

2. I am very disappointed that the rich history of microCT on Xenopus seems to have been entirely ignored by these authors. MicroCT has already been used to describe the skull, the brain, liver, blood vessels, etc. during Xenopus development. (Just a few papers the authors should read are: Slater et al., PLoS One 2009; Senevirathnea et al., PNAS, 2019; Ishii et al., Dev. Growth, Diff. 2023; Zhu et al., Front. Zool 2020.) It has also been used for comparative studies of other frogs (Kondo et al., Dev. Growth, Diff. 2022; Kraus, Anat. Rec. 2021; Jandausch et al., Zool. Anz. 2022; Paluh et al., Evolution 2021; Paluh et al., eLife 2021). None of these, or the many other relevant papers, are discussed or cited here. The research community would be much better served if the authors made a serious effort to integrate their methods and their results into this existing literature.

3. An opportunity may have been missed here to provide some truly new biological insights: the gut remodels substantially during metamorphosis, but to my knowledge that has NOT been previously examined by microCT. It may not work, as the gut may simply be too soft to visualize, but then again, it may be worth trying.

    3. Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Virgilio Gail Ponferrada (Original submission)

      The manuscript is well written and easy to understand. It will be a good contribution to the Xenopus research community as well as a useful reference for the field of developmental and amphibian biology.

I suggest the following revisions:

- For the graphical abstract, try alternating NF stage numbers above and below samples for a cleaner look; adult male and adult female can both remain at the top.
- Appreciate the rationale for providing the microCT analysis presented in this manuscript and the choices of late stage tadpoles, pre- and prometamorphosis, through metamorphosis to the adult male and female frog.
- For the head development section, the authors can make reference to the Xenhead drawings, Zahn et al. Development 2017.
- Head Development section paragraph 4, change the word "gender" to "sex."
- Supplementary Table 3. Change "gender-related" to "sex-related."
- Micro-CT Data Analysis of Long Bone Growth Dynamics section paragraph 1, change "in terms of gender" to "in terms of sex."
- Figure 4 panels A and B don't reflect the observation that adult females are enlarged males. While the authors state that the views of the male and female skeletons are maximized and not proportional, as stated in the caption, I suggest that scale bars be employed and the images adjusted to show the size relationship between the sexes as in Figure 1. On first glance, and perhaps to those not as familiar with the difference in sex size in Xenopus, this particular example of the adult male image being more spread out compared to the image of the female feels misleading.
- Ossification Analysis section paragraph 2, change "frog's gender" to "frog's sex."
- Figure 5 panel A, the label is overlapping "NF 59." For panels B and B', scale bars on these panels would help the reader understand the proportions. Yes, there is the 3 mm scale bar from panel A and as stated in the caption, but including them in the B panels could help, even if panel B had a scale bar labeled at 0.25 mm and panel B' was 3 mm.
- Segmentation of Selected Internal Soft Organ section: perhaps more commentary on the ability to observe the development of the segmentation of the brain regions (cbh: cerebral hemispheres; cbl: cerebellum; dch: diencephalon; mob: medulla oblongata; opl: optic lobes; sp: spinal cord). While clearly shown in Figure 6, some accompanying description in the text would help readers in general, or give the implication that microCT analysis of mutant or diseased frogs could help identify physical characteristics of frogs with developmental or neurological disorders. This would help transition from the analysis of a specific organ to the next section, Further Biological Potential of Xenopus's Data.
- These analyses, while thorough and accompanied by novel visuals, require statistical implementation of multiple tadpoles and frogs per NF stage to account for variation in samples and to bolster the claims stated in skull thickness, the head mass and eye distance changes, increased length of the long bones during maturation, and femoral ossification cartilage-to-bone ratios. This may constitute a suggested major revision to perform these analyses.

    4. Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Brian Metscher (Original submission)

      The authors present a set of 3D images of selected developmental stages of the widely-used laboratory model Xenopus laevis along with some examples of how the data might be used in developmental analyses. The dataset covers stages from mid-larva through metamorphosis to adult, which should provide a starting point for various studies of morphological development. Some studies will undoubtedly require other stages or more detailed images, but the presented data were collected with straightforward methods that will allow compatibility with future work.

      The data appear to be sound in the collection and curation. Data availability is made clear in the article, and the complete set will be publicly available in standard formats on the Zenodo repository. This should ensure full accessibility to anyone interested. The article is well-organized and clearly written.

A few points about the methods could be clarified:

Was only one specimen per stage scanned?

Specimens were dehydrated through an ethanol series and then stained with free iodine in 90% methanol, and then rehydrated back through ethanol. Why was methanol used for the staining and not dehydration? It seems odd to switch alcohols back and forth without intermediate steps. This could have some effect on tissue shrinkage.

It should be indicated that the X-ray source target is tungsten (even though it is unlikely to be anything else in this machine).

The "real images" (p. 7) in Suppl. Fig. 1 should simply be called photographs - microCT images are real too.

For the measurements of bone mass, is the cartilage itself actually visible in the microCT images?

p. 13: "The dataset's diverse species representation…" What does this mean? It is only one species.

The limitations on the image data are not discussed. All images have limits to their useful resolution and contrast among components; this is not a weakness, just a reality of imaging.

The different reconstructed voxel sizes for different size specimens are mentioned, but it might be helpful to indicate the voxel sizes in Figure 1 as well as in the relevant table. And if the middle column of Figure 1 could be published with full resolution of the snapshots, it would help show the actual quality of the images.

    1. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Yuan Jiang (R1)

      The authors have fully addressed my comments.

    2. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Yuan Jiang (original submission)

      Referee Report for "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes"

This paper introduced gNOMO2, a new version of gNOMO, which is a bioinformatic pipeline for multi-omic management and analysis of microbiomes. The authors claimed that gNOMO2 incorporates new differential abundance, integration, and visualization tools compared to gNOMO. However, these new features, as well as the distinction between gNOMO2 and gNOMO, have not been clearly presented in the paper. In addition, the Methods section is written as a pipeline of bioinformatic tools, and it is not clear what these tools are used for unless one is familiar with all of them.

      My major comments are as follows:

1. Given the existing work on gNOMO, it is critical for the authors to distinguish gNOMO2 from gNOMO to show its novelty. In the Methods section, the authors presented the six modules of gNOMO2. Are these all new to gNOMO2, or did gNOMO include some of these functions? A clearer presentation of gNOMO2 versus gNOMO is needed.
      2. The authors did not present the methods in each module very well. For example, the authors wrote in Module 2 that "MaAsLin2 [31] is employed to determine differentially abundant taxa based on both AS and MP data. Furthermore, a joint visualization of MP and AS results is performed using the combi R package [32]. The final outputs include AS and MP based abundance tables, results from differential abundance analysis, and joint visualization analysis results." Without reading the references 31 and 32, it is very hard to understand what this module is really doing.
3. The authors used the term "integrated multi-omics analysis" in all six modules of gNOMO2. It is not clear what this term really means. It reads as if it is not really an integrated analysis; instead, it is more like a module that can handle different types of data separately, such as differential abundance analysis for each type. What other integration has been used except joint visualization? What new integration tools have been incorporated in gNOMO2?
4. In the differential abundance analysis, does the pipeline consider the features of microbiome data, such as their counts, sparsity, and compositionality? Can the modules incorporate covariates in their differential abundance analysis? It would be quite useful to have covariates adjusted in a differential abundance analysis.
      5. In the Analyses section, the authors applied gNOMO2 to re-analyze samples from previously published studies. They found some discrepancy between their results and the ones in the literature. Although some discrepancy is normal, the authors need to explain better what causes the discrepancy and whether it could yield different biological conclusions.
    3. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Alexander Bartholomaus (original submission)

Summary: "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes" by Arıkan and Muth presents a multi-omics tool for the analysis of prokaryotes. It is an evolution of the first version and offers various separate modules, taking different types of input data. They present different example analyses based on already published data and reproduced the results. The manuscript is very well written (I could not detect a single typo) and it was fun to read! Well done! I have only very few comments and suggestions, see below. However, I had a problem executing the code.

Key questions to answer:

1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes.
2) Are the conclusions adequately supported by the data shown? Yes.
3) Please indicate the quality of language in the manuscript. Does it require heavy editing for language and clarity? Very well written!
4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? No direct statistics are given in the manuscript. Maybe the authors could include some example output as a .zip file for interested potential users.

Detailed comments on the manuscript: Line 168: What does "cleaned and redundancies are removed" mean? Are only identical genomes removed? Or are genome parts that are identical (I guess this barely exists, except for conserved gene parts such as 16S, or similar) removed? Or are only redundant genes removed? How is redundancy defined - a 99% identical stretch? Line 399-405: When looking at Figure 5A, I am wondering how Fluviicoccus and Methanosarcina in the MP fraction appear relatively abundant in some samples. Were they de novo assembled in the MG or MT modules? General comment on figures: I know that it is hard to deal with automatic figure generation and especially the axis labels (as names have very different lengths). However, I think some figures might be hardly visible in the printed version; especially the axis labels for panel B are very small. Maybe you can put the critical figures separately in the supplement, e.g. each B panel on one page.

Suggestions: As suggested above, maybe the authors could include some example output (a simple example) as a .zip file for interested potential users. This would give an idea of what the output looks like and what to expect besides the plots. But the differential abundance tables might be more important than the plots, as users would generate their own plots for later publications.

Github and software: I also tested the software and followed the instructions in the GitHub. I successfully executed the "Requirements" and "Config" steps (including creation of the metadata file and copying of the amplicon reads) and tried to execute Module 1.

However, the following error occurred (using up-to-date conda and snakemake on Ubuntu Linux):

```
(snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -v
6.15.5
(snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -s workflow/Snakefile --cores 20
SyntaxError in line 9 of /home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/s3.py:
future feature annotations is not defined (s3.py, line 9)
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/__init__.py", line 34, in <module>
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 35, in <module>
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/doctools.py", line 21, in <module>
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/transport.py", line 104, in <module>
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/transport.py", line 49, in register_transport
  File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/importlib/__init__.py", line 126, in import_module
```

In addition to solving the problem, an example metadata file and some explanation about the output (which I did not see yet) would be good for less experienced users.
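A plausible reading of this traceback: the paths show a Python 3.6 environment, and `from __future__ import annotations` (used by smart_open) only parses on Python 3.7+, which would produce exactly this SyntaxError. A minimal sketch of a workaround, assuming the conda-forge and bioconda channels are available:

```bash
# Recreate the environment with a newer interpreter so smart_open's
# "from __future__ import annotations" (Python 3.7+) can be parsed.
conda create -n gnomo2 -c conda-forge -c bioconda "python>=3.9" snakemake
conda activate gnomo2
snakemake -s workflow/Snakefile --cores 20
```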

1. Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Weiwen Wang (R1)

      The author has addressed most of my concerns, although some issues remain unresolved due to hardware and technical limitations.

2. Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Weiwen Wang (original submission)

This manuscript by LeRoy et al. introduces PEPhub, a database aimed at enhancing the sharing and interoperability of biological metadata using the PEP framework. One of the key highlights of this manuscript is the visualization of the PEP framework, which improves the adoption of the PEP framework, facilitating the reuse of metadata. Additionally, PEPhub integrates data from GEO, making it convenient for users to access and utilize. Furthermore, PEPhub offers metadata validation, allowing users to quickly compare their PEP with other PEPhub schemas. Another notable feature is the natural language search, which further enhances the user experience. Overall, PEPhub provides a comprehensive solution that promotes efficient metadata sharing, while leveraging the impact of the PEP framework in organizing large-scale biological research projects. While this manuscript was interesting to read, I have several concerns regarding its "semantic" search system and the interaction of PEPhub.

1. The authors mentioned their use of a tool called "pepembed" to embed PEP descriptions into vectors. However, I was unable to locate the tool on GitHub, and there is limited information in the Method section regarding this. Could the authors provide additional details regarding the process of embedding vectors? (See the first sketch after this review.)

2. The authors implemented semantic search as an advantage of PEPhub. However, they did not evaluate the effectiveness of their natural language search engine, such as assessing accuracy, recall rate, or F1 score. It would be beneficial for the authors to perform an evaluation of their natural language search engine and provide metrics to demonstrate its performance. This would enhance the credibility and reliability of their claims regarding the advantages of natural language search in PEPhub.

3. It would be more beneficial to include the metadata in the search system rather than solely relying on the project description. For instance, when I searched for SRX17165287 (https://pephub.databio.org/geo/gse211736?tag=default), no results were returned.

4. When creating a new PEP, it appears that I can submit two samples with identical values. According to the PEP framework guidelines, it is mentioned that "Typically, samples should have unique values in the sample table index column". Therefore, the authors should enhance their metadata validation system to enforce this uniqueness constraint. (See the second sketch after this review.) Additionally, if I enter two identical values in the sample field and then attempt to add a SUBSAMPLE, an error occurs. However, when I modify one of the samples, I am able to save it successfully.

5. The error messages should provide more specific guidance. Currently, when attempting to save metadata with an incorrect format, all error messages are displayed as: "Unknown error occurred: Unknown".

6. PEPhub should consider providing user guidelines or examples on how to fill in subsample metadata and any relevant rules associated with it.

7. In the Validation module, what are the rules for validation? Does it only check for the required column names in the schema, or does it also validate the content of the metadata, such as whether the metadata is in the correct format (e.g., int or string)? Additionally, it would be beneficial to provide an option to download the relevant schema and clearly specify the required column names in the schema. This would enable users to better organize their PEP to comply with the schema format and ensure that their metadata is accurately validated.

8. This version of PEPhub primarily focuses on metadata. Have the authors considered any plans to expand this database to include data/pipeline management within the PEP framework? It would be valuable for the authors to discuss their future plans for PEPhub in this manuscript.

Some minor concerns:

1. When searching for content within a specific namespace, it would be beneficial for the pagination bar at the bottom of the webpage to display the number of pages. Now there are only Previous/Next buttons.

2. As a web service, it is better to show the supported browsers, such as Google Chrome (version xxx and above), Firefox (version xxx and above). I failed to open the PEPhub website using an old version of Chrome.
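For concreteness, here is a minimal sketch of the kind of embed-and-search pipeline that points 1 and 2 ask the authors to document and evaluate. The library and model name (sentence-transformers, all-MiniLM-L6-v2) are illustrative assumptions; this is not pepembed's or PEPhub's actual implementation.

# Sketch: embed project descriptions once, then answer queries by cosine
# similarity. Evaluating such a system (point 2) amounts to checking how
# often known-relevant projects appear in the top-k results.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "RNA-seq of human liver under drug treatment",
    "ChIP-seq of H3K27ac in mouse embryonic stem cells",
    "Whole genome sequencing of severe COVID-19 patients",
]
# Unit-normalised embeddings let a plain dot product act as cosine similarity.
index = model.encode(descriptions, normalize_embeddings=True)

def search(query, top_k=3):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(descriptions[i], float(scores[i])) for i in order]

print(search("liver transcriptomics"))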
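Similarly, a one-look sketch of the uniqueness check requested in point 4, run against a sample table with pandas; the file and column names are assumptions, not PEPhub's API.

# Sketch: reject a sample table whose index column (assumed here to be
# "sample_name") contains duplicate values, per the PEP guideline quoted above.
import pandas as pd

samples = pd.read_csv("sample_table.csv")
dupes = samples.loc[samples["sample_name"].duplicated(keep=False), "sample_name"]
if not dupes.empty:
    raise ValueError(
        "sample_name values must be unique; duplicates: "
        + ", ".join(sorted(set(dupes)))
    )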

3. Background: As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself. Results: Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Jeremy Leipzig (original submission)

Metadata describes the who, what, where, when, and why of an experiment. Sample metadata is arguably the most important of these, but not the only type. LeRoy et al. describe a user-centric sample metadata management system with extensibility, support for multiple interface modalities, and fuzzy semantic search. This system and portal, PEPhub, bridges the gaps between LIMS, which are tightly bound to the wet lab, metadata fetchers like GEOfetch (from the same group) or pysradb, and public portals like MetaSRA and the others listed in , both of which don't allow you to roll your own portal internally, and whose search criteria are not fuzzy or semantic. People have been storing metadata in bespoke databases for decades, but not in an interoperable, mature fashion. The PEPhub portal builds on existing PEP standards by the same group, introducing a RESTful API and GUI. I find this paper a novel and compelling submission but would like the following minor revisions:

1. Typically in SRA, a sample refers to a DNA sample drawn from a tissue sample (i.e., BioSample); runs then describe sequencing attempts on those DNA samples, and files are produced from each of the runs. It is unclear to me how someone working in an internal lab using PEPhub would know how to extract the file locations of sequence files associated with a sample if these are many-to-one. In the GEO example provided, I can click on the SRX link to see the runs and files, but how would this work for an internally generated entry? I need the authors to explain this either as a response or in the text.

2. I think the paper has to briefly describe how the authors envision PEPhub interfacing with or replacing a LIMS for labs that are producing their own data, and describe how it can help accelerate the SRA submission process for these data-generating labs.

3. Change "Bernasconi2021" to META-BASE in the text.

4. Some of the search confidence measures show an absurd level of significant digits (e.g., 56.99999999999999%). Please round that, as it's only used for sorting.

1. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Casey S. Greene (R2)

      The authors describe Omada, which is software to cluster transcriptomic data using multiple methods. The approach selects a heuristically best method from among those tested. The manuscript does describe a software package and there is evidence that the implementation works as described. The manuscript structure was substantially easier for me to follow with the revisions. The manuscript does not have evidence that the method outperforms other potential approaches in this space. It is not clear to me if this is or is not an important consideration for this journal. The form requires that I select from among the options offered. Given that this requires editorial assessment, I have marked "Minor Revision" but I do not feel a minor revision is necessary if, with the present content of the paper, the editor feels it is appropriate. If a revision is deemed necessary, I expect it would need to be a major one.

2. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Casey S. Greene (R1)

      The authors have revised their manuscript. They added benchmarking for the method, which is important. The following overall comments still apply - there is not substantial evidence provided for the selections made:

      "I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.

The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters and the limited evidence available to support those as strong choices.

      Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified." It is possible this is not necessary, but I simply note it as I continue to have these challenges with the revised manuscript.

3. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Pierre Cauchy (R1)

Kariotis et al. have efficiently addressed most reviewer comments. Omada, the tool presented there, will be of interest to the oncology and bioinformatics communities.

4. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Casey S. Greene (original submission)

      The authors describe a system for clustering gene expression data. The manuscript describes clustering workflows (data cleaning, assessing data structure, etc).

      I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.

The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters and the limited evidence available to support those as strong choices. Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified. I examined the vignette, and I found that it provided a set of examples. I can imagine that running this on larger datasets would be highly time-consuming. It would be helpful to add benchmarking or an estimate of compute time. Given that this seems feasible to parallelize, it might make sense to provide a mechanism for parallelization.
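On the parallelization point: a minimal sketch (in Python for illustration, though Omada itself is an R/Bioconductor package) of how bootstrap-style stability scoring across candidate k values parallelizes naturally across processes. The stability measure here, run-to-run agreement of k-means on half-samples, is an assumption standing in for Omada's actual criterion.

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_for_k(X, k, n_boot=20, seed=0):
    # Crude stability: mean adjusted Rand index between two independently
    # initialized k-means runs on random half-samples of the data.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        a = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        b = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        scores.append(adjusted_rand_score(a, b))
    return k, float(np.mean(scores))

if __name__ == "__main__":
    X = np.random.rand(200, 50)          # stand-in for an expression matrix
    ks = range(2, 7)
    with ProcessPoolExecutor() as pool:  # one worker per candidate k
        print(list(pool.map(stability_for_k, [X] * len(ks), ks)))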

I examined the software briefly and have some comments. Dead code exists in some files. There is at least one typo in a filename (gene_singatures.R). Some of the choices that seemed arbitrary appear to be hard-coded into the software (e.g., get_top30percent_coefficients.R).

5. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: Pierre Cauchy (original submission)

Kariotis et al. present Omada, a tool dedicated to automated partitioning of large-scale, cohort-based RNA-Seq data such as TCGA. A great strength of the manuscript is that it clearly shows that Omada is capable of performing partitioning from PanCan into BRCA, COAD and LUAD (Fig 5), and of handling datasets with no known groups (PAH and GUSTO), which is impressive and novel. I would like to praise the authors for coming up with such a tool, as the lack of a systematic tool dedicated to partitioning TCGA-like expression data is indeed a shortcoming in the field of medical genomics. Overall, I believe the tool will be very valuable to the scientific community and could potentially contribute to meta-analysis of cohort RNA-Seq data. I only have a few comments regarding the methodology and manuscript. I also think that it should be more clearly stated that Omada is dedicated to large datasets (e.g., TCGA) and not differential expression analysis. I would also suggest benchmarking Omada against comparable tools via ROC curves if possible (see below).

Methods: This section should be a bit more homogeneous between textual description and mathematical description. It should specify which parts are automated and which parts need user input, and refer to the vignette documentation. I also could not find the Omada GitHub repository.

Sample and gene expression preprocessing: To me, this section lacks methods/guidelines and only loosely describes the steps involved. "numerical data may need to be normalised in order to account for potential misdirecting quantities" - which kind of normalisation? "As for the number of genes, it is advised for larger genesets (>1000 genes) to filter down to the most variable ones before the application of any function as genes that do not vary across samples do not contribute towards identifying heterogeneity" - what filtering is recommended? Top 5% variance? 1%? Based on what metric?

Determining clustering potential: To me, it was not clear if this is automatically performed by Omada and how the feasibility score is determined.

Intra-method clustering agreement: Is this from normalised data? The affinity matrix will be greatly affected by whether the data are normalised or not, as it is the matrix of exp(-(normalised gene distance)^2).

Spectral clustering, step 2: "Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row" - please also specify that A(i,j) is 0 on the diagonal of this matrix. Please also confirm which matrix multiplication method is used: product or Cartesian product? Also, if there are 0 values, NAs will be obtained in this step.

Hierarchical clustering, step 5: "Repeat Step 3 a total of n − 1 times until there is only one cluster left." This is a valuable addition as this merges identical clusters; the methods should emphasise the benefits of this clustering reduction method in helping partition data, i.e. that it minimises the number of redundant clusters.

Stability-based assessment of feature sets: "For each dataset we generate the bootstrap stability for every k within range" - here it should be mentioned that this is carried out by clusterboot, and the full arguments should be given for documentation. "The genes that comprise the dataset with the highest stability are the ones that compose the most appropriate set for the downstream analysis" - is this the single highest dataset, or a gene list from the top n datasets? Please specify.

Choosing the k number of clusters: "This approach prevents any bias from specific metrics and frees the user from making decisions on any specific metric and assumptions on the optimal number of clusters." Out of consistency with the cluster reduction method in the "intra-clustering agreement" section, which I believe is a novelty introduced by Omada, and within the context of automated analysis, the package should also ideally have an optimised number of k clusters. K-means clustering analysis is often hindered by the output resulting in redundant, practically identical clusters, which often requires manual merging. While I do understand the rationale described there and in Table 3, in terms of biological information, and especially for deregulated gene analysis (e.g., row z-score clustering), should the maximum k not also be determined by the number of conditions, i.e. 2^n, e.g. when n=2, kmax=4; n=3, kmax=8?

Test datasets and Fig 6: Please expand on how the number of features (300) was determined. While this number of genes corresponds to a high stability index, is this number fixed or can it be dynamically estimated from a selection (e.g., from 100 to 1000)?

Results: Overall this section is well written and informative. I would just add the following, if applicable. Figure 3: I think this figure could additionally include benchmarking, e.g. ROC curves of Omada vs previous TCGA clustering analyses (PMID 31805048). Figure 4: I think it would be useful to compare Omada results to previous TCGA clustering analyses, e.g. PMID 35664309. Figure 6: swap C and D. Why is cluster 5 missing in D?
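To make the spectral-clustering step under discussion concrete, here is a textbook numpy sketch of the normalisation it describes: a Gaussian affinity with a zeroed diagonal, D built from row sums, and the ordinary matrix product D^(-1/2) A D^(-1/2). This is the standard Ng-Jordan-Weiss construction, shown as an assumption of what Omada implements rather than its actual code.

# Textbook spectral-clustering normalisation (Ng, Jordan & Weiss, 2002);
# illustrative only -- not Omada's exact implementation.
import numpy as np

def normalized_affinity(X, sigma=1.0):
    # Pairwise squared Euclidean distances between samples (rows of X).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)           # A(i, i) = 0, as the reviewer notes
    d = A.sum(axis=1)                  # row sums form the diagonal of D
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return D_inv_sqrt @ A @ D_inv_sqrt # ordinary matrix product, no NAs

X = np.random.rand(6, 4)
L = normalized_affinity(X)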

6. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets; however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this, we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: Ka-Chun Wong (Original submission)

The authors have proposed a tool to automate the unsupervised clustering of RNA-seq data. They have adopted multiple testing to ensure the robustness of the identified cell clusters. The identified cell clusters have been validated across different molecular dimensions with sound insights. Overall, the manuscript is well-written and suitable for GigaScience in 2023. I have the following suggestions:

1. It is very nice of the authors to have released the tool in Bioconductor. I was wondering if the authors could also highlight it at the end of the abstract, similar to the Oxford Bioinformatics style? It could attract citations.

2. The authors have spent significant effort on validating the identified clusters from different perspectives. However, there are many similar toolkits. Comparisons to them in time, user-friendliness, and memory requirements would be essential.

3. Since the submitting journal is GigaScience, running time analysis could be necessary to assess the toolkit's scalability in the context of big sequencing data.

4. Single-cell RNA-seq data use cases could also be considered in 2023.

    1. Editors Assessment:

Oxford Nanopore direct RNA sequencing (DRS) is a relatively new sequencing technology enabling measurement of RNA modifications. In vitro transcription (IVT)-based negative controls (i.e. modification-free transcripts) are a practical and targeted control for this direct sequencing, providing a baseline measurement for canonical nucleotides within a matched and biologically-derived sequence context. This work presents exactly this type of long-read, multicellular, poly-A RNA-based, IVT-derived, unmodified transcriptome dataset. Review flagged that more statistical analyses of data quality were needed, and these were provided. The resulting data provide a resource to the direct RNA analysis community, helping reduce the need for expensive IVT library preparation and sequencing for human samples, and also serving as a framework for RNA modification analysis in other organisms.

This evaluation refers to versions 1 and 2 of the preprint

2. Abstract: Nanopore direct RNA sequencing (DRS) enables measurements of RNA modifications. Modification-free transcripts are a practical and targeted control for DRS, providing a baseline measurement for canonical nucleotides within a matched and biologically derived sequence context. However, these controls can be challenging to generate and carry nanopore-specific nuances that can impact analysis. We produced DRS datasets using modification-free transcripts from in vitro transcription (IVT) of cDNA from six immortalized human cell lines. We characterized variation across cell lines and demonstrated how these may be interpreted. These data will serve as a versatile control and resource to the community for RNA modification analysis of human transcripts.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.129), and has published the reviews under the same license. These reviews are as follows:

      Reviewer 1. Joshua Burdick

      Is the language of sufficient quality?

      Yes. In line 284, "bioinformatic" may be more often used than "BioInformatic", but the meaning is clear.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. Presumably the files (e.g. eventalign data) which are not in SRA will need to be uploaded to the GigaByte site.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

Yes. Line 177 should presumably be "nanopolish eventalign".

      Is there sufficient data validation and statistical analyses of data quality?

      Yes. In my opinion, Figure 3(A) nicely illustrates the uncertainty in current nanopore data, which is useful.

      Additional Comments:

      The RNA samples, and nanopore sequencing data, should be useful as a negative control. Sequencing these IVT RNA samples using the newer ONT RNA004 pore and kit might also be useful.

      Reviewer 2. Jiaxu Wang

      Is there sufficient data validation and statistical analyses of data quality?

No. The authors ran DRS on the in vitro transcribed RNAs from 6 cell lines to remove the possible natural modifications. The data can be used as a control RNA pool for natural or artificial modification studies. However, more statistical analyses should be performed for the data quality; see comments below.

(1) For more possible usage of this data, some QC analysis should be provided to confirm the quality of these sequencing data. For example: 1) What is the correlation between the in vitro transcribed RNAs and the original DRS for each cell line? 2) How many genes have been captured in each cell line?

(2) In Figure 2B, the authors provide 3 conditions for ‘exclude’ and ‘include’; some statistical analysis should be performed to establish how many cases fall in condition 1, condition 2, and condition 3. How many mismatches show up in only 1 cell line, in some cell lines, or in all the cell lines? The shared correct genes may be more confident references for the modification analysis.

(3) Different reads of the same gene could have different mismatches in the IVT RNAs due to RT-PCR bias or other reasons (especially for lower expressed RNAs). For example, if there are 100 reads in total, 90 reads carry the correct nucleotide at a given position, and 10 reads have a mismatch in the IVT sample, then how is the signal defined as the control reference? Given that natural modification levels are low in RNA, some threshold should be applied for a confident result; for example, what is the lowest expression threshold that could be used for a confident control reference? (A sketch of such thresholding follows below.)
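To illustrate the thresholding asked for in point (3), here is a minimal pysam-based sketch that flags reference positions as confident controls only when read depth is adequate and the mismatch fraction stays below a cutoff. The file names, contig, and both thresholds are illustrative assumptions, not values from the paper.

# Sketch: per-position mismatch fraction with a minimum-depth filter,
# for defining confident (unmodified) IVT control positions.
import pysam

MIN_DEPTH = 20            # skip positions with too few reads
MAX_MISMATCH_FRAC = 0.05  # tolerate low-level RT-PCR or basecall noise

bam = pysam.AlignmentFile("ivt_sample.bam", "rb")
ref = pysam.FastaFile("reference.fa")

for col in bam.pileup("chr1", 0, 10_000, truncate=True):
    depth = col.nsegments
    if depth < MIN_DEPTH:
        continue
    ref_base = ref.fetch("chr1", col.reference_pos, col.reference_pos + 1).upper()
    bases = [b.upper() for b in col.get_query_sequences() if b]
    mism = sum(b != ref_base for b in bases) / max(len(bases), 1)
    status = "confident" if mism <= MAX_MISMATCH_FRAC else "exclude"
    print(col.reference_pos, depth, round(mism, 3), status)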

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. For more possible usage of this data, more QC data should be performed, please refer to my above comments.

      Re-review: I am happy to see the changes. Thanks!

    1. Editors Assessment:

This paper presents a new tool to make using PhysiCell easier; PhysiCell is an open-source, physics-based multicellular simulation framework with a very wide user base. PhysiCell Studio is a graphical tool that makes it easier to build, run, and visualize PhysiCell models. Over time it has evolved from being a simple GUI to include many additional functionalities, and it is available in desktop and cloud versions. This paper outlines its many features and functions, the design and development process behind it, and deployment instructions. Peer review improved the organisation of the various repositories and led to the addition of both requirements.txt and environment.yml files. Looking to the future, the developers plan to add new features based on community feedback and contributions, and this paper presents the many code repositories should readers wish to contribute to the development process.

      This evaluation refers to version 1 of the preprint

2. Abstract: Defining a multicellular model can be challenging. There may be hundreds of parameters that specify the attributes and behaviors of objects. Hopefully the model will be defined using some format specification, e.g., a markup language, that will provide easy model sharing (and a minimal step toward reproducibility). PhysiCell is an open source, physics-based multicellular simulation framework with an active and growing user community. It uses XML to define a model and, traditionally, users needed to manually edit the XML to modify the model. PhysiCell Studio is a tool to make this task easier. It provides a graphical user interface that allows editing the XML model definition, including the creation and deletion of fundamental objects, e.g., cell types and substrates in the microenvironment. It also lets users build their model by defining initial conditions and biological rules, run simulations, and view results interactively. PhysiCell Studio has evolved over multiple workshops and academic courses in recent years which has led to many improvements. Its design and development has benefited from an active undergraduate and graduate research program. Like PhysiCell, the Studio is open source software and contributions from the community are encouraged.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.128), and has published the reviews under the same license. This is part of the PhysiCell Ecosystem Series: https://doi.org/10.46471/GIGABYTE_SERIES_0003

      Reviewer 1. Meghna Verma:

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      The authors have provided links for video descriptions for installation and that is appreciated.

One overall recommendation: if all the screenshots (e.g., from Figs 1-12 of the main paper and all the subsections in the Supplementary) could be combined into one figure, that would help enhance the complete overview and the overall flow of the paper.

Additional comments are available here: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTA3L1Jldmlld19QaHlzaUNlbGxTdHVkaW9fTVYucGRm

      Reviewer 2. Koert Schreurs and Lin Wouters supervised by Inge Wortel

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      The problem statement is addressed in the introduction, which mentions the need for a GUI tool as a much more accessible way to edit the XML-based model syntax. However, it is somewhat confusing who exactly the intended audience of the paper is. Is the paper targeted at researchers that already use PhysiCell, but might want to switch to the GUI version? Or should it (also) target the potential new user-base of researchers interested in using ABMs, for whom the XML version was not sufficiently accessible and who will now gain access to these models because there is a GUI? Specifying the intended audience might impact some sections of the paper. For example, for users who already use PhysiCell, the step-by-step tutorials might not be useful since they would already know most of the available options; they would just need a quick overview of what info is in which tab. But if the paper is (also) targeted at potential new users, then some additional information could make both the paper and the tool much more accessible, such as:
      
      • A clear comparison to other modeling frameworks and their functionalities. Why should they use PhysiCell instead of one of the other available (GUI) tools? For example, the referenced Morpheus, CC3D and Artistoo all focus on a different model framework (CPMs); this might be worth mentioning. And what about Chaste? Does it represent different types of models, or are there other reasons to consider PhysiCell over Chaste or vice versa? For new users, this would be important information to include. The paper currently also does not mention other frameworks except those that offer a GUI. While the main point of the paper is the addition of the GUI, for completeness sake it might still be good to mention a broader overview of ABM frameworks and how they compare to PhysiCell, or simply to refer to an existing paper that provides such an overview.
      • The current tutorial immediately dives into very specific instructions (what to click and exact values to enter), often without explaining what these options mean or do. New users would probably appreciate to get a rough outline of which types of processes can be modelled, and which steps they would take to do so. This could be as easy as summarising the different main tabs before going into the details. I understand that some of these explanations will overlap with the main PhysiCell software – but considering that the GUI will open up modelling to a different type of community, it might make sense to outline them here to get a self-contained overview of functionality.
      • Indeed, if the above information is provided, the detailed tutorial might fit better as an appendix or in online documentation. That would also leave more space to explain not only which values to enter, but also what these variables do, why choose these values, what other options to consider, etc. Having this information together in one place would be very useful for beginning users.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      The software is available under the GPL v3 licence.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

There is a Github repository, ensuring that it is possible to contribute and report issues, and the paper explicitly invites community contributions. However, although the paper mentions that it is possible to seek support through Github Issues and “Slack channels”, we could find no link to the latter resource. This should probably be added to make this resource usable for the reader (or otherwise the statement should be removed).

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

Mostly yes, as installation and deployment are outlined in the paper and documentation. However, we did notice a couple of issues:
- The studio guide explains how to compile a project in PhysiCell (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but does not mention that Mac users need to specify the g++ version at the top of the Makefile. This is explained in a separate blog (http://www.mathcancer.org/blog/setting-up-gcc-openmp-on-osx-homebrew-edition/) but should be outlined (or at least referenced) here as well.
- There are several different resources covering the installation process, referring to e.g. github.com/physicell-training, github.com/PhysiCell-Tools/Studio-Guide, and the abovementioned blog. This might not be very accessible to all users targeted by the new GUI functionality (especially when command line interventions and manual Makefile edits are involved). While not all of this has to be changed before publication, having all the information in one place would already improve accessibility for a larger user base.
- When following the instructions (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), “python studio/bin/studio.py -p -e virus-sample” gives an error for the -p flag: “Invalid argument(s): [‘-p’]”. We assumed it has to be left out, but perhaps the docs have to be updated.

      Is the documentation provided clear and user friendly?

Mostly yes, as there is already a lot of documentation available. However, the user-friendliness could be improved with some minor changes. For example, the documentation could be made more user-friendly if resources were available from a central spot. Currently, information can be found in different places:
- https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md provides installation instructions and a nice overview of what is where in the GUI, but as mentioned above, does not mention potential issues when installing on MacOS.
- The paper provides very detailed examples; these might be nice to include along with the abovementioned overview.
- Potentially other places as well. It would be great if the main documentation page could at least link to these other resources with a brief description of what the user will find there.
Further, some additions would make the documentation more complete:
- It would be good to have an overview somewhere of all the configuration files that can be supplied/loaded (e.g. those for “rules” and for initial configurations).
- A clearer instruction/small tutorial on how to use Simularium and ParaView with PhysiCell Studio; especially for ParaView there is no instruction on how to use your own data or make your own `.pvsm` file.
In the longer term, it might be worthwhile to set up a self-contained documentation website (this is relatively easy nowadays using e.g. GitHub Pages), which can outline dependencies, installation instructions, a quick overview, detailed tutorials, example models, and links to GitHub issues/Slack communities. This is not a requirement for publication but might be worth looking into in the future as it would be more user-friendly.
      

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      No. The core functionality of the software is nicely outlined in the Github README (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but as mentioned before, this high-level overview is missing in the paper itself. The README and paper recommend installing the Anaconda python distribution to get the required python dependencies. This is fine, but adding a setup file or requirements.txt might still be useful for users who are more familiar with python and want a more minimal installation. Providing a conda environment.yml that allows running the studio along with paraview and/or simularium might also be helpful. Note that running the studio with simularium in anaconda did not work because anaconda did not have the required vtk v9.3.0; instead we had to install simularium without anaconda (“pip3 install simularium”).

      Are there (ideally real world) examples demonstrating use of the software?

The detailed tutorial nicely walks the reader through the tool (although, as mentioned before, a high-level overview is missing and the level of detail feels slightly out of place in the paper itself). When walking through the example in the paper and the supplementary, we did run into a few (minor) issues:
- It might be good to stress explicitly that after copying the template.xml into tumor_demo.xml, the first step is always to compile using “make”. The paper mentions “Assuming … you have compiled the template project executable (called “project”) …”, but it might not be immediately clear to all users how exactly they should do so (presumably by running “make tumor_demo” after copying the xml file?).
- When running “python studio/bin/studio.py -c tumor_demo.xml -e project” as instructed, a warning pops up that “rules0.csv” is not valid (although the tool itself still works).
- The instructions for plotting say to press “enter” when changing cmin and cmax, but Mac offers only a return key. Pressing fn+return to get the enter functionality also does not work; it might be good to offer an alternative for Mac.
- When reproducing the supplementary tutorial, results were slightly different. It might be good if the example offered a random seed so that users can verify that they can reproduce these results exactly. In our hands, reproducing figs 39, 40, 48 and 49 yielded way more (red) macrophages (even when running multiple times), but we could not be sure if this is due to variation between runs or a mistake in the settings somewhere.
      
      
      The paper mentions that they have started setting up automated testing, but it does not give an idea of what the current test coverage is. Did they add a few tests here and there, or start to systematically test all parts of the software? I understand the latter might not be achievable immediately, but it would be good if users and/or contributors can at least get a sense of how good the current coverage is. (Note: the framework uses pytest, which seems to offer some functionality to generate coverage reports, see e.g. https://www.lambdatest.com/blog/pytest-code-coverage-report/). The code in studio_for_pytest.py has a comment “do later, otherwise problems sometimes”, but it is not entirely clear if the relevant issue has been resolved.
      

      Additional Comments: The presented tool offers a GUI interface to the PhysiCell framework for agent-based modeling. As outlined for the paper, this offers significant value to the users since editing a model is now much more accessible. The tool comes with extensive functionality and instructions. Overall, the tool functions as advertised, and will be of great value to the community of PhysiCell users that now have to edit XML files by hand. It is therefore (mostly) publishable as is if some of the issues with installation (mentioned above) can be straightened out. That said, we do think some improvements could make both the tool and the paper more accessible to a larger user audience. Most of these have been mentioned in the other questions, but we will list some additional ones below. Note that many of these are just suggestions, so we will leave it up to the authors if and when they implement them.

Suggestions for the paper: While the paper nicely outlines design ideas and usage of the tool, there were some points where we felt that the main point did not quite come across, for example:
- As mentioned in the question about problem statement and intended audience, adding some information to the paper would make it a more useful resource to users not yet familiar with PhysiCell (see remarks there).
- The section “Design and development” describes the development history of the tool. In principle this is a valuable addition, because it illustrates how the project is under ongoing development and has already been improved several times based on feedback of users. However, the amount of information on each previous stage is slightly confusing; it is not entirely clear how this relates to the paper and current tool. If the main point is to showcase that the current tool has been built based on practical user experiences, this would probably come across better if this section was somewhat shorter and focused on the design choices rather than previous versions. If the main point is something else, it should be clarified what the main idea is.
- The point of Table 1 was unclear to us; consider removing it or explaining the main idea.
- Several figures do not have captions (e.g. Figure 1 but also others); it would be helpful to clarify what message the figure should convey.
- P4 “adjust the syntax for Windows if necessary”: is it self-explanatory how users should adjust? Consider adding the correct code for Windows as well if possible, since users that want to use the GUI tool might not be familiar with command line syntax.
- P6 “if you create your own custom C++ code referring directly to cell type ID”: this functionality is never discussed. This might be part of the general PhysiCell functionality, but it would be good to at least provide a link to a resource on how you could do this.
- P8 “Only those parameters that display … editing the C++ code”: it was not entirely clear to me what this means; could you clarify?
- P13 mentions you can immediately see changes made to the model parameters. This is very useful for prototyping when users want immediate feedback. However, what happens when you try to save output for a simulation where parameters were changed while the simulation was running? Would users be reminded that their current output is not representative?
- Discussion: it is good to mention that the tool is already being used. Can you give an indication, based on your experience, of how long it takes new users to learn to navigate the tool? This might be useful information to add in the paper.
- The last statement on LLMs seems to come out of nowhere. Consider leaving it out or expanding further on what would be needed to make this work/how feasible this is.

Further comments on the tool itself:
- The paper mentions that results may not be fully reproducible if multiple threads are used (we assume this is the case even when a random seed is set). In this case, would it make sense to throw a warning the first time a user tries to set a seed with multiple threads, to avoid confusion as to why the results are not reproducible?
- Unusable fields are not always greyed out to indicate that they are disabled, which sometimes makes it seem as though the tool is unresponsive. In other places unusable options are set to grey, so it might be good to double-check if this is consistent.
- On the initial conditions (IC) page there is no legend; it might be good to add one.
- There are some small inconsistencies between the field names mentioned in the paper and those in the tool/screenshots. For example “boundary condition” (p5) should be “dirichlet BC”, and “uptake” (p6) should be “uptake rate”. For the latter, the paper mentions that the length scale is 100 micron, but this should be visible in the tool as well.
- Not all fields have labels, so it is not always clear what the options do (see e.g. drop-downs in Figure 6).
- There are a few points in the tool where you have to “enable” a functionality before it works, but this might not always be intuitive. For example, if you upload a file with initial conditions, it can be assumed that you want to use it. There might be good reasons for this in some cases, but in general, consider whether all these checkpoints are necessary or if this could be simplified. The same goes for the csv files that have to be saved separately instead of through the main “save” button; in the long term it might be worth saving all relevant files when they are updated, or at least throwing a warning that you have to save some of them separately.

    1. Editors Assessment:

Many studies have explored the genetic determinants of COVID-19 severity, with these GWAS studies using microarrays or expensive whole-genome sequencing (WGS). Low-coverage WGS data can be imputed using reference panels to enhance resolution and statistical power while maintaining much lower costs, but imputation accuracy can be difficult to achieve. This work demonstrates how to address these challenges using the GLIMPSE1 algorithm, a less resource-intensive tool that produces more accurate imputed data than its predecessors, generating a dataset of 79 imputed low-coverage WGS samples from patients with severe COVID-19 symptoms during the initial wave of the SARS-CoV-2 pandemic in Spain. The validation of this imputation and filtering process shows that GLIMPSE1 can be confidently used to impute variants with minor allele frequency down to approximately 2%. After peer review the authors clarified points and provided more validation statistics and figures to support the approach. This work showcases the viability of using low-coverage WGS imputation to generate data for the study of disease-related genetic markers, alongside a validation methodology to ensure the accuracy of the data produced. It should inspire confidence and encourage others to deploy similar approaches for other infectious diseases, genetic disorders, or population-based genetic studies, particularly in large-scale genomic projects and resource-limited settings where sequencing at higher coverage could prove prohibitively expensive.

      This evaluation refers to version 1 of the preprint

2. Abstract: Despite advances in identifying genetic markers associated with severe COVID-19, the full genetic characterisation of the disease remains elusive. This study explores the use of imputation in low-coverage whole genome sequencing for a severe COVID-19 patient cohort. We generated a dataset of 79 imputed variant call format files using the GLIMPSE1 tool, each containing an average of 9.5 million single nucleotide variants. Validation revealed a high imputation accuracy (squared Pearson correlation ≈0.97) across sequencing platforms, showing GLIMPSE1’s ability to confidently impute variants with minor allele frequencies as low as 2% in Spanish ancestry individuals. We conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex and age-based differences, and clinical phenotypes using a standardised set of medical terms developed to characterise severe COVID-19 symptoms. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights for health challenges like COVID-19.

This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.127), and has published the reviews under the same license. For a video summary from the author see: https://youtu.be/x6oVzt_H_Pk?si=Byufhl0mIL3h0K6u

      The reviews are as follows:

      Reviewer 1. Jong Bhak:

Severe COVID-19 cases are critical data. This manuscript deals with a genome set with detailed clinical information, as a subset of exome sequences, and provides invaluable data for ongoing global COVID-19 omics studies.

      Reviewer 2. Alfredo Iacoangeli:

The authors present the release of a new dataset that includes low-coverage WGS data of 79 individuals who experienced severe COVID-19 in Madrid (Spain). The authors processed the data and imputed common variants, and they are making this dataset available to the scientific community. They also present the clinical data of these patients in a descriptive and informative fashion. Finally, the authors also validated the quality of their imputation, showcasing the potential of low-coverage WGS as an alternative to microarrays. Overall the manuscript is written very well, clear, and exhaustive. The data is certainly valuable, and its generation, processing, and analysis appear robust.
      

      Overall I support the publication of this article and dataset. I only have a small number of minor suggestions for the authors: The sentence "Traditionally, the genotyping process has relied on array technologies as the standard, both at the broader GWAS level and the more specific genetic scoring and genetic diagnostics levels" sounds a little off. I totally understand where the authors are coming from, but given the central role of NGS and Sanger sequencing in genetic diagnostics I would suggest the authors modify it accordingly or keep the GWAS focus.

      Please double-check the use of statistical terms in the description of the imputed data. For example: "On average, each VCF file in this rich dataset contains 9.49 million high-confidence single nucleotide variants [95%CI: 9.37 million - 9.61 million] (Figure 1)." The use of a CI in this context is a little misleading, as it does not strictly refer to a probability distribution but to a finite collection; a range would be more appropriate. The authors say that they examined the ethnicity of the 79 individuals; however, I do not think the ancestry is actually reported anywhere, while a few figures show ancestral population data. The authors might clarify or correct the terminology.

      Looking at Figure 2, the sentence "although the male age distribution exhibits a broader range and higher variability, suggestive of a greater" does not appear justified. The authors might want to clarify or correct accordingly.

      The sentence "This exploratory analysis highlights the diverse ways in which severe COVID-19 can present, and the importance of comprehensive and nuanced clinical phenotyping in improving our understanding and management of the disease." suggests some basic clustering might be useful. The readers might benefit from a couple of graphs or figures quantifying the overlap of the SNPs across samples and maybe one that shows the density of SNPs across the genome.

    1. this pathogen, coinciding with a progressive shrinking of the degradative arsenal and expansions in lineage specific genes. Comparative transcriptomics of four reference species with different evolutionary histories and adapted to different hosts revealed similarity in gene content but differences in the modulation of their transcription profiles. Only a few orthologs show similar expression profiles on different plant cell walls. Combining genome sequences and expression profiles we identified a set of core genes, such as specific transcription factors, involved in plant cell wall degradation in Colletotrichum. Together, these results indicate that the ancestral Colletotrichum were associated with eudicot plants and certain branches progressively adapted to different monocot hosts, reshaping part of the degradative and transcriptional arsenal.

      Reviewer 2: Nicolas Lapalu This manuscript describes the adaptation of the Colletotrichum genus to monocotyledonous and dicotyledonous plants with regard to the content and expression of genes from 30 genomes, with a subsampling of 4 genomes for transcriptomic analyses.
      Major remarks: "Considering that the analyses carried out are affected by the sampling, as closely related species are likely to have more shared genes compared to species that are more distant from others," Yes, indeed, this is clearly a possible bias due to the sampling, as you write. As you considered all genomes together to define specific genes, monocot-specific species have few specific genes due to their phylogenetic proximity. Based on this, could you address the following observations, based on a combination of Figures 1 and 2:
      1. The number of specific genes in C. eremochloae (1608) vs C. sublineola (1643), while the divergence time between the two seems short and similar to that of the group of C. lupini, C. costaricense (monocot) … with approximately 100 genes specific to each species. How could such closely related genomes have acquired so many specific genes in such a short time, compared with other species over the same period of evolution?
      2. The same remark for C. phormii (911) vs C. salicis (286), which is even more disturbing given the switch to dicots and a loss of many genes for C. salicis.
      For both cases mentioned above, a detailed comparison between the two genomes could be useful to obtain some explanation of the events and genes involved. Moreover, interpretation of the phylogenetic tree (Figure 1) could lead to proposing three clusters of genomes, based on evolutionary time and plant host: Monocot, Dicot "old" (C. orbiculare, C. noveboracense, …) and Dicot "young" (C. melonis, C. cuscutae, …). Did the authors attempt an analysis with such a view of the data? Maybe that would complete the view of the C. acutatum complex (46 genes) vs the C. graminicola complex (28 genes), from which C. orchidophilum and C. phormii are excluded. Finally, one of the most interesting things is the proximity of C. phormii and C. salicis in the same clade but with a recent host specialisation. Despite the poor quality of the genome of C. salicis vs C. phormii, a comparative genomic approach with a tool like SynChro could provide clues as to gene losses and their location (all along the genome / in specific regions).
      Figure 3: Please explain Figure 3A further; it is described as a PCA, but no axis (dimension) is shown with a % explaining the divergence between organisms. This is confusing and does not allow me to know whether the gene sets used to compare the 4 genomes are only shared genes or all genes. The rest of the figure is much clearer, and the comments are clear on the response to species specificity (under/over-expression of genes) for each genome.
      Figure 4: "the expression of the orthologous genes was clustered for the four fungal species (Figure 4A)" As written, it is assumed that you used ortholog genes established between the 4 species; this does not appear to be the case, with so many genes missing in C. graminicola in Figure 4. To continue on this point, I have not found the minimum number of species required to define a cluster of orthologs (maybe written but not found). What is the threshold for divergence or sequence similarity? Have you considered sequence length (query coverage vs subject coverage) to allow clustering of potentially split/fragmented genes in annotations?
      Minor remarks: The authors limit their analysis to 30 genomes, whereas more than 270 genomes of Colletotrichum are available, from over 70 species. Research time is clearly longer than the time to generate genomic resources, but it could be interesting to list a few new genomes missing from those analysed that could have significant added value (particularly if sequenced with long reads, providing complete genomes). Transcriptomic analyses were carried out on 4 genomes. The choice of the genomes was not discussed, and may have been made for convenience, with strains available in the lab. In fact, C. higginsianum is well sequenced, assembled and studied, and was chosen as one of the species specific to dicotyledonous hosts, whereas it is a member of the C. destructivum complex. Similarly, C. phormii appears to be a recent species with an adaptation to monocots.
      L113: "species with bigger genomes are characterized by a lower GC content"; please rewrite the link between genome size and GC content. Between species of the same genus, genome size is most often linked to the invasion of TE elements (RIPed or not, in fungi). Strongly RIPed genomes (Leptosphaeria, Venturia) are not always large compared to the size of other species.
      Data availability: All genomes were released in public databases, but I could not find accession numbers for the RNA-Seq runs. Many supplementary details have been provided. I appreciate the BUSCO logs for checking the completeness of gene sets, which provide some clues about the quality of the genome annotation, which was never discussed or pointed out in the manuscript as a possible source of bias.
      Overall, the manuscript is very interesting and confirms the results previously identified in terms of specificities of CAZy families associated with host plant adaptation in the Colletotrichum genus. The authors demonstrate a great knowledge of the CAZome and associated biological processes, which provides a great deal of valuable information for the community working on Colletotrichum and more generally for all those working on such enzymes. Finally, the transcriptomic data suggest that species specificity and host adaptation are more related to expression patterns than to specific gene content.

    2. Colletotrichum fungi infect a wide diversity of monocot and eudicot hosts, causing plant diseases on almost all economically important crops worldwide. In addition to its economic impact, Colletotrichum is a suitable model for the study of gene family evolution on a fine scale to uncover events in the genome that are associated with the evolution of biological characters important for host interactions. Here we present the genome sequences of 30 Colletotrichum species, 18 of them newly sequenced, covering the taxonomic diversity within the genus. A time-calibrated tree revealed that the Colletotrichum ancestor diverged in the late Cretaceous around 70 million years ago (mya) in parallel with the diversification of flowering plants. We

      Reviewer 1: Jamie McGowan In this study, Baroncelli and colleagues carry out a comprehensive analysis of genomic evolution in Colletotrichum fungi, an important group of plant pathogens with diverse and economically significant hosts. Their comparative genomic and phylogenomic analyses are based on the genome sequences of 30 Colletotrichum species spanning the diversity of the genus, including pathogens of dicots, monocots, and both dicots and monocots. This includes 18 genome sequences that are newly reported in this study. They also perform comparative transcriptomic analyses of 4 Colletotrichum species (2 dicot pathogens and 2 monocot pathogens) on different carbon sources. Overall, I thought the manuscript was very well written and technically sound. The results should be of interest to a broad audience, particularly to those interested in fungal evolutionary genomics and plant pathology. I only have a few minor comments. Minor comments: (1) Lines 50 - 51: "The plant cell wall (PCW) consists of many different polysaccharides that are attached not only to each other through a variety of linkages providing the main strength and structure for the PCW". I found this confusing - is the sentence incomplete? (2) Line 66: "Some Colletotrichum species show…" I think there should be a couple of introductory sentences about Colletotrichum before this. (3) Figure 1: It would be informative to label which genomes were sequenced with PacBio versus just Illumina. (4) Lines 254 - 255: "As no other enrichment was identified we performed a manual annotation of genes identified in Figure 3D". I don't think it is clear here what manual annotation this is referring to. (5) One area where I felt the analysis was lacking was the lack of analyses on genome repeat content. The authors highlight the large variation in genome sizes within Colletotrichum species (~44 Mb vs ~90 Mb) and show in Figure 1 that this correlates with increased non-coding DNA. It would have been interesting to determine if this is driven by the proliferation of particular repeat families. (6) Another concern is the inconsistent use of genome annotation methods. 12 of the genomes reported in this study were annotated using the JGI annotation pipeline, whereas the other 6 were annotated using the MAKER pipeline. Several studies (e.g., Weisman et al., 2022 - Current Biology) show that inconsistent genome annotation methods can inflate the number of observed lineage specific genes. The authors may wish to comment on this or demonstrate that this isn't an issue in their study (e.g., by aligning lineage specific proteins against the other genome assemblies).

    1. respectively. Focusing on inversions and translocations, symmetric SVs which are readily genotyped within both populations, 24 were found to be structural divergences, 2,623 structural polymorphisms, and 928 shared structural polymorphisms. We assessed the functional significance of fixed interspecies SVs by examining differences in estimated recombination rates and genetic differentiation between species, revealing a complex history of natural selection. Shared structural polymorphisms displayed enrichment of potentially adaptive genes.

      Reviewer 2: Lejun Ouyang Structural variation plays an important role in the domestication and adaptability of species. The authors compared structural variation between E. melliodora and E. sideroxylon populations. This is a very interesting study, but it feels that the authors merely present descriptive statistics; the biological questions raised by these differences have not been distilled, such as the impact of structural variation on recombination. What effect does it have on the differentiation of the two populations? Is it promoting or inhibiting? Secondly, the writing is not very clear, and some of the results are described too briefly, resulting in unclear conclusions. When formatting figures, try to avoid nesting figures, and use A, B, C, etc. to label panels. Some obvious issues, though not an exhaustive list, are noted above. Here are other minor issues:
      1. Lines 62-64: References are required.
      2. Lines 145-150: It is recommended to move this to the materials and methods section.
      3. The synteny and structural variation annotation section requires a detailed explanation of the results in Figure 2 and Table 2.
      4. It is recommended to turn Table 2 into a figure; the effect would be better.
      5. Tables should use the three-line format.
      6. Why does the recombination rate in Table 3 have positive and negative errors at the genome level, but only negative errors at the chromosome-average level?
      7. Lines 219-220: It is recommended that methods not appear in the results section; move them to the methods section.
      8. The structural variation genotyping part of the results section needs to be revised.
      9. Figure 6 is a bit confusing; it is recommended to revise it to make it clearer.
      10. The results for Figure 7 are not clearly described and the notes are not clear. What do the different colors represent?
      11. Lines 263-264: It is recommended that methods not appear in the results section; they can be placed in the materials and methods section.
      12. It is recommended that Figure 8 be divided into Figure 8A and Figure 8B. Try not to have pictures within pictures, which can easily lead to unclear references.
      13. Lines 276-281: It is recommended to move this to the methods section.
      14. Lines 289-290: It is recommended to move this to the methods section.
      15. Lines 307-308: E. melliodora and E. sideroxylon should be in italics.
      16. Lines 311-318 and 320-321: It is recommended to move them to the methods section.
      17. Lines 338-339: E. melliodora and E. sideroxylon should be in italics.
      18. Line 342: It is recommended to move this to the discussion.
      19. It is recommended to revise Figure 9B, Figure 10B and Figure 11B.
      20. Line 561: Add references.

    2. Structural variants (SVs) play a significant role in speciation and adaptation in many species, yet few studies have explored the prevalence and impact of different categories of SVs. We conducted a comparative analysis of long-read assembled reference genomes of closely related Eucalyptus species to identify candidate SVs potentially influencing speciation and adaptation. Interspecies SVs can be either fixed differences, or polymorphic in one or both species. To describe SV patterns, we employed short-read whole-genome sequencing on over 600 individuals of E. melliodora and E. sideroxylon, along with recent high quality genome assemblies. We aligned reads and genotyped interspecies SVs predicted between species reference genomes. Our results revealed that 49,756 of 58,025 and 39,536 of 47,064 interspecies SVs could be typed with short reads, in E. melliodora and E. sideroxylon

      Reviewer 1: Jakob Butler Ferguson et al. have performed a thorough analysis of two species of Eucalyptus, quantifying the extent of structural variation between assembled genomes of the species and determining how prevalent those variations are across a selection of wild material. I believe this study is of sufficient quality for publication in GigaScience, if some minor inconsistencies and grammatical issues are addressed and a few supporting analyses are performed. The major changes I would like to see include the addition of a SyRI plot of the complete set of SVs between E. melliodora and E. sideroxylon. I believe this, along with correcting the scale on the plots of recombination in Figures S6/S7, would allow for a better comparison of how recombination rate interacts with the SVs. I would also suggest a more formal test of enrichment for COG terms, to better support the statements of "enrichment" in the discussion. Suggested changes by line:
      Line 142 - This section is quite short; I would either merge it into the Genome scaffolding (and annotation) section, or expand on the results of the gene annotation.
      Line 182 - (Supplementary Figure S4)
      Line 183 (and throughout) - Please be consistent with your references to tables and figures.
      Line 186 - Delete the comma after 28.63%.
      Line 194 - These are density plots rather than histograms.
      Figure 4 - Both axes are labelled as PC1.
      Line 217 (page 10, line numbers are doubled up) - This seems repetitive, perhaps "…especially as they may also represent divergent sequences".
      Line 221 (page 11) - Please insert "and" before polymorphic translocations.
      Line 223 - You have stated earlier in the paragraph that those not successfully genotyped in both species are private or artefacts; please reduce the repetition.
      Figure 6 - I don't find this figure particularly informative (and it is somewhat confusing to interpret). I think showing the percentages of each different SV in a visual form implies a level of equivalence in genomic impact, which is difficult to reconcile with the raw difference in numbers. I think a supplemental table focusing on the percentages would illustrate the point better.
      Line 246 - There is no mention in the methods of what r threshold was used to declare a pair "correlated"; please state it here or in the methods.
      Line 265 - This line was confusing to interpret. A suggested alteration: "significant value. After functionally annotating all genes across the genome and placing them within COG categories, 247 of the total 281 gene candidates in SSPs were annotated. These genes were enriched for...."
      Line 266 - I would like to see a formal enrichment analysis rather than "increased/decreased association", so we can have a clearer picture of which gene functions are truly over/under-represented in SSPs. You could subsequently limit Figure 8 to those that show a difference.
      Line 275 - The grammar of this title is a bit off, perhaps "Effect of syntenic, rearranged, unaligned regions and genes on recombination rates".
      Line 276 - This is the first mention of p; please define it as recombination rate.
      Line 283 - The supplementary Figures S6 and S7 seem to have regions of heightened recombination, but this is difficult to interpret and compare with the current variable axis scales. Please make these consistent. I would also like to see the SyRI graph of the two aligned genomes, as this would allow for a visual comparison of SV regions with recombination rate.
      Line 290 - How were p-values adjusted?
      Line 294 - More information about this "significantly" higher recombination rate would be good, either in the figure or further expanded in the text.
      Line 307 - Italics for species names (repeated in the Figure 10 and Figure 11 captions).
      Line 310 - Similar problem to line 275.
      Figure 10 - Having Figure 9B repeated in Figure 10 and Figure 11 is unnecessary.
      Line 336 - Vertical lines show average FST, not p.
      Line 341 - Similar problem to line 275.
      Line 356 - translocations should be plural.
      Line 367 - Vertical lines show average SNP density, not p.
      Line 391 - This is the first mention of barrier loci; please define them.
      Line 413 - As mentioned above, I would recommend a formal enrichment test to support this statement.
      Line 428 - The grammar is poor here; please correct it.
      Line 490 - Please make this a complete sentence.
      Line 499 - Please state how the Hi-C map was manually edited, and what informed the position of those edits.
      Line 508 - Please provide an example of how well your LAI score of ~18 compares. The LAI paper seems to intimate that 10 is low quality?
      Line 513 - Missing bracket for version number.
      Line 536 - Syntenic rather than synteny.
      Line 717 - Formatting error in references.
      Supp tables S3-S4-S5 - Space between E. and sideroxylon.
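      Both reviewers ask for a formal enrichment test over COG categories. A standard choice is a per-category Fisher's exact test against the genome-wide background with Benjamini-Hochberg correction. The sketch below is one way to do this, not the authors' method; the count-dictionary inputs are assumptions.

      ```python
      from scipy.stats import fisher_exact
      from statsmodels.stats.multitest import multipletests

      def cog_enrichment(ssp_counts, background_counts):
          """Per-category Fisher's exact test with BH correction.

          ssp_counts: COG category -> number of SSP candidate genes
          background_counts: COG category -> number of genes genome-wide
          (the background must include the SSP genes themselves)
          """
          n_ssp = sum(ssp_counts.values())
          n_bg = sum(background_counts.values())
          rows = []
          for cat, bg in background_counts.items():
              in_cat = ssp_counts.get(cat, 0)
              table = [[in_cat, n_ssp - in_cat],
                       [bg - in_cat, (n_bg - n_ssp) - (bg - in_cat)]]
              odds, p = fisher_exact(table, alternative="two-sided")
              rows.append((cat, odds, p))
          reject, qvals, _, _ = multipletests([r[2] for r in rows], method="fdr_bh")
          return [(cat, odds, p, q, sig)
                  for (cat, odds, p), q, sig in zip(rows, qvals, reject)]
      ```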

    1. Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.

      Reviewer 3: Dmitrii Meleshko The paper titled "LRTK: A Platform-Agnostic Toolkit for Linked-Read Analysis of Both Human Genomes and Metagenomes" by Yang et al. is dedicated to the development of a unified interface for linked-read data processing. The problem described in the paper indeed exists; each linked-read technology requires complex preprocessing steps that are not straightforward or efficient. The idea of consolidating multiple tools in one place, with some of them modified to handle multiple data types, is commendable. Overall, I am supportive of this paper. My main concern, however, is that the impact of linked-read applications in the paper appears to be exaggerated, and the authors need to provide more context in their presentation. Also, some parts of the paper are vaguely described. I will elaborate on my concerns in more detail below.
      x) "Linked-read sequencing generates reads with high base quality and extrapolative information on long-range DNA connectedness, which has led to significant advancements in human genome and metagenome research[1-3]." - Citations 1-3 do not really describe advancements in human genome and metagenome research; these are technology papers. A similar problem can be found in the "Despite the limitations that genome specificity…" paragraph. The authors cited and described several algorithms that are not really genomic studies. E.g. "stLFR[2] has found application in a customized pipeline that has been developed to first convert its raw reads into a 10x-compatible format, after which Long Ranger is applied for downstream analysis." is not an example of a genomic study, but a pipeline description.
      x) Table S1 does not improve the paper; I would say it does completely the opposite. Long Ranger is not a toolkit; it should be considered a read alignment tool that outputs some SVs and haplotypes along the way. So a Long Ranger vs LRTK comparison does not make sense to me. There are other tools that solve the metagenome assembly problem, the human assembly problem, call certain classes of SVs, etc.
      x) I think incorporating Long Ranger is important, since its performance is reported to be better than EMA's for human samples and it is also more popular than EMA. Is it possible, and have you tried doing it?
      x) I would remove exaggerations such as "myriad" from the text. The scope of linked-reads is pretty limited nowadays. I agree that linked-reads might be useful in metagenomics/transcriptomics and the other scenarios mentioned in the text, but the number of studies is very limited, especially nowadays, and was not really big even when the 10x platform was on the rise.
      x) "LRTK reconstructs long DNA fragments" - when people talk about long fragment reconstruction, they usually mean Moleculo-style reconstruction through assembly. This reconstruction resembles "barcode deconvolution", described in Danko et al. and Mak et al., so I would stick to that terminology.
      x) It is important to note that Aquila, LinkedSV and VALOR2 are linked-read-specific tools, while FreeBayes, Samtools and GATK are short-read tools. Also, provide the target SV lengths for both groups of tools.
      x) There are some minor problems with the GitHub readme, e.g. "*parameters". Also, I don't understand how to use conversion in real life… E.g. 10x Genomics data often comes as a folder with multiple gzipped R1/R2/I1 files. I don't understand how I would use it in that case.
      x) Please cite or explain why this is happening (not only when): "A known concern with stLFR linked-read sequencing is the loss of barcode specificity during analysis."
      x) I don't understand what "Length-weighted average (μFL) and unweighted average (WμFL) of DNA fragment lengths" means in the figure. One of them is just an average, but what is the second? The figure looks confusing.
      x) "LRTK supports reconstruction of long DNA fragments" - this section describes something else, more about statistics and data QC.
      x) "LRTK promotes metagenome assembly using barcode specificity" - please remove Supernova; it was never a metagenomic assembler. Check cloudSPAdes instead.
      x) "The superior assembly performance we have observed" - superior compared to what? If a comparison is intended, some short-read benchmark should be included.
      x) "LRTK improves human genome variant phasing using long range information" - What dataset is this? What callset was used for ground truth? Briefly describe how the comparisons were done.
      x) Figures 5F-G together are very confusing. First, I don't expect tools like LinkedSV to have high recall (around 1.0) and low precision. Also, figure G is in a sense a subset of figure F, but the results are completely different. Also, use explicit notation; e.g. 50-1kbp and 1-10kbp mean completely different things.
      x) "We curated one benchmarking dataset and two real datasets to demonstrate the performance of LRTK" - what do you mean by "curation" here?
      x) Why don't you use the Tell-Seq barcode whitelist mentioned here: https://sagescience.com/wp-content/uploads/2020/10/TELL-Seq-Software-Roadmap-User-Guide-2.pdf
      x) The tiered alignment approach is vaguely introduced. It is not clear what "n% most closely covered windows" means, or how a subset of reference genomes is selected for the second phase.

    2. benchmarking and three real linked-read data sets from both the human genome and metagenome. We showcase LRTK’s ability to generate comparative performance results from the preceding benchmark study and to report these results in publication-ready HTML document plots. LRTK provides comprehensive and flexible modules along with an easy-to-use

      Reviewer 2: Lauren Mak Summary: This manuscript describes the need for a generalized linked-read (LR) analysis package and showcases the package the authors developed to address this need. Overall, the workflow is well designed, but there are major gaps in the benchmarking, analysis, and documentation process that need to be addressed before publication.
      Documentation: The purpose of multiple tool options: While the analysis package is technically sound, one major aspect is left unexplained: why are there so many algorithm options included without guidance as to which one to use? There are clearly performance differences between algorithms (combinations of 2+ are not considered either) on different types of LR sequence data.
      Provenance of ATCC-MSA-1003: Nowhere in the manuscript is the biological and technical composition of the metagenomic control described. It would be helpful to mention that this is specifically a mock gut microbiome sample, as well as the relative abundances of the originating species and the absolute amounts of genetic material per species (e.g. as measured by genomic coverage) in the actual dataset. As a corollary, there should be standard deviations in any figures that display a summary statistic (e.g. Figure 3A: precision, recall, etc.) that seems to be averaged across the species in a sample. This includes Figure 3A and Figure 4A.
      Dataset details: There is no table indicating the number of reads for each dataset, which would be helpful in interpreting Figures 3 and 4.
      Open source?: There was no GitHub link provided, only a link to the Conda landing page. Are there thorough instructions provided for the package's installation, input, output, and environment management?
      Benchmarking: The lack of simulated tests: The above concern (expected performance on idealized datasets) is best addressed with simulated data, which was not done despite the fact that LRSim exists (and apparently the authors have previously written a tool for stLFR as well).
      Indels: What are the sizes of the indels detected? Why were newer tools, such as PopIns2, Pamir, or Novel-X, not tried as well?
      Analysis: Lines 166-169: Figure 1 panel A1 vs. B1: why do the distributions of estimated fragment sizes from the 10x datasets look so different in metagenomic vs. human samples, when there is reasonable consistency in the TELL-Seq and stLFR datasets?
      Lines 182-184: Figure 3A: why is LRTK's taxonomic classification quality generally lower than that of the other tools? At least in terms of recall, it should perform better, as mapping reads to reference genomes should have a lower false negative rate than k-mer-based tools. Also, what is the threshold for detecting a taxon? Is it just any number of reads or is there a minimum bound?
      Lines 187-188: Figure 3B: at least 15% of each caller's set of variants is unique to that caller, while a maximum of 50% is universal. I would not interpret that as consistency.
      Lines 192-193: Are you referring to allelic imbalance as it is popularly used, i.e. expression variation between the two haplotypes of a diploid organism? This clearly doesn't apply in the case of bacteria. If this is not what you're referring to, please define and/or cite the applicable definition.
      Lines 201-208: It's odd that despite the 10x datasets having the largest estimated fragment size, they have some of the smallest genome fractions, NGA50, and NA50. Why is this? Are they just smaller datasets, on average?
      Miscellaneous: UHGG: Please mention the fact that UHGG is the default database, as well as whether or not the user will be able to supply their own databases.
      Line 363: What does {M} refer to?
      Line 369: What does U mean here? Is this the number of uniquely aligned reads in one of the windows N that a multi-aligned read aligns to?
      Lines 371-372: What does "n% most closely covered windows" refer to?
      Lines 399-405: How are SNVs chosen for MAI analysis from the three available SNV callers?
      Lines 653-656: Which dataset was used for quality evaluation?
      Line 665: What do the abbreviations BAF and T stand for?

    3. Linked-read sequencing technologies generate high base quality reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to one specific sequencing platform. To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform-agnostic processing of linked-read sequencing data from both human genomes and metagenomes. LRTK provides functions to perform linked-read simulation, barcode error correction, read cloud assembly, barcode-aware read alignment, reconstruction of long DNA fragments, taxonomic classification and quantification, as well as barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically, and provides the user with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on two

      Reviewer 1: Brock Peters Yang et al. describe a package of tools, LRTK, for co-barcoded reads (linked reads), agnostic of library preparation methods and sequencing platforms. In general, it appears to be a very useful tool. I have a few concerns with the manuscript as it is currently written:
      1. Line 203: "With Pangaea, LRTK achieves NA50 values of 1.8 Mb and 1.2 Mb for stLFR and TELL-Seq sequencing data, respectively. On 10x Genomics sequencing data, Athena exhibited superior assembly performance, with a NGA50 of 245 Kb." These two sentences are a bit awkward, as you compare NA50 values for stLFR and TELL-Seq but NGA50 for 10x Genomics, which makes it sound like 10x Genomics performed the best. Also, these numbers don't seem to agree with the figure.
      2. How long does an average run take to process, say, a 35X human genome coverage sample? Are there requirements for memory? A figure and metrics around this sort of thing would be helpful.
      3. How much data was used per library? What was the total coverage? Was the data normalized to have the same coverage per library? If not, it's very difficult to make fair comparisons between the different technologies.
      4. There's a section on reconstruction of long fragments, but there isn't really any evaluation of this result, and it's not clear if these fragments are even used for anything. For all of these sequencing types, I would assume that you can't really do much in the way of seed extension, since the coverage across long fragments for these methods is much less than 1X. I think this needs to be developed a little more, or it needs to be explained how these are used in your process, or you just need to say you didn't use them for anything and describe some potential applications they could be used for. What type of file is output from this process? I think it's interesting, but it's just not clear how to use this data.
      5. I did try to install the software using Conda, but it failed and it's not clear to me why. Perhaps it's something about my environment, but you might want to have some colleagues at different institutions try to install it to make sure it is easy to do so.

    1. Results The cupuassu genome spans 423 Mb, encodes 31,381 genes distributed across the ten chromosomes, and exhibits approximately 65% gene synteny with the T. cacao genome, reflecting a conserved evolutionary history, albeit punctuated with unique genomic variations. The main changes are marked by bursts of long terminal repeat retrotransposon expansion following species divergence, retrocopied and singleton genes, and gene families displaying distinctive patterns of expansion and contraction. Furthermore, positively selected genes are evident, particularly among retained and dispersed, tandem and proximal duplicated genes associated with general fruit and seed traits and defense mechanisms, supporting the hypothesis of potential episodes of subfunctionalization and neofunctionalization following duplication, and an impact from distinct domestication processes. These genomic variations may underpin the differences observed in fruit and seed morphology, ripening, and disease resistance between cupuassu and the other Malvaceae species.

      Reviewer 2: Jian-Feng Mao Rafael et al. contributed their study, "Genomic decoding of Theobroma grandiflorum (cupuassu) at chromosomal scale: Evolutionary insights for horticultural innovation". In this study, a high-quality genome assembly for an important plant was generated, and the authors further investigated genome characterization, genome evolution, gene families, etc. The data quality is high, though some points need to be clarified, and the reported data and investigations could provide valuable inference for subsequent studies. This paper is generally well prepared.
      Major comments:
      1. Quality control of the genome assembly. The quality of the genome assembly could be better evaluated with more stringent parameters. On assembly quality control, I recommend always following the criteria established by the Earth BioGenome Project (Report on Assembly Standards, https://www.earthbiogenome.org/assembly-standards). Please evaluate the present assemblies against the EBP criteria, on at least some if not all of the items. At the least, I think Merqury results would be very informative (Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9).
      2. Gaps in each pseudo-chromosome. It is not clear whether gaps remain, or whether the genome is gap-free.
      3. Centromere regions. How were centromeres identified? Centromeres were shown, but there is no description of how they were identified. Given the high quality of the genome assembly, it would be very interesting to incorporate an investigation into the distribution of centromeres. A pipeline, Centromics (https://github.com/ShuaiNIEgithub/Centromics), which identifies centromeres with multi-omics data such as repeat profiling and Hi-C chromatin contacts (generally described at https://academic.oup.com/hr/article/10/1/uhac241/6775201?login=true), has already been prepared and widely applied in data analyses in some recently published T2T assemblies.
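      For context on what the recommended Merqury evaluation reports: its consensus quality value (QV) is derived from the fraction of assembly k-mers absent from the read set. Below is a minimal sketch of that published formula; the counts would come from meryl databases, and the function and argument names here are illustrative.

      ```python
      import math

      def merqury_qv(asm_kmers_total, asm_kmers_missing_in_reads, k=21):
          """Merqury-style consensus QV from k-mer counts.

          P(base correct) = (shared k-mers / total assembly k-mers)^(1/k);
          QV = -10 * log10(1 - P).
          """
          shared = asm_kmers_total - asm_kmers_missing_in_reads
          p_correct = (shared / asm_kmers_total) ** (1.0 / k)
          error = 1.0 - p_correct
          return float("inf") if error == 0 else -10.0 * math.log10(error)
      ```

      As a point of reference, a QV of 40 corresponds to one consensus error per 10 kb.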

    2. Background Theobroma grandiflorum (Malvaceae), known as cupuassu, is a tree indigenous to the Amazon Basin, valued for its large fruits and seed-pulp, contributing notably to the Amazonian bioeconomy. The seed-pulp is utilized in desserts and beverages, and its seed butter is used in cosmetics. Here, we present the sequenced telomere-to-telomere cupuassu genome, disclosing features of the genomic structure, evolution, and phylogenetic relationships within the Malvaceae.

      Reviewer 1: Xupo Ding 1. Line or page numbers should be added in the revised manuscript; it is hard to point a comment to a definite line.
      2. The methods and parameters of the TE analysis should be detailed in the main text or supplementary file, especially for the LAI calculation; the LAI output by our pipeline is 11.47, and the pipeline was built according to the default parameters of LTR_retriever (https://github.com/oushujun/LTR_retriever).
      3. What was the mutation rate (r) used for the TE insertion time calculation? If the insertion times came from the original files of EDTA, please note that the default r is 1.3e-8 for the grass family when --u is not set when running EDTA; the times should be converted with the correct r value.
      4. Generally, Gypsy content is usually higher than Copia content in plant genomes; please check this. If it is correct, please infer the reason.
      5. All GO enrichment results would be better complemented with KEGG enrichment.
      6. The enrichment results were written hastily; many GO functions or GO numbers are simply listed, and more detail is needed. Cite the figures, tables, or references in these sections.
      7. In Figure 1C, the Ks distribution needs correcting; the authors can refer to the polyploidization of the durian genome published in Plant Physiology in 2019.
      8. In Figure 2C, why do some TE orders lack the SD?
      9. In Figure 3A, T. grandiflorum and T. cacao present high synteny at the gene level; the software Liftoff might detect extra genes in the T. grandiflorum genome based on the T. cacao genome. This is just a suggestion.
      10. In Figure 5A, there are 282 genes specific to T. grandiflorum; please perform GO and KEGG enrichment on them.
      11. Figures 5B and 5D are from the GO enrichment; the GO numbers should be added alongside the annotations or listed in the supplementary files.
      12. In Figure 5C, the confidence interval of the divergence time should be added.
      13. In the data availability section, the web link is not accessible to everyone; GigaDB will record your data, so the inaccessible web link might not be necessary.
      14. In the MS, disease resistance is mentioned repeatedly, and the GO enrichment has provided some evidence; it would be better to perform KEGG analysis on the specific genes and the expanded or contracted gene families to verify this, especially noting the changes in ko04626.
      15. The language must be improved and modified by a native academic English speaker.
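      On point 3: LTR insertion times follow T = K / (2r), with K the divergence between an element's two LTRs, so times reported under the wrong rate can simply be rescaled rather than recomputed. A minimal sketch; the non-default rate below is a placeholder, not a recommendation for this genome.

      ```python
      def ltr_insertion_time(divergence, mutation_rate):
          """T = K / (2r): K is the divergence between an element's two LTRs,
          r the per-site, per-year mutation rate."""
          return divergence / (2.0 * mutation_rate)

      def rescale_insertion_time(reported_time, rate_used=1.3e-8, rate_correct=7.0e-9):
          """Rescale times computed under the wrong rate: T2 = T1 * r1 / r2.
          1.3e-8 is EDTA's grass-family default; the 'correct' rate here is a
          placeholder to be replaced with a lineage-appropriate estimate."""
          return reported_time * rate_used / rate_correct
      ```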

    1. Results Here we introduce Hecatomb, a bioinformatics platform enabling both read- and contig-based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. Hecatomb integrates data collected throughout the workflow, enabling analyst-driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb.

      Reviewer 2: Satoshi Hiraoka In this manuscript, the authors developed a novel pipeline, Hecatomb, for viral genome analysis using metagenome and virome data, accepting both short- and long-read sequencing data. Using the pipeline, the authors performed analyses of one virome and one metagenome dataset from different environments (stool and coral reef, respectively). The analyses showed results consistent with the original studies, and moreover discovered candidate novel phages and new findings that may offer great insight into microbial ecology. The manuscript is overall informative and well written. Hecatomb incorporates well-known bioinformatics tools that are frequently used in viral genome analyses today, allowing many researchers, including beginners, to examine virome datasets easily and effectively. Thus the pipeline is likely valuable and should contribute to a wide range of studies of viruses, most of which are uncultured and of unknown characteristics. Noteworthy, there is an informative documentation page (https://hecatomb.readthedocs.io/en/latest/) including tutorials, which is very helpful for many users; I think this point could be more emphasized in the manuscript. However, unfortunately, the lack of an analysis of a mock dataset makes it hard to estimate the accuracy of the pipeline. I think adding this kind of analysis to evaluate performance would greatly improve the study. I have some suggestions that would increase the clarity and impact of this manuscript if addressed.
      Major: In general, to clearly evaluate the efficiency of novel bioinformatic tools and pipelines, benchmarking using ground-truth datasets is important before application to real datasets. To achieve this, in this case, artificial datasets composed of known viral and prokaryotic genomes with defined composition, library types (single- and paired-end), and sequenced read lengths (current short and long reads) could be designed as mock metagenome data. Via analysis of the mock datasets, the accuracy of the pipeline can be evaluated. It would be appreciated if the authors performed such benchmarking tests as well as the real data applications. According to the GitHub page, Hecatomb is designed to generate results that reduce false-positive and enrich for true-positive viral read detection. This point is important for understanding the purpose of developing the pipeline and for differentiating it from other tools. The efficiency of the false-positive reduction using this pipeline should be clearly shown in this manuscript; for this too, the mock dataset analyses are expected.
      When I read the manuscript, I was confused about what kind of dataset the pipeline is aiming for. Is Hecatomb designed to analyze common prokaryotic shotgun metagenomic data to detect viruses? In other words, is the pipeline not limited to analyzing viral metagenomes (viromes), in which viral particles are specifically enriched from the samples for sequencing (e.g., density centrifugation to condense viral particles)? The stool samples were likely virome datasets (viral particles were enriched via 0.45-μm-pore-size membrane filtration according to the article), whereas the coral reef data are metagenome datasets. I would suggest that the terms "viral metagenome" (or virome, specifically targeting only viruses) and common "metagenome" (mainly focusing on prokaryotes) be clearly distinguished throughout the manuscript, including the title.
      I also wonder about the sequence clustering step in Module 1. In my understanding, in metagenomic settings genomic regions are randomly sequenced, and thus most of the sequenced reads will not be clustered together using the criteria described in the manuscript, so not many sequences are removed in this step. Is this step truly needed? Please add more explanation of the importance of this step. For example, what fraction of reads was removed in this step in the tests on the two real datasets (stool and coral reef)?
      Minor: The introduction section is informative but a bit long; it could be shortened. Some viruses were newly found using the pipeline (e.g., Fig 1A); which ones belong to which virus types (dsDNA, ssDNA, dsRNA, ssRNA)? It would be better to show this information clearly in the figure. Sequences derived from RNA viruses are generally not abundant in typical metagenomics datasets unless specific experimental techniques are used; the potential for detecting RNA viruses from typical metagenomic DNA sequencing reads should be discussed in the Introduction section.
      L103. Please describe in this article where the name "Hecatomb" is derived from, though this is shown on the GitHub page.
      L119. "round A/B libraries" is used here, but I have not heard of, and could not find, this term in the articles cited here. Please add more explanation of what "round A/B libraries" are.
      L130. Up to 2 insertions and deletions?
      L131. Is BBMap included in BBTools [73]?
      L181. A brief explanation of the "Baltimore classification" here would improve readability for readers who are not familiar with it.
      L239. There is no prior explanation of what "SIV" means.
      L253-L268 & Figure 4B. According to Figure 2A, there are two paths (1,2,5: aa and 1,3,4,5: nt) for detecting viral reads. I am interested in which path is major and which is minor. Could the authors provide the ratio of reads predicted using aa or nt for each dataset (stool and coral)?
      L431, L436. Not only BioProject but also SRA accession IDs should be provided.
      L479. There is no LACC here. What is his main contribution? Just reviewing and editing the manuscript is insufficient for citation as an author: see https://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html#two
      Figure 1. Some DBs are newly created and used in the pipeline (e.g., Viral AA DB, Multi-kingdom AA DB, Virus NT DB, and Polymicrobial MT DB). It would be better to show how the DBs were made in this or another figure; this would contribute to understanding how the DBs are constructed and why they are used in this pipeline.
      Figure 1. Specify (1)-(4) in the legend, not just by color.
      Figure 4A. Please provide the total number of sequencing reads in addition to the read count assigned to each virus.
      Figure 4C. CPM is not explained in the manuscript and not listed in L460.
      L490. Some references are incomplete, e.g., lacking article IDs or page numbers (49, 79, 90, 94, 95, 96, 100, 101, 102), or retaining unnecessary words ("academic.oup.com" in 90, 91), etc. Please check the reference list carefully.
      Figure S5. Alignment length (bp).
      Table S2. For calculating the best-hit identity, what database was used?
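      On the Figure 4C point, CPM conventionally means counts per million mapped reads; a minimal sketch under that assumed definition (the manuscript may define it differently):

      ```python
      def cpm(read_count, total_mapped_reads):
          """Counts per million: reads assigned to a feature (here, a virus),
          scaled by the sample's total mapped reads, times one million."""
          return 1e6 * read_count / total_mapped_reads

      # e.g. 1,234 reads assigned to a phage in a library of 20 million mapped reads
      print(cpm(1234, 20_000_000))  # -> 61.7
      ```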

    2. Background Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Due to the lack of conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences relies on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences as well as assembled contigs. Combined, virome analysis results in large amounts of data requiring sophisticated visualization and statistical tools.

      Reviewer 1: Arvind Varsani The MS titled "Hecatomb: An Integrated Software Platform for Viral Metagenomics" addresses the development of a toolkit for viral metagenomics analysis that assembles a variety of tools into a workflow. Overall, I do not have any issue with this MS or the toolkit. I have some minor points to help improve the MS and make it as current as possible.
      1. Line 40: I would include Cenote-Taker 2 (PMID: 33505708) and geNomad (https://www.biorxiv.org/content/10.1101/2023.03.05.531206v1).
      2. Line 40: I would probably not cite the preprint of this current paper - see ref 21.
      3. Line 80: Actually Cenote-Taker (both versions 1 and 2) uses HMMs and, as far as I know, so does geNomad.
      4. Line 248: Please note that Siphoviridae, Podoviridae and Myoviridae are not currently family names; see PMID: 36683075.
      5. This means you will likely need to edit your figure to collapse these into Caudovirales.
      6. Lines 250-251: Picornaviridae and Adenoviridae should be in italics.
      7. Line 270: Here and elsewhere, please note that a taxon does not infect a host; it is a virus that infects a host. "Mimiviridae, that infect Acanthamoeba, and Phycodnaviridae, that infect algae, are both dsDNA viruses with large genomes" should ideally be written as "Viruses in the family Mimiviridae infect Acanthamoeba and those in the family Phycodnaviridae infect algae; both are dsDNA viruses with large genomes."
      8. Figure 6: the name tags of the CDS/ORFs are truncated, e.g. replication initiate…, heat maturation prot…
      9. Figure 6: Major head protein should be major capsid protein.
      10. One thing that I would highlight is that none of the workflows/toolkits developed account for spliced CDS. This is a major issue in the automation of virus genome annotation at the moment, and because of it there will be some degree of misidentification in taxa assignment.

    1. Findings The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per-sample error rate and coverage bias (i.e. missing sites) can also be estimated from this information and used to determine whether a spatially indexed PCA-based pre-screening method can be applied, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.
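      To make the likelihood-ratio idea concrete, the sketch below compares allele k-mer counts at biallelic sites under a shared-genotype hypothesis versus an independent-genotypes hypothesis, with a uniform genotype prior. This is a toy illustration of the general technique, not ntsm's actual model; the genotype set and error handling are simplifying assumptions.

      ```python
      import numpy as np
      from scipy.stats import binom

      def log_likelihood_ratio(counts1, counts2, error=0.01):
          """Toy likelihood-ratio score: do two read sets share a genotype?

          counts1, counts2: integer arrays of shape (n_sites, 2) holding the
          k-mer counts supporting the reference and alternate allele at each
          biallelic site. Genotypes considered are hom-ref, het and hom-alt,
          with expected alt-allele k-mer fractions (error, 0.5, 1 - error).
          """
          fracs = np.array([error, 0.5, 1.0 - error])

          def per_genotype_loglik(counts):
              # log P(alt count | depth, genotype) for each site x genotype
              depth = counts.sum(axis=1, keepdims=True)
              alt = counts[:, 1:2]
              return binom.logpmf(alt, depth, fracs[None, :])

          ll1, ll2 = per_genotype_loglik(counts1), per_genotype_loglik(counts2)
          prior = np.log(1.0 / 3.0)  # uniform prior over the three genotypes
          # H_same: both samples share one (unknown) genotype per site
          same = np.logaddexp.reduce(ll1 + ll2 + prior, axis=1)
          # H_diff: genotypes are drawn independently for the two samples
          diff = (np.logaddexp.reduce(ll1 + prior, axis=1)
                  + np.logaddexp.reduce(ll2 + prior, axis=1))
          # A positive total favours "same individual"
          return float((same - diff).sum())
      ```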

      Reviewer 2: Qian Zhou In this paper, the authors have presented a tool, ntsm, which utilizes k-mer distribution information directly from raw sequencing data for sample swap detection. The approach of bypassing the reference genome alignment step and saving computational resources is commendable. Utilizing k-mers for reference-free and de novo analysis of sequencing data is a valuable application. The authors have demonstrated the impressive performance of ntsm on low-coverage data through the experimental results presented in the manuscript, showcasing its strengths in terms of sensitivity and accuracy. However, while ntsm eliminates the need for reference genome alignment, it still relies on a pre-defined set of variant sites and pre-built PCA rotation matrices. This raises doubts about the truly reference-free nature of ntsm and concerns about its generalizability to other species.
      Major comments:
      1. The concept of reference-free: I believe that ntsm's approach is not truly reference-free. To use ntsm, existing high-quality population SNP sites and k-mers from the human reference genome are required. Additionally, the population PCA results are used to assist in pairwise comparisons between samples. Both of these can only be obtained when a reference genome is available. A truly reference-free tool would be applicable to species without a reference genome, such as SPLASH (Chaung et al., 2023, Cell). ntsm can instead be considered an alignment-free or k-mer-based tool.
      2. The reduction of computational costs: ntsm differs from Somalier in its computational workflow. To compare computational costs or time, a holistic end-to-end comparison is necessary, rather than separately timing individual steps such as k-mer counting and pairwise sample comparison. Conducting an end-to-end comparison of an analysis task allows users to gain a comprehensive understanding of the tool's time and cost consumption. Furthermore, when comparing software, it is important to allocate computational resources fairly. For example, ntsm utilizes 16 threads in the "Sample comparison process" stage, while in the "k-mer counting (ntsm) vs. alignment (somalier)" stage, tools like bwa and minimap2, which can utilize multiple threads, were run using a single thread.
      3. Sensitivity and specificity: More experimental details are needed. In the section "Sensitivity and Specificity of Sample Swaps", were the results obtained using the 39 HPRC samples? Did they include their Hi-C data? For Fig 6, did the results come from all sequencing datasets of the 39 samples, including Illumina and ONT? Since the results were obtained using full coverage, would the threshold change at lower coverage? For Fig 7, which demonstrates ntsm's results, was PCA information used as an auxiliary? Does the use of PCA information impact sensitivity and specificity?
      4. Regarding the PCA-based method: The 39 HPRC samples used in the study are actually part of the 3,202 samples from the 1000 Genomes Project. Therefore, it is important to clarify whether the PCA matrix used in the study already includes information from these 39 samples. From a rigorous experimental design perspective, a precomputed PCA matrix should not include information from the 39 samples; otherwise, the effect of the PCA matrix on these 39 samples may be overestimated. It also raises the question of whether the same results can be achieved on non-1000 Genomes Project samples.
      5. The applicability of the tool: To expand the applicability of ntsm to a wider range of species, two aspects need to be addressed: 1) Provide detailed information on customizing the sites file. From the site files available in the ntsm code repository on GitHub, the process of selecting variant sites seems to be more complex than what is described in the manuscript, involving more than just SNP variants. 2) The sites and PCA files should be user-customizable inputs instead of being built in. This limitation restricts the application of ntsm to other species.
      Minor comments: The manuscript appears to have been hastily written and requires further polish by the authors.
      1. In Figure 6, A and B seem to be labeled incorrectly.
      2. In Figure 9, the two subplots have different y-axes, one labeled "min" and the other labeled "s". Could you clarify what each subplot is illustrating?
      3. When mentioning HPRC for the first time, it would be helpful to provide the full name and an explanation of the acronym; however, the full explanation only appears in the next paragraph.
      4. "We then keep only purine to pyrimidine (A or T to G or C) variants, as final insurance against possible human error influencing this tool". There seems to be a mistake or confusion in this sentence: to accurately describe purine-to-pyrimidine variants, it should read "A/G <-> C/T" rather than "A/T <-> G/C". The writers may have made an error in describing the nucleotide exchange, or it could be a typographical mistake.
      5. There is a typo in the formula for estimating the sequencing error rate: (nm)·log(1-… …

    2. Background Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. a mix of Oxford Nanopore, Pacific Biosciences, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarity. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

      Reviewer 1: Jianxin Wang In this manuscript, the authors present a fast intra-species sample swap detection tool, named ntsm. By counting the relevant variant k-mers from samples, it estimates the probability of each allele at the sites and then uses a likelihood ratio test to detect sample swaps. Compared with the alignment-based method, Somalier, ntsm performs better on low-coverage data (≤5X) and is more efficient in terms of memory and computing time. The authors use a PCA-based spatial index heuristic to reduce the number of sample comparisons; of course, in my opinion, compared with the time spent on counting k-mers, the time saved by the PCA-based method is trivial. In addition, ntsm also provides other features such as error rate estimation. The tool requires population SNP information, which limits its applications in practice to some extent. Overall, ntsm is a fast and practical tool for calculating intra-species sample similarity and detecting sample swaps. The writing and experiments in this paper are generally well done. There are some major and minor issues that I suggest the authors consider addressing.
      Major issues: The paper mentions that, due to high error rates, nanopore data is difficult to analyze. Can the authors analyze the performance of ntsm on data with different error rates? In general, alignment-based methods may perform better on high-error-rate data; this is very useful information for users choosing a tool. The authors use the PCA-based spatial index heuristic to reduce the number of pairwise comparisons. However, the relation between PCA distance and similarity score is not clear here: how is it ensured that samples with similarity scores below the threshold are within the search radius? The paper involves two metrics, namely similarity score and relatedness, to detect sample swaps. Can the authors analyze the relation between them to help readers understand the advantages and disadvantages of the two methods?
      Minor issues: In the "Conclusions" section, the second "useful" in the sentence "this method provides other useful information useful in QC" is redundant. "R=1, p<2.2e-16" in Figure 3 is not explained. In the "Sequencing error rate estimation" section, the variable n is not explained. In Figure 9, the case of the first letter of the two y-axis labels (time) is inconsistent.

  17. May 2024
    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presents the genome of the golden birdwing butterfly Troides aeacus (Lepidoptera, Papilionidae), a notable and popular species in Asia that faces habitat loss due to urbanization and human activities. The lack of genomic resources impedes conservation efforts based on genetic markers, as well as a better understanding of its biology. Using PacBio HiFi long reads and Omni-C, a 351 Mb genome was assembled and anchored to 30 pseudo-molecules. After reviewers requested more information on the genome quality, it was shown to have high sequence continuity, with contig length N50 = 11.67 Mb and L50 = 14, and scaffold length N50 = 12.2 Mb and L50 = 13. A total of 24,946 protein-coding genes were predicted. This study presents the first chromosomal-level genome assembly of the golden birdwing T. aeacus, a potentially useful resource for further phylogenomic studies of birdwing butterfly species in terms of species diversification and conservation.

      This evaluation refers to version 1 of the preprint

    2. AbstractTroides aeacus, the golden birdwing (Lepidoptera, Papilionidae), is a large swallowtail butterfly widely distributed in Asia. Despite its wide occurrence, T. aeacus has been assigned as a major protected species in many places given the loss of its native habitats under urbanisation and anthropogenic activities. Nevertheless, the lack of genomic resources hinders our understanding of its biology and diversity, as well as the implementation of conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of T. aeacus using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (351 Mb) contains 98.94% of the sequences anchored to 30 pseudo-molecules. The genome assembly also has high sequence continuity with scaffold length N50 = 12.2 Mb. A total of 28,749 protein-coding genes were predicted, and high BUSCO completeness (98.9% of BUSCO metazoa_odb10 genes) was also revealed. This high-quality genome offers a new and significant resource for understanding swallowtail butterfly biology, as well as for carrying out conservation measures for this ecologically important lepidopteran species.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.122), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Dr. Kumar Saurabh Singh

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. 1. I've noticed that the genome assembly file has been uploaded to NCBI, but I couldn't locate the corresponding annotation files in GFF format. Additionally, I couldn't find gene models for Troides aeacus on NCBI or any other platform. As per GigaScience data policy, these files should be made publicly available. 2. The paper lacks information on the contig N50 and L50, although I did find this data on NCBI. Is there a specific reason for omitting the contig N50/L50 details from the main text or tables?
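
For readers comparing assemblies, the contig N50/L50 figures the reviewer asks for are straightforward to compute from contig lengths (a minimal self-contained sketch; the example lengths are invented): N50 is the length at which contigs of that length or longer cover half the assembly, and L50 is the number of contigs needed to reach that point.

```python
# Minimal sketch: compute N50/L50 from a list of contig lengths.
def n50_l50(lengths):
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i  # (N50, L50)

# Example: five contigs totalling 100 kb.
print(n50_l50([40_000, 25_000, 15_000, 12_000, 8_000]))  # (25000, 2)
```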

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. 1. I have noticed that the QV value is missing for the given assembly. To assess the base-level accuracy of the assembly, the authors should calculate the consensus quality (QV), comparing the frequency of k-mers present in the raw Omni-C reads (as only short reads from Omni-C are available) with those present across the final assembly, perhaps using Merqury. 2. Incorporating Omni-C data did not result in a significant increase in the contig N50. Have you identified any specific reasons for this outcome? 3. The overall BUSCO completeness for proteins appears to be disproportionately low (~86%) compared to genomic completeness (~98%). Could this be attributed to the absence of RNA-seq data for predicting accurate gene models?
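
For reference, the Merqury-style consensus QV the reviewer suggests estimates base accuracy from the fraction of assembly k-mers that are unsupported by the read set. With k-mer size $k$, $K_{\mathrm{total}}$ assembly k-mers, and $K_{\mathrm{asm}}$ of them absent from the reads, the published formulation is

$$P = \left(1 - \frac{K_{\mathrm{asm}}}{K_{\mathrm{total}}}\right)^{1/k}, \qquad \mathrm{QV} = -10\log_{10}(1 - P),$$

where $P$ is the estimated per-base accuracy of the consensus.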

      Is there sufficient data validation and statistical analyses of data quality?

      I believe it's essential to assess the assembly quality through comparative genomic analyses, a component seemingly missing from the manuscript. While the text mentions the availability of genomic resources within the same genus, conducting a genome-wide comparison of these assemblies could provide valuable insights into the overall synteny and contiguity of the T. aeacus assembly. To ensure annotation consistency, it's important to compare genome assemblies by generating distributions of intron/exon lengths for annotations across multiple assemblies.
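
The intron/exon length comparison the reviewer suggests is straightforward to set up from GFF3 annotations (a minimal sketch; the file name is hypothetical, and a real comparison would repeat this across the related Troides assemblies):

```python
# Minimal sketch: collect exon lengths from a GFF3 file so their
# distribution can be compared across assemblies/annotations.
# "annotation.gff3" is a hypothetical file name.
def exon_lengths(path):
    lengths = []
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8 and cols[2] == "exon":
                # GFF3 coordinates are 1-based and inclusive.
                lengths.append(int(cols[4]) - int(cols[3]) + 1)
    return lengths

lens = exon_lengths("annotation.gff3")
print(len(lens), sum(lens) / len(lens))  # count and mean exon length
```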

      Reviewer 2. Dr. Xueyan Li

      Link to review: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDk1L0dpZ2FieXRlRFJSLTIwMjQwMS0wMS1jb21tZW50cy5kb2N4

      Re-review: The paper has been substantially enhanced after the first revision. I suggest that this manuscript can be published after the following minor revisions:

      1. L279: 'formosanus' is also part of the scientific name and should be in italic type.

      2. It is recommended to beautify the figures and tables.

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the common chiton, Liolophura japonica (Lischke, 1873). Chitons are marine molluscs, found worldwide from cold waters to the tropics, that play important ecological roles in the environment, but to date they are lacking in genomes, with only a few assemblies available. This data was produced using PacBio HiFi reads and Omni-C sequencing data, the resulting genome assembly being around 609 Mb in size. From this, 28,010 protein-coding genes were predicted. After review improved the methodological details, the quality metrics indicate a near chromosome-level assembly, with a scaffold N50 length of 37.34 Mb and a 96.1% BUSCO score. This high-quality genome should hopefully be a valuable resource for gaining new insights into the environmental adaptations of L. japonica to life in the intertidal zone, and for future investigations into the evolutionary biology of polyplacophorans and other molluscs.

      This evaluation refers to version 1 of the preprint

    2. AbstractChitons (Polyplacophora) are marine molluscs that can be found worldwide from cold waters to the tropics, and play important ecological roles in the environment. Nevertheless, there remain only two chiton genomes sequenced to date. The chiton Liolophura japonica (Lischke, 1873) is one of the most abundant polyplacophorans found throughout East Asia. Our PacBio HiFi reads and Omni-C sequencing data resulted in a high-quality near chromosome-level genome assembly of ∼609 Mb with a scaffold N50 length of 37.34 Mb (96.1% BUSCO). A total of 28,233 genes were predicted, including 28,010 protein-coding genes. The repeat content (27.89%) was similar to that of the other Chitonidae species and approximately three times lower than in the genome of the Hanleyidae chiton. The genomic resources provided in this work will help to expand our understanding of the evolution of molluscs and the ecological adaptation of chitons.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.123), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Jin Sun

      Are all data available and do they match the descriptions in the paper?

      Yes. The assembly and annotations can be found in the Figshare.

      Is the validation suitable for this type of data?

      Yes. I have examined the HiC interaction map, and I think the scaffolding is high-quality.

      Additional Comments:

      The presentation is clear, but I would suggest the authors include the latest BUSCO score for the gene models.

      Reviewer 2. Priscila M Salloum

      Is the language of sufficient quality?

      Yes. The language is appropriate and does not hinder understanding, but some minor proof reading could benefit the manuscript. I left a few suggestions in my comments to the authors.

      Are all data available and do they match the descriptions in the paper?

      No. The data made available on NCBI has the 632 scaffolds, but the 13 pseudomolecules are not shown (in GCA_032854445.1, under Chromosomes, it reads “This scaffold-level genome assembly includes 632 scaffolds and no assembled chromosomes”), please clarify where information/data for the 13 pseudomolecules can be found. The figshare repository has the annotation files, but it lacks a metadata file detailing what each of the annotation files is (the file names are descriptive, but they do not replace a metadata file). The data availability statement lacks information about the transcriptomes (were these made available?) Supplementary tables are mentioned in the text file but were not made available (at least not for review).

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. All that was provided was consistent.

      Is the data acquisition clear, complete and methodologically sound?

      No. Some clarification is needed (was the same sample used for the genome and transcriptome assembly? Were the different tissues processed in the same way? What software were used for all the bioinformatics steps? What were all the parameters and filters used for genome and transcriptome assembly and annotation?) I left specific suggestions in a file with additional comments to the authors.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. Software versions, citations, and parameters are missing from the methods section. Some results refer to methods not explained in the methods section.

      Is the validation suitable for this type of data?

      Yes. More details on the BlobTools parameters used are needed.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. Supplementary tables were mentioned but not provided (at least not for review). There is enough information for others to reuse the genome data, although more information in the methods section (as mentioned above) and a metadata file would make this even more useful. There is no mention of where the transcriptome has been deposited, and only an extremely brief mention of how it was assembled (e.g., no details on parameters used or software versions).

      Additional Comments: Please include all citations in the reference list.

      And see additional file with comments: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZV9pZD00OTYmZmlsZT0xOTgmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ==

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the long-spined sea urchin Diadema setosum (Leske, 1778). Using PacBio HiFi long reads and Omni-C data, the assembled genome size was 886 Mb, consistent with the size of other sea urchin genomes. The assembly was anchored to 22 pseudo-molecules/chromosomes, and a total of 27,478 genes, including 23,030 protein-coding genes, were annotated. Peer review added more to the conclusion and future perspectives. The data hopefully provides a valuable resource and foundation for a better understanding of the ecology and evolution of sea urchins.

      This evaluation refers to version 1 of the preprint

    2. AbstractThe long-spined sea urchin Diadema setosum is an algal and coral feeder widely distributed in the Indo-Pacific that can cause severe bioerosion on the reef community. Nevertheless, the lack of genomic information has hindered the study of its ecology and evolution. Here, we report the chromosomal-level genome (885.8 Mb) of the long-spined sea urchin D. setosum using a combination of PacBio long-read sequencing and Omni-C scaffolding technology. The assembled genome has a scaffold N50 length of 38.3 Mb, contains 98.1% of BUSCO (Geno, metazoa_odb10) genes, and has 98.6% of the sequences anchored to 22 pseudo-molecules/chromosomes. A total of 27,478 genes, including 23,030 protein-coding genes, were annotated. The high-quality genome of D. setosum presented here provides a significant resource for further ecological and evolutionary studies of this coral reef-associated sea urchin.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.121), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Phillip Davidson

      Is the language of sufficient quality?

      Yes. Minor language errors that should be corrected in copy-editing

      Additional Comments:

      In their work, Hui et al. present a chromosome-level genome assembly for Diadema setosum, the long-spined urchin. This new data is especially exciting given that no high-quality genomic resource for the Diadematoida is available, bolstering comparative genomics work on echinoderms and the study of this species. Overall, the methods and data are well described and have produced a high-quality genome assembly and associated annotations that will be a valuable addition to the community. I have a handful of primarily minor suggestions detailed below:

      Major comments:

      1. Conclusions and future perspectives: Currently, this section is only a sentence and states the new assembly will “further understanding of ecology and evolution of sea urchins”, which I think is a little uninspiring. I think more detail can be provided in this section to explain how this genome assembly adds to current knowledge. For example, reiterating that this is the first chromosome-level Diadematoida assembly, or perhaps explaining with examples how a good reference genome can inform ecological studies. Overall, the significance of this work is not really explained which I think sells this nice work short.

      Minor comments:

      1. Lines 232-233 state the mean coding sequence is 483 bp which seems a bit low, but having examined the peptide fasta file, I believe the average amino acid length is 483 AA, giving an average coding sequence length of ~1449bp. Please confirm and correct if necessary. This would also increase the total # of coding basepairs listed in Table 1.

      2. Lines 66-71: The authors state there are 5 chromosome-level sea urchin assemblies, all of which are camarodonts. However, I believe there are at least three additional chromosome-level assemblies for sea urchins not mentioned: 1) Echinometra sp. EZ (Ketchum et al, 2022; https://academic.oup.com/gbe/article/14/10/evac144/6717576 ) and 2) Paracentrotus lividus (Marletaz et al, 2023; https://www.sciencedirect.com/science/article/pii/S2666979X23000617?via%3Dihub ) and 3) Strongylocentrotus purpuratus (https://www.echinobase.org/echinobase/) Further, P. lividus is not a camarodont, so the text should be corrected accordingly.

      3. Lines 106: Please state whether the individual sampled for genome sequencing was male or female

      4. Lines 54-54: The BUSCO score is reported at 98.1%, but it should be specified whether this is the complete BUSCO score or the single-copy BUSCO score. Ideally, the single-copy and duplication scores, rather than the complete score, should be reported so readers have an idea of the duplication rate/haploid-ness of the assembly. Same issue on line 221. Thank you for reporting these in Table 1.

      5. Line 56: Text states “27,478 genes including 23,030 protein coding genes” were annotated. Augustus often outputs genes and transcripts, so I am wondering if the authors mean 27K transcripts including 23K genes. If so, the authors should clarify. If not, I think a brief statement of what these additional 4K genes are would be informative

      6. Table 1: Please clarify if “HiFi (X): 21” is referring to 21X coverage. Please correct length of coding sequence to amino acid sequence, and total coding sequence length. Same with Figure 1 panel B.
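
Two of the comments above involve simple unit arithmetic worth making explicit (both figures are the reviewer's, not newly derived): a mean protein length of 483 aa implies roughly 483 × 3 = 1,449 bp of coding sequence (plus a stop codon), and a depth of "21X" for an 886 Mb genome corresponds to about 18.6 Gb of HiFi bases, since coverage = total sequenced bases / genome size. A stdlib-only sketch for spot-checking the mean peptide length ("proteins.faa" is a hypothetical file name):

```python
# Minimal sketch: mean sequence length from a peptide FASTA.
# "proteins.faa" is a hypothetical file name.
def mean_fasta_length(path):
    total = count = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                count += 1
            else:
                total += len(line.strip())
    return total / count

# A mean of ~483 aa here would imply ~1,449 bp of coding sequence.
print(mean_fasta_length("proteins.faa"))
```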

      Reviewer 2. Remi Ketchum

      Minor Edits

      Line 62: Change to "lack a vertebral column" instead of "lack the"

      Line 64: Change to "sea urchins" instead of "sea urchin"

      Line 70: Ketchum et al. 2022 in GBE produced a chromosome-level genome assembly of Echinometra sp. EZ, so this citation should be included here.

      Line 91: Change to "results in a reduction in coral community complexity"

      I think that the end of the introduction could use a sentence or two that explicitly states why this genome will be a valuable resource to the scientific community. I think this will also help wrap up the introduction.

      Line 101: Can you provide coordinates? Also, could you remove the word "alive"?

      Line 130: I am confused by what you mean by "the sample was then proceeded".

      Line 181: Was this the same individual that you used for genomic DNA isolation?

      Line 196: Please could you include the specific flags that you used for purge_dups? Did you run Hifiasm with the default parameters?

      Line 240: I would definitely try and include some more sentences in this section.

      Line 253: Is this section supposed to be here? I think this is meant to go into the methods section.

      The authors could consider adding a comparison table of the statistics of the different urchin genomes currently available. I would also encourage the authors to generate KAT plots to validate that they have successfully collapsed the haplotypes – a common problem with higher heterozygosity.

      Reviewer 3. F. Marlétaz

      I think it would be great to give further detail on the statistics from the hifiasm contigging step. What are the contig statistics (after the hifiasm step)?

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presents the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. Using PacBio and Omni-C data, a 29.2 Mb genome was assembled, with a scaffold N50 of 1.925 Mb and a 92.0% BUSCO score demonstrating its quality (review pushed the authors to provide more detail and QC statistics to better support this). This data provides a useful resource for further phylogenomic studies in the family Dacrymycetaceae and investigations of the biosynthesis of glycolipids with potential applications in the food industry.

      This evaluation refers to version 1 of the preprint

    2. AbstractThe edible jelly fungus Dacryopinax spathularia (Dacrymycetaceae) is wood-decaying and can be commonly found worldwide. It has also been used in food additives given its ability to synthesize long-chain glycolipids. In this study, we present the genome assembly of D. spathularia using a combination of PacBio HiFi reads and Omni-C data. The genome size of D. spathularia is 29.2 Mb and in high sequence contiguity and completeness, including scaffold N50 of 1.925 Mb and 92.0% BUSCO score, respectively. A total of 11,510 protein-coding genes, and 474.7 kb repeats accounting for 1.62% of the genome, were also predicted. The D. spathularia genome assembly generated in this study provides a valuable resource for understanding their ecology such as wood decaying capability, evolutionary relationships with other fungus, as well as their unique biology and applications in the food industry.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.120), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Anton Sonnenberg

      Is the language of sufficient quality? Yes.

      Are all data available and do they match the descriptions in the paper? Yes.

      Is the data acquisition clear, complete and methodologically sound? Yes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes.

      Figure 1E could be improved by eliminating the non-repeat sequences from the pie chart, or by presenting the repeats as a bar plot. That would better visualize the frequencies of each type of repeat.

      Reviewer 2. Riccardo Iacovelli

      Is the language of sufficient quality? No.

      There are several typos spread across the text, and some sentences are written in an unclear manner. I provide some suggestions in the attachment.

      Are all data available and do they match the descriptions in the paper?

      Yes, but some of the data shown is rather unclear and/or not supported by sufficient explanation. For example, what is actually Fig. 1C showing? Because the reference in the text (which contains a typo, line 197) refers to something else. What is the second set of stats in Fig. 1B? This other organism is not mentioned at all anywhere in the manuscript.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. The NCBI TaxID of the species sequenced in this work is missing.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. In my opinion, some of the procedures described for the processing of the sample and library prep for sequencing are reported in an unclear way. For example, lines 100-103: no details on the RNase A treatment; how do you define chloroform:IAA (24:1) washes? How much supernatant is added to how much H1 buffer to reach the final volume of 6 ml? Another example, lines 180-175: what parameters did you use for EvidenceModeler to generate the final consensus gene models? The weight given to each particular prediction set is important.

      Is there sufficient data validation and statistical analyses of data quality?

      No. While sufficient data validation and statistical analyses have been carried out with respect to DNA sequencing and genome assembly, nothing is reported about DNA extraction and quality. The authors mention several times throughout the text that DNA preps are checked via NanoDrop, Qubit, gel electrophoresis, etc., but none of this is shown in the main body or in the supplementary information. Without this information, it is difficult to directly assess the efficacy of the DNA extraction and preparation methods. I recommend including this type of data.

      Additional Comments:

      In this article, the authors report the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. In general, I find the data of sufficiently high quality for release, and I agree with the authors that it will prove useful for gaining further insights into the ecology of the fungus, and for better understanding the genetic basis of its ability to decay wood and produce valuable compounds. This can ultimately lead to discoveries with applications in biotech and other industries.

      Nevertheless, during the review process I noticed several shortcomings with respect to unclear language, insufficient description of the experimental procedures and/or results presented, and missing data altogether. These are all discussed within the checklist available in the ReView portal. For minor comments line-by-line, see below:

      1: Dacrymycetaceae should be italicized (throughout the whole manuscript). This follows the convention established by The International Code of Nomenclature for algae, fungi, and plants (https://www.iaptglobal.org/icn). Although not binding, this allows easy recognition of taxonomic ranks when reading an article.

      49: other fungus -> other fungi

      56: photodynamic injury -> UV damage/radiation (photodynamic is used with respect to light-activated therapies etc.)

      60: in food industry as natural preservatives in soft drinks -> in food industry to produce natural preservatives for soft drinks

      68: cultivated in industry as food additives -> cultivated in industry to produce food additives

      69: isolated fungal extract -> the isolated fungal extract

      71: What do you mean by Pacific? It's unclear

      71-72: the genomic resource -> genomic data / genome sequence

      72: I would remove "with translational values", it is very vague and does not add anything to the statement

      78: genomic resource -> genomic data / genome sequence

      78-81: this could be rephrased in a smoother manner: e.g. something like "the genomic data will be useful to gain a better understanding of the fungus' ecology as well as the genetic basis of its wood-decaying ability and…"

      85: fruit bodies -> fruiting bodies

      88-89: Grown hyphae from >2 week-old was transferred -> Fungal hyphae from 2-week old colonies were transferred

      90-91: validated with the DNA barcode of Translation -> assigned by DNA barcoding using the sequence of Translation…

      95: ~ -> Approximately (sentences are not usually started with symbols or numbers)

      101-3: Procedure is not clear enough (see other comments through ReView portal)

      124: for further cleanup the library -> to further clean up the library / for further cleanup of the library

      132: as line 95

      152: as lines 95, 132

      181-5: Insufficient description of methods, see comments through ReView portal

      197: Figure and 1C; Table 2 -> Figure 1C and Table 2

      200: average protein length of 451 bp -> average protein-coding gene length / average protein length of ~150 amino acids

      211: via the fermentation process with applications in the food industry -> via the fermentation process with potential applications in the food industry

      As a fungal biologist myself interested in fungal genomics and biotechnology, I would like to thank the authors for carrying out this work and the editor for the opportunity to review it. I am looking forward to reading the revised version of the manuscript.

      Riccardo Iacovelli, PhD GRIP, Chemical and Pharmaceutical Biology department University of Groningen, Groningen - The Netherlands

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the milky mangrove Excoecaria agallocha, also known as the blind-your-eye mangrove due to the toxic properties of its milky latex, which can cause blindness when it comes into contact with the eyes. Living in the brackish water of tropical mangrove forests from India to Australia, these trees are an extremely important habitat for a diverse variety of aquatic species, including the mangrove jewel bug, whose larvae feed solely on this tree. Using PacBio HiFi long reads and Omni-C technology, a 1,332.45 Mb genome was assembled, with 1,402 scaffolds and a scaffold N50 of 58.95 Mb. After feedback the annotations were improved, predicting a very high number (73,740) of protein-coding genes. The data presented here provides a valuable resource for further investigation of the biosynthesis of the phytochemical compounds in its milky latex, which potentially have many medicinal and pharmacological properties, as well as for increasing understanding of the biology and evolution of genome architecture in the Euphorbiaceae family and in mangrove species adapted to high levels of salinity.

      This evaluation refers to version 1 of the preprint

    2. AbstractThe milky mangrove Excoecaria agallocha is a latex-secreting mangrove distributed in tropical and subtropical regions. While its poisonous latex is regarded as a potential source of phytochemicals for biomedical applications, the genomic resources for E. agallocha remain limited. Here, we present a chromosomal-level genome of E. agallocha, assembled from a combination of PacBio long-read sequencing and Omni-C data. The resulting assembly size is 1,332.45 Mb, and the assembly has high contiguity and completeness, with a scaffold N50 of 58.9 Mb and a BUSCO score of 98.4%. 73,740 protein-coding genes were also predicted. The milky mangrove genome provides a useful resource for further understanding the biosynthesis of phytochemical compounds in E. agallocha.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.119), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Minghui Kang

      Is the data acquisition clear, complete and methodologically sound?

      The sample collection site needs to include latitude and longitude data.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Please add the software version number to all the software mentioned in the manuscript. Additionally, if the software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters

      Additional Comments: This study presents the assembly of an Excoecaria agallocha genome using PacBio HiFi and Omni-C technologies. The assembly exhibits good contiguity and completeness, providing a valuable resource for further understanding the phylogenetic position, evolutionary history, and natural product biosynthesis of Excoecaria agallocha. However, there are still some issues that need to be addressed and modified, including the following points:

      L82: It would be preferable to mention the number of chromosomes and the anchor rate of the chromosome-scale assembly here, as well as the estimated genome size based on k-mer analysis, to further support the accuracy and completeness of the assembly.

      L88: I think the authors need to rearrange the order of the figures, as it is not appropriate for Fig. 1F to appear before Fig. 1A. Please check the results section and arrange the figures in a reasonable order.

      L117: The sample collection site needs to include latitude and longitude data.

      L187: Please add the software version number for all the software mentioned in the manuscript. Additionally, if software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters.

      L219: The pseudochromosome scaffolding rate of 86.08% appears to be somewhat low (<90%). The sequences that were not scaffolded onto chromosomes could be a result of untrimmed redundancy in the genome assembly or could indicate some assembly errors.

      L220: Please note that in this instance, Fig. 1C appears before Fig. 1B in the text. I kindly request the authors to review and adjust the numbering and arrangement of figures throughout the entire manuscript.

      L223: The quality of gene annotation appears to be significantly lower than the quality of genome assembly (82.1%/98.4%), indicating poor gene annotation accuracy. Please review the accuracy of the HMM model trained by the Augustus software or consider using a more accurate annotation workflow.

      L225: Unclassified repetitive sequences account for over 50% of the total repetitive sequences, which can significantly impact subsequent analyses relying on repetitive sequences. It is recommended to use alternative software, such as the Extensive de novo TE Annotator (EDTA), which provides more accurate classification and utilizes a more comprehensive repetitive sequence library, to validate these results.

      Reviewer 2. Dr.Jarkko Salojarvi

      Is the language of sufficient quality? Yes.

      Are all data available and do they match the descriptions in the paper? Yes.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes.

      Is the data acquisition clear, complete and methodologically sound? Yes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes.

      Is there sufficient data validation and statistical analyses of data quality? Yes.

      Is the validation suitable for this type of data? Yes.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes.

    1. Editors Assessment:

      The King Angelfish (Holacanthus passer) is a great example of a Holacanthus angelfish, a genus of some of the most iconic marine fishes of the Tropical Eastern Pacific. However, very limited genomic resources currently exist for the genus, and these authors have assembled and annotated the nuclear genome of the species and used it to examine the demographic history of the fish. They used nanopore long reads to assemble a compact 583 Mb reference with a contig N50 of 5.7 Mb and a 97.5% BUSCO score. Scrutinising the data, the BUSCO score was high compared with the initial N50s, providing some useful lessons on how to get the most out of ONT data. The analysis suggests that the demographic history of H. passer was likely shaped by historical events associated with the closure of the Isthmus of Panama, rather than by the more recent last glacial maximum. This data provides a genomic resource to improve our understanding of the evolution of Holacanthus angelfishes, facilitating research into local adaptation, speciation, and introgression of marine fishes. In addition, this genome can help improve understanding of the evolutionary history and population dynamics of marine species in the Tropical Eastern Pacific.

      This evaluation refers to version 1 of the preprint

    2. AbstractHolacanthus angelfishes are some of the most iconic marine fishes of the Tropical Eastern Pacific (TEP). However, very limited genomic resources currently exist for the genus. In this study we: i) assembled and annotated the nuclear genome of the King Angelfish (Holacanthus passer), and ii) examined the demographic history of H. passer in the TEP. We generated 43.8 Gb of ONT and 97.3 Gb of Illumina reads representing 75X and 167X coverage, respectively. The final genome assembly size was 583 Mb with a contig N50 of 5.7 Mb, which captured 97.5% of complete Actinopterygii Benchmarking Universal Single-Copy Orthologs (BUSCOs). Repetitive elements account for 5.09% of the genome, and 33,889 protein-coding genes were predicted, of which 22,984 have been functionally annotated. Our demographic model suggests that population expansions of H. passer occurred prior to the last glacial maximum (LGM) and were more likely shaped by events associated with the closure of the Isthmus of Panama. This result is surprising, given that most rapid population expansions in both freshwater and marine organisms have been reported to occur globally after the LGM. Overall, this annotated genome assembly will serve as a resource to improve our understanding of the evolution of Holacanthus angelfishes while facilitating novel research into local adaptation, speciation, and introgression in marine fishes.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.115), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Iria Fernandez Silva

      Is the language of sufficient quality? Yes. But, A "the" is missing before "clingfish" in line 171

      Additional Comments:

      The genome assembly presented is of high quality, with accuracy and completeness values on par with chromosome-level assemblies. The study is very well presented in terms of the quality of the results and the clarity of the presentation of methods and results. An added value is that it allows understanding of how different types of data and assemblers interact in improving assembly quality. I also found it interesting to see how contiguity and completeness are not always correlated, as this assembly has a great BUSCO completeness score in spite of not having the greatest N50 (compared with the most modern assemblies). This is possibly inherent to the type of data (ONT reads), and this information may guide researchers in making decisions over future assembly projects. The demographic analysis is a nice addition to the study; the results are coherent and add information of interest for studying the evolution of reef fishes and the biogeography of the TEP. I would appreciate more detail in the captions of figure 4, particularly those of figure 4D.

      Reviewer 2. Yue Song

      The sequencing and annotation of the King Angelfish genome is impressive and represents a significant addition to the genomic resources for marine fishes. Through hybrid assembly, a high-quality genome was provided, and the relationship between the historical dynamics of its population and geological events was further discussed. However, in the section on inferring the demographic history, there is no mention of how the authors inferred the mutation rate of this species. In addition, the authors obtained 486 contigs for the whole assembly using ONT data combined with short reads. Is it possible to further assemble these contigs to the chromosomal level? Of course, this does not mean that it must be achieved within this manuscript, but rather suggests the inclusion of additional discussion on methods to further enhance the referential value of this genome.

      Additional specific comments:

      (1) Line 86: I guess the author probably meant to say there were 486 contigs, right?

      (2) Line 294: "gene models", not "gen models"

      (3) Lines 110-111: I am puzzled by the numbers in parentheses. I don't quite understand what these numbers mean, and I haven't seen any explanation in this MS. Did I miss something?

      (4) If possible, it is recommended to show the phylogenetic relationships between these species in Figure 3.
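
The reviewer's mutation-rate point is worth unpacking: PSMC-style demographic inference reports population sizes and times in scaled units, and converting them to absolute values requires an assumed per-generation, per-site mutation rate $\mu$ and generation time $g$. Under the standard coalescent scaling (a generic relation, not figures from the manuscript),

$$N_0 = \frac{\theta}{4\mu}, \qquad t_{\mathrm{years}} = 2 N_0 \, g \, t_{\mathrm{scaled}},$$

any error in $\mu$ rescales the inferred sizes and event times proportionally, which directly affects conclusions such as whether expansions predate the LGM.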

    1. Editors Assessment: Marsupial species are invaluable for comparative studies due to their distinctive modes of reproduction and development, but there is a shortage of genomic resources to do these types of analyses. To help address that data gap, multi-tissue transcriptomes and transcriptome assemblies have been sequenced and shared for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of breeding, is emerging as a useful lab model. Using ONT nanopore and PacBio long reads and Illumina short reads, 2,093,982 transcripts were sequenced and assembled, and functional annotation of the assembled transcripts was also carried out. Some additional work was required to provide more details on the QC metrics and access to the data, but this was resolved during review. This work ultimately produced a dunnart genome assembly measuring 3.23 Gb in length and organized into 1,848 scaffolds, with a scaffold N50 value of 72.64 Mb. These openly available resources hopefully provide novel insights into the unique genomic architecture of this unusual species and valuable tools for future comparative mammalian studies.

      This evaluation refers to version 1 of the preprint

    2. AbstractMarsupials exhibit highly specialized patterns of reproduction and development, making them uniquely valuable for comparative genomics studies with their sister lineage, eutherian (also known as placental) mammals. However, marsupial genomic resources still lag far behind those of eutherian mammals, limiting our insight into mammalian diversity. Here, we present a series of novel genomic resources for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of husbandry and ex-utero development, is emerging as a laboratory model. To enable wider use, we have generated a multi-tissue de novo transcriptome assembly of dunnart RNA-seq reads spanning 12 tissues. This highly representative transcriptome is comprised of 2,093,982 assembled transcripts, with a mean transcript length of 830 bp. The transcriptome mammalian BUSCO completeness score of 93% is the highest amongst all other published marsupial transcriptomes. Additionally, we report an improved fat-tailed dunnart genome assembly which is 3.23 Gb long, organized into 1,848 scaffolds, with a scaffold N50 of 72.64 Mb. The genome annotation, supported by assembled transcripts and ab initio predictions, revealed 21,622 protein-coding genes. Altogether, these resources will contribute greatly towards characterizing marsupial biology and mammalian genome evolution.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.118), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Qiye Li

      For the ONT, PacBio and Illumina data for genome assembly, is there any new data that was generated in this manuscript? Are all of the data collected from the same individual? If so, what is the gender of the individual for genome assembly? It will be appreciated to make this information clear to readers. Page 3: I think "Pacific Biosciences CRL" should be modified to "Pacific Biosciences CLR"

      Reviewer 2. Emma Peel.

      Are all data available and do they match the descriptions in the paper?

      No. The figshare link doesn't work, but I'm presuming this is because the paper hasn't been published? Will data be accessioned in the GigaScience Database to ensure accessibility? The Illumina short-read genomic and RNA-seq datasets are available through NCBI and match descriptions in the paper. I was unable to find the raw PB and ONT data from [68] that was used to generate the genome assembly. The authors of [68] indicate these datasets are available in supplementary table 3, but if you click through the figshare link in this table the raw data isn't there, nor anywhere else listed in the data availability section. Can the authors please clarify the location of the raw data and update the data availability section of this manuscript accordingly.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. Access to the GigaDB accession hasn't been provided, so I am unable to determine if the data and metadata is consistent with minimum information reporting standards according to the GigaDB checklists.

      Is the data acquisition clear, complete and methodologically sound?

      Yes. Some minor clarifications are required, see comments in the PDF. For example, please include detail on how RNA quality was determined (e.g. RIN numbers) and provide more detail regarding method of library preparation, flowcell and instrument used for Illumina sequencing.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. The only detail lacking is the method of transcript quantification used to determine the top 90% most highly expressed transcripts.
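
"Top 90% most highly expressed" is ambiguous; one common reading is the smallest set of transcripts jointly accounting for 90% of total expression. A minimal sketch of that interpretation (hypothetical names and TPM values; the manuscript may intend a different definition):

```python
# Minimal sketch: keep transcripts jointly accounting for 90% of
# total expression (one plausible reading; the paper may instead
# mean the top 90% of transcripts ranked by expression).
def top_expression_set(tpm_by_transcript, fraction=0.90):
    ranked = sorted(tpm_by_transcript.items(), key=lambda kv: kv[1], reverse=True)
    target = fraction * sum(tpm_by_transcript.values())
    kept, running = [], 0.0
    for name, tpm in ranked:
        if running >= target:
            break
        kept.append(name)
        running += tpm
    return kept

print(top_expression_set({"t1": 500.0, "t2": 300.0, "t3": 150.0, "t4": 50.0}))
# -> ['t1', 't2', 't3'] (500 + 300 + 150 = 950 of 1000 total TPM)
```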

      Is the validation suitable for this type of data?

      Yes. Data validation is suitable, however I would like to see a comparison of v1.1 genome assembly with other marsupial genome assemblies.

      Additional Comments:

      This study is an important addition to marsupial omics resources, and I was excited to see such a comprehensive set of transcriptomes. My main comment is the need to explain and discuss the initial assembly (v1) in the introduction to provide context for the improved assembly. See comments in the attached PDF.

      Annotated paper: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDg3L2d4LURSLTE3MDE2Njk5NzdfRVAgKDIpLnBkZg==

    1. AbstractBackground: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges for developing high-performance classification tools, potentially limiting downstream analyses. Findings: We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals. Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV. Competing Interest Statement: The authors have declared no competing interest.
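
As a rough illustration of the style of input encoding the abstract describes (a sketch only, not IPEV's actual feature construction, which combines trinucleotide pair relative distance with frequency as defined in the paper and repository), a sequence can be summarised as a 64×64 matrix indexed by ordered trinucleotide pairs and fed to a 2D CNN:

```python
# Minimal sketch: a 64x64 co-occurrence matrix of adjacent
# trinucleotides, one crude stand-in for the 2D input a CNN
# classifier could consume. Not IPEV's exact encoding.
from itertools import product

import numpy as np

TRIMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}

def trinucleotide_pair_matrix(seq):
    mat = np.zeros((64, 64))
    for i in range(len(seq) - 5):
        a, b = seq[i:i + 3], seq[i + 3:i + 6]
        if a in TRIMERS and b in TRIMERS:  # skips ambiguous bases (N)
            mat[TRIMERS[a], TRIMERS[b]] += 1
    total = mat.sum()
    return mat / total if total else mat

print(trinucleotide_pair_matrix("ACGTACGTACGT").shape)  # (64, 64)
```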

      Reviewer 2. Mohammadali Khan Mirzaei

      Yin et al. have developed a new tool to differentiate eukaryotic and prokaryotic viruses. The tool offers a potential benefit to the community, but there are several issues with the contribution in its current form, as discussed below.

      Major issues: The authors should separate their training and testing databases. Ideally, their testing dataset should include a set of previously unseen viruses whose hosts have been experimentally confirmed. In addition, the performance of IPEV should be compared with tools commonly used in the field, including vConTACT2 (https://doi.org/10.1038/s41587-019-0100-8) and iPHoP (https://doi.org/10.1371/journal.pbio.3002083). While neither of these tools is designed to directly differentiate eukaryotic and prokaryotic viruses, identification of viral taxonomy or host range can lead to identification of the viral type. Moreover, the authors have used multiple approaches in their assessment of virus type, yet it is not clear how the results generated by these approaches were combined when making decisions.

      Minor issues: Please use either phageome or phages instead of phage virome. There are some typos in the text that need to be fixed.