This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf098), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2: Jesse Daniel Brown
This manuscript addresses a relevant and timely question: benchmarking poly(A) tail-length estimation tools (BoostNano, tailfindr, nanopolish, and Dorado) using synthetic RNA standards (Sequins) with known tail lengths. Poly(A) tail-length estimation is increasingly important for understanding mRNA stability, processing, and regulation at the single-molecule level. As direct RNA sequencing expands in use, reliable methods to measure poly(A) tail lengths are needed. The study's design—leveraging Sequins as a "gold standard" to benchmark tools—is strong and fills an area of need in the current literature. The analysis is thorough in its basic comparisons, and the results are likely to be useful to researchers who need to choose suitable software for poly(A) tail analysis. However, the manuscript would benefit from deeper contextualization, more rigorous statistical methodology, and clearer reporting of computational details. Ensuring reproducibility and providing clearer guidance on interpreting the results in real biological contexts would strengthen the manuscript. The suggestions below are aimed at making the study more valuable to the community.
For this reason, my recommendation is Revisions ARE Needed
Introduction
Abstract: ★★★★☆ (4/5). The abstract, which stands in place of an introduction, has its strengths:
The introduction adequately outlines why polyadenylation is biologically important and why direct RNA sequencing provides a unique opportunity for poly(A) tail-length estimation.
It justifies the use of Sequins as synthetic standards, which is a robust approach to derive ground-truth tail lengths.
Areas for Improvement: The introduction could better connect poly(A) tail-length estimation to downstream applications. For instance, mention how accurate tail-length estimation could improve understanding of mRNA decay rates, translation efficiency, or isoform-specific regulation.
Adding references that contextualize poly(A) tail dynamics in broader biological phenomena would help readers understand the significance.
For example, it is almost a necessity to cite work such as "Roles of mRNA poly(A) tails in regulation of eukaryotic gene expression" by Lori A. Passmore & Jeff Coller (2022, Nature Reviews Molecular Cell Biology), which provides a comprehensive analysis of poly(A) tail dynamics and their impact on mRNA decay, stability, and translation regulation. Passmore & Coller (2022) also expands on these principles by discussing the mechanistic underpinnings of poly(A)-mediated decay and translation regulation, making it a broader and more recent contribution to polyadenylation biology, which the authors should consider.
Grammar of the abstract:
Error: "There are currently several tools available for poly(A) tail-length estimation, including well-established tools such as tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano."
Suggestion: "Several tools are currently available for poly(A) tail-length estimation, including well-established methods like tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano."
Error: "which lie within 12% of the correct value."
Suggestion: "that lie within 12% of the correct value."
Clarify the library preparation steps to avoid confusion about the "direct" nature of RNA sequencing. The text currently implies that no reverse transcription is required, but then references an ONT Reverse Transcription Adapter. Distinguish between a full-length cDNA synthesis step (not required) and the use of a poly(T)-containing adapter for sequencing library preparation.
Methods
Methods: ★★★★☆ (4/5)
The methods section has its strengths; the data sources and preparation (Sequins spiked into host RNA) are clearly described. Versions of tools are provided, enhancing reproducibility.
Areas for improvement include the statistical analysis, the comparisons and tests, the hardware and computational details, and the explanation of run-time differences.
Currently, the study models distributions as normal and uses mean and SD, but no normality tests or justification for these choices are presented. Consider performing normality tests or using nonparametric measures. Additionally, providing confidence intervals or other robust statistics (median, interquartile ranges) would clarify variability.
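To make this suggestion concrete, a minimal sketch of such checks in Python is given below; the per-read estimates here are simulated for illustration, and the array name est stands in for a tool's actual output:

```python
import numpy as np
from scipy import stats

# Simulated per-read tail-length estimates standing in for one tool's output
# on one dataset (expected tail length 30 nt).
est = np.random.default_rng(0).normal(loc=31, scale=4, size=2000)

# Shapiro-Wilk normality test (subsample if there are more than ~5000 reads).
w_stat, p_value = stats.shapiro(est)
print(f"Shapiro-Wilk: W={w_stat:.3f}, p={p_value:.3g}")

# Robust summaries that do not assume normality.
q1, med, q3 = np.percentile(est, [25, 50, 75])
print(f"median={med:.1f} nt, IQR={q1:.1f}-{q3:.1f} nt")

# 95% bootstrap confidence interval for the median.
ci = stats.bootstrap((est,), np.median, confidence_level=0.95).confidence_interval
print(f"95% CI for the median: {ci.low:.1f}-{ci.high:.1f} nt")
```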
For the comparisons and tests, the authors should explain why they chose root mean square error (RMSE) minimization and the other metrics.
Could alternative tests, like Wilcoxon signed-rank tests or paired t-tests, be used to compare the distributions of tail-length estimates more rigorously? The Wilcoxon signed-rank test is a non-parametric test suitable for paired comparisons when the assumption of normality is not met; it would be useful for comparing the predicted tail lengths from each tool against the expected lengths, especially if the data distribution is skewed.
A paired t-test could be applied if the normality assumption holds, providing a straightforward way to assess whether the mean difference between predicted and expected values is statistically significant. (If these tests are not used, justification should be provided for why not.)
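A minimal sketch of both tests with SciPy, assuming per-read estimates from two tools over the same reads (the data are simulated and the variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
expected = 30.0                     # known Sequin tail length in nt
tool_a = rng.normal(31, 4, 1500)    # simulated per-read estimates, tool A
tool_b = rng.normal(29, 6, 1500)    # simulated estimates for the same reads, tool B

# Wilcoxon signed-rank: is the median deviation from the expected length zero?
res_w = stats.wilcoxon(tool_a - expected)
print(f"Wilcoxon: statistic={res_w.statistic:.0f}, p={res_w.pvalue:.3g}")

# Paired t-test between the two tools on the same reads
# (appropriate only if the per-read differences are roughly normal).
res_t = stats.ttest_rel(tool_a, tool_b)
print(f"paired t-test: t={res_t.statistic:.2f}, p={res_t.pvalue:.3g}")
```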
There are some additional metrics to explore (a computational sketch follows this list):
---Median Absolute Deviation (MAD): Consider adding MAD as it is robust to outliers and could complement RMSE to provide a better understanding of central tendencies and variability.
---Mean Absolute Error (MAE): MAE is another alternative that simplifies the interpretation by focusing solely on the magnitude of errors without squaring them, potentially offering more intuitive insights for readers.
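A small sketch of how these metrics could be computed side by side; the data are simulated, with a few outliers added to show how the metrics diverge:

```python
import numpy as np

def error_metrics(pred, expected):
    """RMSE, MAE, and MAD of per-read estimates against a known tail length."""
    err = np.asarray(pred) - expected
    rmse = np.sqrt(np.mean(err ** 2))               # penalizes large errors heavily
    mae = np.mean(np.abs(err))                      # plain average error magnitude
    mad = np.median(np.abs(err - np.median(err)))   # robust to outliers
    return rmse, mae, mad

rng = np.random.default_rng(2)
pred = np.concatenate([rng.normal(31, 4, 1900), rng.uniform(0, 10, 100)])
rmse, mae, mad = error_metrics(pred, expected=30.0)
print(f"RMSE={rmse:.1f} nt, MAE={mae:.1f} nt, MAD={mad:.1f} nt")
```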
The authors should address testing for normality, explicitly stating whether normality tests were conducted on the data (e.g., Shapiro-Wilk or Kolmogorov-Smirnov tests). If normality is confirmed, justify the use of parametric approaches such as normal-distribution fitting and t-tests. If not, justify why non-parametric tests (e.g., Wilcoxon) were not employed, or discuss plans to include them in future studies.
Explain the choice of statistical methods by discussing how the chosen tests align with the study's goals. For example, emphasize whether the focus was on understanding overall error distribution, tool consistency, or accuracy in predicting specific tail lengths.
The authors could complement the statistical tests with visual representations of error, such as boxplots, violin plots, or Bland-Altman plots, to illustrate the error distributions and discrepancies between predicted and actual tail lengths across tools.
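For the Bland-Altman suggestion in particular, a minimal matplotlib sketch might look like this (simulated paired estimates; limits of agreement at ±1.96 SD of the differences):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
tool_a = rng.normal(31, 4, 1500)            # simulated per-read estimates, tool A
tool_b = tool_a + rng.normal(-1, 3, 1500)   # simulated estimates for the same reads, tool B

mean_ab = (tool_a + tool_b) / 2
diff_ab = tool_a - tool_b
bias = diff_ab.mean()
loa = 1.96 * diff_ab.std(ddof=1)            # 95% limits of agreement

plt.scatter(mean_ab, diff_ab, s=4, alpha=0.3)
plt.axhline(bias, color="k")
plt.axhline(bias + loa, color="k", linestyle="--")
plt.axhline(bias - loa, color="k", linestyle="--")
plt.xlabel("Mean of the two estimates (nt)")
plt.ylabel("Difference between estimates (nt)")
plt.title("Bland-Altman plot of per-read tail-length estimates")
plt.show()
```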
The authors should provide hardware and computational details: explicit information on the computational environment (CPU/GPU models, RAM, OS) for each tool's run. While the GitHub README explains how to run the system, it lacks any details about system requirements.
Readers need this to understand runtime differences and attempt to replicate performance measurements.
The authors should consider tool parameterization and indicate if any specific parameters (beyond defaults) were used in tailfindr, nanopolish, Dorado, or BoostNano runs. If no changes were made from defaults, state this explicitly.
Results
The strengths of the results are that they are presented clearly, showing density distributions and discussing short-tail anomalies. The identification of Dorado as a preferred tool due to speed, integration, and conservative filtering is well-supported by the data. The study acknowledges that all tools achieve broadly similar accuracy, differing mainly in runtime and filtering criteria, which is a practical insight for users.
The results have areas for improvement:
Regarding the short-tail reads explanation, the authors attribute short (<10 nt) poly(A) tails to truncated transcripts or mis-priming. It is suggested that the authors strengthen this discussion with additional evidence or reasoning. For instance, is there a correlation between read quality and short-tail length estimates? Do truncated reads consistently align to internal A-rich stretches?
Multiple peaks in distributions: Some density plots (Figure 1) show multiple peaks or shoulder peaks.
Discuss potential reasons for these patterns. Are they related to tool-specific biases, read quality, or adapter/poly(T) truncation?
Application Context: The results focus on method performance, but it would help readers to understand how these differences might influence downstream tasks. For example, if a method overestimates poly(A) length slightly, how could this affect conclusions about RNA stability or differential tail-length analysis between experimental conditions?
Figures and tables:
Figure 1:
Clear density plots, but consider adding vertical lines at expected tail lengths (30 nt and 60 nt) to guide interpretation. Splitting the figure into separate panels for R1 and R2 or using insets might clarify multiple peaks.
Figure 2:
The IGV snapshots are informative. Enhance interpretability by adding annotations (arrows or boxes) highlighting truncated vs. full-length reads. Increase font sizes for readability.
Figure 3:
Useful comparison of reads filtered by Dorado but retained by BoostNano.
Add a brief note or labeling to indicate expected tail lengths.
Discuss possible reasons for Dorado's conservative filtering here or in the main text.
Tables:
Provide definitions for abbreviations (nt, CPU, GPU) in captions. For Table 2, adding confidence intervals around the mean tail-length estimates would strengthen statistical rigor. For Table 3, specify hardware details as recommended above.
Grammar mistakes and errors in the Results section:
Sentence: "The four methods display a similar pattern in the density distribution, with a prominent normal-like peak near the expected poly(A) length, but also with a over-representation of shorter poly(A) tails, ranging at approximately ~0-10 nt (Figure 1)."
Issue: "a over-representation"
Correction: "an over-representation"
Sentence: "We expected that these shorter peaks were derived from either fragmentation of the transcript, mis-priming of internal poly(A) stretches or degradation of the poly(A) tails."
Issue: tense mismatch ("expected" vs. "were derived").
Correction: "We expect" -- "were derived", loses context and tense contformity-- therefore the sentence should be adjusted-
"We hypothesize that these shorter peaks are derived from either fragmentation of the transcript, mis-priming of internal poly(A) stretches, or degradation of the poly(A) tails."
Sentence: "Interestingly, upon investigating these earlier peaks, we found that Dorado excludes reads which are retained in the analysis by BoostNano, despite them being classified as passed reads (Figure 3)."
Issue: Ambiguous pronoun "them." (them could incorrectly identify three possible targets in the sentence)
Correction: "Interestingly, upon investigating these earlier peaks, we found that Dorado excludes reads retained in the analysis by BoostNano, even though these reads are classified as passed reads (Figure 3)."
Sentence: "Therefore, Dorado appears to be a more conservative approach than BoostNano."
Issue: No grammar issues, but the statement could be more precise.
Suggested improvement: "Thus, Dorado demonstrates a more conservative approach compared to BoostNano."
Sentence: "In order to determine which normal distribution fit the peak best, we found the parameters (mean, SD) which minimize the root mean square error between the candidate normal distribution and the density distribution for an interval of 10 nt to the right of the mode."
Issue: Verb tense consistency ("fit").
Correction: "To determine which normal distribution fits the peak best, ..."
Sentence: "The peaks also lose their normal-like behavior for larger values."
Issue: Could use a more formal tone. Correction: "The peaks also deviate from their normal-like behavior at larger values."
Sentence: "Next, we compared the computational time required by each method to predict the tail-length of 4000 reads."
Issue: Hyphenation of "tail-length."
Correction: "Next, we compared the computational time required by each method to predict the tail length of 4,000 reads."
Sentence: "BoostNano also offers the option of using the Application Programming Interface (API) call instead of the direct method, which omits the file copy implemented in the direct approach, reducing the run time to 8 m 8 s."
Here, the sentence is overwritten, which causes a lack of clarity.
Correction: "BoostNano offers an alternative API-based method, which skips the file copy step of the direct approach, reducing the runtime to 8 minutes and 8 seconds."
Discussion
Discussion: ★★★☆☆ (3/5)
The discussion has its strengths, as it correctly identifies that Dorado's advantages (speed, integration with basecalling) make it appealing as a default choice.
The authors acknowledge that all tools are within a similar accuracy range, suggesting the deciding factor may be speed or integration rather than raw performance differences.
HOWEVER, there are areas for improvement:
Further dissect the limitations of each tool. For example, BoostNano shows good SD but slightly off mean for R1; what does this mean for its use cases?
Address the discrepancy between tailfindr, nanopolish, and Dorado in terms of how they define and detect poly(A) boundaries. Why does Dorado not evaluate start/end positions of poly(A) tails in event space, and how might this influence results?
Include a brief discussion about how results might generalize to more complex transcriptomes. Real samples have varying GC content, fragment lengths, and potentially modified bases. A short commentary acknowledging these factors would show awareness that synthetic standards cannot capture the full complexity of natural RNA populations.
For these reasons, it is suggested that the authors outline future directions.
For instance, how could tool developers incorporate these findings to improve their methods? Could future benchmarking sets include a gradient of tail lengths to better understand length-specific biases?
Grammar mistakes and errors in the Discussion section:
Sentence: "BoostNano and tailfindr tools provided estimation of the starting and ending positions of the poly(A) tails in event space while this information was absent in Dorado outputs."
Issue: "provided estimation" should be "provide estimation" to align with present tense.
Correction: "BoostNano and tailfindr tools provide estimation of the starting and ending positions of the poly(A) tails in event space, while this information is absent in Dorado outputs."
Sentence: "On the R1 dataset, BoostNano showed a tighter distribution with the smallest SD, but its peak was the furthest from the correct value."
The issue here is verb tense inconsistency: the results describe general truths, so "showed" should match the present tense used elsewhere in the section.
Correction: "On the R1 dataset, BoostNano shows a tighter distribution with the smallest SD, but its peak is the furthest from the correct value."
Sentence: "tailfindr had the most accurate estimation but also the largest error interval."
The issue here is a verb tense mismatch; "had" should be in the present tense to state a general truth rather than a past one.
Correction: "tailfindr has the most accurate estimation but also the largest error interval."
Sentence: "Furthermore, Boostnano is more lenient in keeping reads for poly(A) estimation than Dorado."
Issue: "Boostnano" capitalization error; it should be "BoostNano."
Correction: "Furthermore, BoostNano is more lenient in keeping reads for poly(A) estimation than Dorado."
Sentence: "Overall, our results suggest that the four tools investigated in this study - BoostNano, tailfindr, nanopolish and Dorado have similar performance with their accuracy varying from one dataset to the other, with a potential length bias."
Issue: missing punctuation for clarity; rephrase "with their accuracy varying from one dataset to the other" for conciseness.
Correction: "Overall, our results suggest that the four tools investigated in this study—BoostNano, tailfindr, nanopolish, and Dorado—have similar performance, with accuracy varying across datasets and showing potential length bias."
Sentence: "Therefore, we expect Dorado to be implemented as the default method of poly(A) tail estimation in the near future, with the rapid estimation timeframe, comparable estimation lengths to other tools, conservative nature and the added benefit of ease of obtaining this information during basecalling."
There are several issues here including verbosity and lack of parallelism.
Correction: "Therefore, we expect Dorado to be implemented as the default method for poly(A) tail estimation, given its rapid estimation timeframe, comparable accuracy to other tools, conservative nature, and ease of integration with basecalling."
Sentence: "This work demonstrates the value of having access to synthetic RNA molecules with known poly(A) tail-lengths for validating the accuracy of poly(A) tail estimation algorithms."
Issue: The phrase "validating the accuracy of" could be simplified for readability.
Correction: "This work demonstrates the value of synthetic RNA molecules with known poly(A) tail lengths for validating poly(A) tail estimation algorithms."
Sentence: "As methods improve, we anticipate that these datasets will be valuable for assessing improvements in estimation of poly(A) tails."
Issue: "improvements in estimation of" is awkward.
Correction: "As methods improve, we anticipate that these datasets will be valuable for assessing advancements in poly(A) tail estimation."
References need to be added to accommodate the suggested material, but the existing references are good.
NEEDS REVISION
Jesse Daniel Brown PD AASU
Note:
I previously reviewed this paper on Research Hub, and you can read those comments via the Research Hub review page here: https://www.researchhub.com/paper/8634403/using-synthetic-rna-to-benchmark-polya-length-inference-from-direct-rna-sequencing/reviews#threadId=55398.
The original preprint linked to the Research Hub review is here: https://doi.org/10.1101/2024.10.25.620206