312 Matching Annotations
  1. Jun 2024
    1. Discussion

      Given that you have a very low fraction of bacterial reads, which is a common problem in the field, I think a useful contribution from your data would be to create a panel of primers to amplify community members that you see are present. This would give you more resolution than 16S but allow you to avoid more of the host sequencing data. However, the usefulness of such a panel would be bounded by whether it would be adopted by others in the field. It would probably be most useful if you applied it to this fish farm repeatedly, but I'm not sure whether doing so is biologically interesting.

    2. However, by examining the bacteriome in detail, we can obtain much more information about its composition and function than diversity alone can tell us. Based on the taxonomic constitution of our samples, Proteobacteria and Actinobacteria phyla were clearly dominant both in fish skin mucus and water samples. The dominance of the Proteobacteria phylum is not an uncommon observation in fish external mucus samples1,3,5,6,8,11,21,62,63, however, differences between fish species have been observed for the other phyla1,11,62,63. Moreover, significant within-species variability in dominant phyla has been described64, and variability within individuals related to body sites should be noted12. The microbiome can be an important indicator of various pathological conditions, which has already been described in fish, for example, in the case of the gastrointestinal tract65. In this regard, the Bacteroidota phylum may be interesting, which has been highlighted as a marker for eutrophication9,66. Understanding the changes in the composition of the bacteriome or even the microbiome during different pathological conditions can be an important step in understanding and potentially diagnosing disease processes. Our results are therefore in line with the dominance of the Proteobacteria phylum observed in other fish species, but direct comparison with C. carpio is not possible due to the lack of available data. Of course, our observations on the bacteriome composition of our samples are also limited by their paramount host genome contamination, which reduced the coverage of bacterial genomes of interest in the sequencing reaction.

      Since you have the resolution to go below phylum, I think it would be interesting to focus on that more in the discussion.

    3. Even though this might limit our conclusions on the bacteriome composition of the common carp skin mucus, our samples still provide valuable insight into the main constitution of fish skin mucus bacteriome.

      I agree, but I think this would be worth mentioning in the abstract, and perhaps in the last paragraph of the introduction, to better prepare your readers for the types of results you are going to present.

    4. Bacteria (mean ± SD) was 0.12 ± 0.12

      The percentage or fraction? If percentage, less than 1% is incredibly small and I would question any results in this report. How many reads total was this? If you used a very high depth, you might capture a substantial portion of the community.
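
      As a back-of-envelope check (the depths below are made up, and I'm assuming the 0.12 figure is a percentage):

      ```python
      # If 0.12% of reads are bacterial, the usable read count depends
      # entirely on total sequencing depth.
      for total_reads in (10_000_000, 100_000_000, 1_000_000_000):
          bacterial = total_reads * 0.0012
          print(f"{total_reads:,} total reads -> ~{bacterial:,.0f} bacterial reads")
      ```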

    5. For functional prediction of the bacteriome, reads classified as originating from bacteria were assembled to contigs by MEGAHIT v1.2.940

      I imagine you might have a huge amount of dropout here by applying Kraken first and then assembling with MEGAHIT. I would either: 1. map reads to carp first, and then assemble anything that doesn't map; or 2. assemble everything and then filter out carp contigs.
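
      Something like this is what I have in mind for option 1 (a sketch only -- file names are placeholders, and minimap2, samtools, and MEGAHIT are assumed to be on PATH):

      ```python
      import subprocess

      # Map reads to the carp genome and name-sort the alignments.
      subprocess.run(
          "minimap2 -ax sr carp_genome.fa reads_1.fq reads_2.fq"
          " | samtools sort -n -o aln.bam -",
          shell=True, check=True)
      # -f 12 keeps pairs where neither mate mapped to the host.
      subprocess.run(
          "samtools fastq -f 12 -1 unmapped_1.fq -2 unmapped_2.fq aln.bam",
          shell=True, check=True)
      # Assemble only the putatively non-host reads.
      subprocess.run(
          "megahit -1 unmapped_1.fq -2 unmapped_2.fq -o nonhost_assembly",
          shell=True, check=True)
      ```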

    6. genus level in more detail

      Why not the species level?

    7. Taxonomic classification of the reads was performed with Kraken v2.1.234 to the NCBI nt database (built on: 26.12.2022).

      It might be worth mapping back to the host genome if you have one prior to performing taxonomic classification.

      I would also be interested to see Nonpareil curves of your sequencing data before and after host mapping. I would be curious whether you reached saturation of the community -- this can usually be better assessed with raw sequencing data than with taxonomically classified reads.

    8. Rarefaction curves were calculated with the vegan v2.6-237 package at the species level.

      Can you add which functions you used to do this in the vegan package?

    9. TrimGalore v0.6.732 was used for quality trimming of the merged and forward unmerged (see above) reads.

      What filters did you use here? I'm curious how many reads were lost to filtering.

    10. At the farm where samples were collected, both scaly and mirror carp phenotypes are kept. During the sample collection, we could sample two of each at one pond, however, only one scaly and three mirror carp at the other. Furthermore, it is worth mentioning that two specimens from pond 1 had ulcers on their skin, otherwise, all sampled fish appeared to be healthy. Details on the metadata on each sample, along with the number of reads used for classification, can be found in Supplementary File 1. In addition to the skin mucus samples, water was collected from each pond. Water and mucus samples were frozen immediately after collection on dry ice and were subjected to shotgun metagenomic sequencing.

      Do you have any idea whether the bacterial load of the water, and therefore of the skin of the carp, was much higher than for fish observed in the wild, or for fish typically sequenced with 16S? I'm wondering if there were more bacteria than usual, and whether that is why you were able to get enough bacterial reads to perform an analysis.

    11. Due to the economic importance of the common carp among freshwater fish species16–18

      Would you be able to provide some specific examples of the economic importance (even half a sentence)? I'm not a carp expert so I have no idea what these might be!

    12. The microorganisms that inhabit the skin are important for the well-being of their hosts3–5. They might even play a practical role in the maintenance of the health of these animals, for example, as an indicator of various pathological conditions13,14, or as a source for potential future probiotics15. Due to the economic importance of the common carp among freshwater fish species16–18, efforts to protect their health are particularly important.

      Can the microbiome also be pathogenic for carp?

    13. However, it should be noted that studies on the bacteriome and microbiome of this species are underrepresented compared to other species, especially considering the skin mucus bacteriome. For this reason, it would be beneficial to increase our knowledge on the bacteriome of the common carp as well. Despite the long history of the study of the microbial and bacterial community of the outer surface of fishes19,20, it has recently received much more attention due to the advent of next-generation sequencing (NGS) technologies4,14. However, it is important to note that 16S rRNA gene-based methods have been used in the majority of such studies on the bacteriome of fish skin mucus4,14,21. A review article from 2021 listed only one paper using shotgun metagenomics for the analysis of the external surface of eels21,22. Beyond which, to the best of our knowledge, we are aware of only one further shotgun metagenomics study from 202023 investigating the fish skin metagenome of cartilaginous and bony fishes from an evolutionary perspective. Despite the conflicting results on the effectiveness of the two methods in revealing microbial community structure24–27, it is certain that shotgun sequencing-based methods have the major advantage of providing much greater insight into the functional organization of microbial communities14,24,25.

      I think this section doesn't highlight the massive challenge in trying to get shotgun metagenomic sequencing data from fish. In the experiments where we have tried (killifish, different tissues), we end up with 98 or 99% killifish (host) reads. 16S allows us to amplify and get just the microbial signal.

      We have talked about trying to build a more balanced marker gene panel, but that has methodological problems, like not having as many tools available and needing to determine the best marker genes to use.

      It would be nice if these challenges were better represented. I think the reason this gap exists is methodological (it is hard to get shotgun sequencing data from fish), not a lack of interest.

    14. The colonization of the skin mucus of fishes is assumed to originate from the surrounding water, which process may even start at the larval stage3. However, the fish skin bacteriome composition is influenced by several factors such as stress1, water pH level6 or other environmental influences7–9. Furthermore, even the genetics and diet of the host species can have an effect on its structure1,8,10,11. Moreover, even within a single individual, different body parts may show differences in microbiome composition12.

      I'm curious about the extent to which these studies investigated farmed vs. wild fish, and whether you think that would make a difference to the microbiome. It might be helpful to include that distinction when covering the literature in the introduction, given that you see some results that you don't expect relative to other observations in the field.

  2. May 2024
    1. Materials and Methods

      I may have missed it, but I didn't see a methods section for peptide discovery/annotation. Would you be willing to add this?

    2. especially regarding taxon sampling and filtering of sites/genes.

      Is it possible to be more specific about the sequencing data that would be needed to answer this question? I think it could be a value-add for the community to have this clearly spelled out.

    3. post-assembly errors.

      would these be post-assembly errors or assembly errors? If post-assembly, can you add details about what this means?

    4. The completeness (% of complete BUSCOs) of the four new gene catalogs generated in this study fell within the range of recently sequenced tick genomes as shown in Table 2. Completeness was lowest in I. pacificus (81%), and highest in I. ricinus and I. hexagonus (about 91%), which is somewhat lower than the 98% observed for the recently improved genome of I. scapularis (De et al. 2023). For I. pacificus, we also note a relatively high percentage of “duplicated” genes in the BUSCO analysis, suggesting that heterozygosity might have not been fully resolved and that our assembly still contains duplicate alleles, which is supported by the higher heterozygosity estimate for this genome (supplementary Fig. S1).

      It isn't clear in this section whether you ran BUSCO on the genomes in genome or protein mode. I think it would be helpful to see both sets of results -- genome mode alone would tell us how many single-copy genes are in the assembly, while comparing the protein results against it would tell us how well annotation went for these genomes as well.

  3. Apr 2024
    1. Aligning gene embeddings from different species into the same space opens up the possibility of using advanced deep learning technology for inter-species comparisons.

      Are your methods versatile enough to allow this approach for new species pairs if there is enough RNA-seq data? What is the minimum number of RNA-seq samples needed?

    2. These genes have low RNA similarities in mice, suggesting that the correlation of these genes with other genes and possibly their functions have diverged between mice and humans. This divergence could contribute to the discrepancies observed between the human and mouse studies.

      Is there a way to look at this more systematically? This is hugely valuable.

    3. This suggests that our approach, although relying solely on transcriptome data, can achieve superior performance compared to models that incorporate multi-omics data.

      Do you think this is because of the expression levels of genes or some other signal? I think unpacking this more would be a big value-added for understanding this model.

    4. To mitigate bias from using a single phenotype annotation dataset, we analyzed another dataset, specifically the ‘mouse models of human diseases’ annotations from the MGI database. This dataset includes information about human diseases, their mouse models, and the associated genes. Given the low number of shared associated diseases among the homologous genes (Extended Data Fig. 8b), we focused on comparing proportion of each gene group with shared disease association(s). Results indicated that homologs with high RNA similarity have largest proportion of shared association(s). In contrast, homologs with low RNA similarity showed a smaller proportion, even if their DNA similarity is high (Fig. 3b and Supplementary Table 2).

      This is very cool. It would be interesting to correlate this with the number of failed clinical trials for therapies developed in mice and applied to humans. It might also be interesting to see if there are other DNA signals that could be used to improve DNA performance. Promoter sequences come to mind, but there might be others.

    5. The experiment indicated that it was not due to methodology limitations that fewer than 5,000 of 16,983 mouse genes were the nearest embedding neighbors of their human orthologs.

      Then what was it due to?

    6. 6,007 human genes their mouse orthologs were among the ten nearest embedding neighbors (Fig. 1g).

      What were the other, nearer embedding neighbors? Copy number variants of the same genes? Genes in the same pathway?

    7. Note that only the representations of these other genes would be used for later analyses.

      I don't understand the importance of this note as currently written.

    8. Thus, we forced the 5,000 randomly selected one-to-one orthologous gene pairs to have the same gene embeddings (Fig. 1b).

      Do you have confirmation that all information captured in the embeddings is synonymous between these 1:1 orthologs? For example, do the orthologs still share the same promoters and transcription factors? I think both of these would be fairly straightforward to check at scale.

    9. This includes gene-associated diseases/phenotypes, protein interactions, transcription factors, biological pathways, gene ontology, associated cell types and more.

      Thank you for including these details, this is very helpful

    10. Although multiple mouse gene expression studies have been conducted, they have often relied on techniques such as dimension reduction, phylogenetic clustering, co-expression analysis, and differential expression analysis16,18-33. These traditional methods face challenges in achieving a comprehensive comparison of genes at the RNA level, as they can be susceptible to batch effects, biased by small sample sizes, and constrained by the limited availability of samples from matched biological conditions33,34. Given the dynamic and complex nature of gene expression, which varies across genders, ages, tissues, and conditions, a thorough characterization at the RNA level necessitates integrating data from diverse biological contexts and a large collection of samples.

      I don't think this paragraph does justice to the work already undertaken and I don't think it highlights why there was a gap for this work to fill. I think the goals of the previous compendia that you included were very different than the goal of this paper, and I think that's ok! Unpacking that a bit more in this intro would be useful. Otherwise, it makes it sound like previous researchers made mistakes and that's why we need this paper, which I don't think is true.

    1. https://github.com/bacpop/MAG_pangenome_pipeline/tree/main/simulate_pangenomes

      Old URL, I think (though it still redirects to the correct place!)

    2. https://github.com/bacpop/CELEBRIMBOR

      Thanks for putting this together! I took a look at the repository and noticed a few changes that I think could help make CELEBRIMBOR more user friendly.

      1. Would you be willing to add tool versions to your environment.yml file? This will help make sure the pipeline stays installable over time, and help the Docker container's results match those of users who deploy this on e.g. an HPC.
      2. Does create_plots.py belong in the scripts/ directory instead?
      3. The hashes after the second equals sign in some of your YAML files will make the environments difficult to install across different operating systems (Linux vs. Mac) (e.g. https://github.com/bacpop/CELEBRIMBOR/blob/main/envs/Snakemake.yaml).
      4. The README uses the old repo name in the clone/cd instructions.
      5. Lastly, have you explored using something like a click interface and making the tool conda-installable? I know this is a big lift, but in my experience it makes it so much easier for others to pick up and use, including dropping the pipeline into larger pipelines. This pipeline might provide some inspiration for how to accomplish this: https://github.com/metagenome-atlas/atlas

    1. (such as edgeR [48], DESeq [49], limma [33], and voom [50])

      Can you be more specific here? Which functions in these packages accomplish the task (e.g. I believe it is vst() for DESeq2)? Can you also cite DESeq2 instead of DESeq? I think DESeq has been retired.

    2. Fig 2

      Would you consider switching this plot to an upset plot (R packages UpSetR or ComplexUpset) instead of a Venn diagram? For many intersections, upset plots are a bit easier to understand than Venn diagrams.
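
      If Python is easier for your pipeline, the upsetplot package does the same thing; a toy sketch with made-up intersection sizes standing in for the sets in Fig 2:

      ```python
      import matplotlib.pyplot as plt
      from upsetplot import from_memberships, plot

      # Each membership list names the sets an intersection belongs to.
      intersections = from_memberships(
          [["set A"], ["set B"], ["set A", "set B"], ["set A", "set B", "set C"]],
          data=[120, 80, 45, 12])
      plot(intersections)
      plt.show()
      ```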

    3. 228 paired primary-tumor/non-tumor RNA-seq samples available from 113 subjects

      Are these numbers right? Are 4 samples from one subject? 113*2=226, not 228.

    4. Importantly, it must be pointed out that gsea-3.0.jar, utilized in protocols published by Reimand et al [37], is affected by serious security vulnerabilities due to the use of the Java-based logging utility Apache Log4j in GSEA versions earlier than 4.2.3. Moreover, as reported by the GSEA Team, version 3.0 contained microarray-specific code (mostly related to Affymetrix) that may cause issues with RNA-seq data analysis, which was removed in later GSEA updates.

      Did you do anything to account for these things in your analysis?

    5. An important challenge of pathway enrichment analysis is that of gene set overlap, where some genes participate in multiple gene sets [35, 36].

      I'm so glad you included this! I have struggled with this a lot in my own research so I'm so glad to see it explicitly mentioned here.

    6. Taken together, S2 Fig and S3 Fig show that differential expression/enrichment analyses derived from these different count normalization and filtering procedures lead to highly concordant results at both gene and pathway levels.

      This is very nice.

    7. TPM > 1

      Is this supposed to be a less than sign instead of a greater than sign?

    8. positive-control pathways

      Can you prepend this with "cancer-type-specific" so that it's clear inline what this means without having to prematurely jump to a future section?

    9. harmonized

      Can you provide a little more context as to what this means? Are all samples consistently analyzed or is there some normalization that takes place as well?

    10. https://github.com/juliancandia/GSEARNASeq_Benchmarks

      I get a "Page not found" error when i navigate to this URL. Is the repo still private or is there a typo in the URL? I would love to look at/give feedback on the code as well!

    11. GSEA was run using the latest available version 4.3.2 (build 13, October 2022) [24].

      This sentence switches from active to passive voice, and the next sentence is active again ("We"). Would it be possible to make it active voice as well? Without that, it sounds like you didn't run the GSEA analysis yourselves but got it from somewhere else, which is a little confusing.

    12. specifically using the type of data on which GSEA is most commonly being currently utilized

      Meaning short read Illumina bulk RNA-seq data?

  4. Mar 2024
    1. and 512 GB of RAM

      Is this the recommended amount of RAM to run Chlomito? This is quite high.

    2. https://hub.docker.com/repository/docker/songweidocker/c

      This link gave me a 404

    3. By combining these two metrics, we can significantly improve the accuracy of identifying and removing organelle genome sequences from genome assembly data

      I'm assuming the second metric relies on mapped reads. Did you consider identifying spanning reads as further evidence for your tool? If a read spans an organellar genome sequence and a nuclear genome sequence (perhaps with k=21 bp overlap at minimum, or potentially higher), then I think that would show evidence of an HGT event.
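
      Something like this pysam sketch is what I have in mind (the function, contig name, and junction coordinate are all hypothetical; it assumes a coordinate-sorted, indexed BAM of reads mapped to the assembly):

      ```python
      import pysam

      def count_spanning_reads(bam_path, contig, junction, min_overlap=21):
          """Count reads whose alignments cross `junction` (a 0-based position
          where putative organellar and nuclear sequence meet) with at least
          `min_overlap` aligned bases on each side."""
          n = 0
          with pysam.AlignmentFile(bam_path, "rb") as bam:
              for read in bam.fetch(contig, max(0, junction - 1), junction + 1):
                  if (read.reference_start <= junction - min_overlap
                          and read.reference_end is not None
                          and read.reference_end >= junction + min_overlap):
                      n += 1
          return n
      ```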

    4. we packaged this method into a Docker image

      Is Chlomito available on GitHub? Where can the Docker image be downloaded from?

    5. Plum and Mango

      Would you be willing to provide details on the quality of these two genomes? How well known are the chloroplast and mitochondrial sequences in these (do they have gold-standard labels?)?

  5. Feb 2024
    1. The data for pre-training is openly available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/.

      Can you provide the date on which you accessed this data?

    2. Employing a sliding window approach, we truncate sequences longer than 3000 bp to 3000 bp. Sequences with length less than 200 bp are excluded

      I'm curious about the method employed here and the rationale behind it. 1. What was the step size of the sliding window? Does that mean a nucleotide at position X in a genome would be captured many times, in a series of different overlapping windows covering the 3k bases before and after it? 2. Why eliminate sequences less than 200 bp in length if they are "real"? Many non-coding RNAs or peptides are encoded by short sequences. Does this decision limit your embedding space to sequences > 200 bp long?
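
      To illustrate what I'm asking, here is the procedure as I understand it, as a sketch -- the stride is my assumption, since the text doesn't state it:

      ```python
      def window_sequence(seq, window=3000, stride=1500, min_len=200):
          """Split one nucleotide sequence into fixed-size windows;
          sequences shorter than min_len are dropped entirely."""
          if len(seq) < min_len:
              return []
          if len(seq) <= window:
              return [seq]
          return [seq[i:i + window]
                  for i in range(0, len(seq) - window + 1, stride)]
      ```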

    3. 91.7 million (M) nucleotide sequences

      It would be super helpful at this point to break down a bit more the type of data this model was trained on -- genomes, transcriptomes, things in GenBank, etc.

    4. Existing models, such as Enformer and TIGER 7,8, have made contributions to specific genome tasks.

      Would it make sense to include citations to Nucleotide Transformer and DNABERT as well? Unless I'm misunderstanding the application space, these seem like seminal efforts in this area.

  6. Jan 2024
    1. Secondly, the two general purpose predictors, one of which is based on a convolutional neural network (UniDL4BioPep-A), and the other on an ensemble of three simpler ML models (AutoPeptideML), both have comparable performance.

      I'll post this on GitHub as well, but are these models available for use? I know I could use UniDL4BioPep and their models, but I like your tool and would love to be able to use the models you built!

    2. The biological hypothesis is that peptide sequences evolve with fewer constraints compared to protein sequences.

      It would be interesting to unpack this sentence a bit more in the discussion. What do you mean by this, and can you provide any references for this?

    3. Figure 4

      Why do you think performance is so bad for blood-brain barrier?

    4. The self-reported values for the handcrafted models referenced in Table 1 are included with the evaluation in the original set of benchmarks to contextualise the contributions of both general purpose frameworks.

      The way this is plotted is really confusing. I would suggest plotting the handcrafted models as a vertical line or a dot instead of as their own bars. The reference-line approach would make it clear that you're not re-running those models the way you are the other two.

    5. Step 6 - Model Evaluation. The ensemble obtained in the previous step is evaluated against the hold-out evaluation set using a wide range of metrics that include accuracy, balanced accuracy, weighted precision, precision, F1, weighted F1, recall, weighted recall, area under the receiver-operating characteristic curve, matthew’s correlation coefficient (MCC), jaccard similarity, and weighted jaccard similarity as implemented in the scikit-learn [52]. The plots generated include calibration curve, confusion matrix, precision-recall curve, and receiver-operating characteristic curve, as implemented in sckit-plot [60]. Step 7 - Prediction. AutoPeptideML can predict the bioactivity of new samples given a pre-trained model generated in Step 5. Predictions are a score within the range [0, 1]. This result can be interpreted as the probability of the peptide sequence having the target bioactivity, given the predictive model (P (x ∩ + |MODEL)). This step outputs a CSV file with the peptides sorted according to their predicted bioactivity probability.

      This is sort of a dumb question at this point, but are all classifiers explicitly binary? How would you recommend a user who is interested in many classes of peptide bioactivity (or who plans to, say, predict all peptides in a metagenome and then predict bioactivities) go about assessing all classes of bioactivity?
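
      For example, is the intended usage something like running each binary model independently over the same peptide? (The models and features below are random stand-ins, not the AutoPeptideML API.)

      ```python
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Stand-ins for independently trained binary bioactivity models.
      rng = np.random.default_rng(0)
      X, y = rng.random((200, 8)), rng.integers(0, 2, 200)
      models = {name: LogisticRegression().fit(X, y)
                for name in ("antimicrobial", "antihypertensive", "neuropeptide")}

      def score_all_bioactivities(features, models):
          """Each binary model returns an independent P(active), so the
          scores need not sum to 1 across bioactivity classes."""
          return {name: float(m.predict_proba([features])[0, 1])
                  for name, m in models.items()}

      print(score_all_bioactivities(X[0], models))
      ```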

    6. To avoid introducing false negative peptides into the negative subset, the algorithm accepts an optional input containing a list of bioactivity tags that the user considers may overlap with the bioactivity of interest and should, therefore, be excluded.

      This is nice :)

    7. If both positive and negative peptides are provided, the program balances the classes by oversampling the underrepresented class and continues to Step 3; if negative peptides are not provided, it executes Step 2 to build the negative set.

      Would it be possible for a user to provide some negative peptides, but then to also have those negative peptides supplemented by those generated in Step 2? This might be nice for users that have small input data sets if it's possible.

    8. AutoPeptideML only requires a dataset of peptides known to be positive for the bioactivity of interest

      It would be super helpful if you could provide estimates for the minimum number of required sequences or any other information you can add to help a user build intuition for how many sequences they need to provide

    1. To compare the amino acid usage in the functional classes, single amino acid, dipeptide, and tripeptide frequencies were plotted (Figure 2). The amino acid frequency plot in Figure 2A reveals that some BPs classes have distinct characteristics. For example, celiac disease BPs have the highest frequency of proline and glutamine, opioid peptides are enriched in tyrosine and glycine, while cardiovascular BPs have slightly higher frequencies of alanine and the highest frequency of the negatively charged amino acids aspartic acid and glutamic acid.

      I'm curious if using degenerate amino acid alphabets (Dayhoff encoding, hydrophobic-polar, etc.) would further improve classification accuracy or show interesting patterns.
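
      For concreteness, this is the kind of re-encoding I mean (Dayhoff's six groups; the example peptide is Met-enkephalin, an opioid peptide):

      ```python
      # Dayhoff groups: a=C; b=A,G,P,S,T; c=D,E,N,Q; d=H,K,R; e=I,L,M,V; f=F,W,Y
      DAYHOFF = {aa: group
                 for group, aas in {"a": "C", "b": "AGPST", "c": "DENQ",
                                    "d": "HKR", "e": "ILMV", "f": "FWY"}.items()
                 for aa in aas}

      def to_dayhoff(peptide):
          """Collapse a peptide to the 6-letter Dayhoff alphabet before
          computing residue or k-mer frequencies."""
          return "".join(DAYHOFF[aa] for aa in peptide.upper())

      print(to_dayhoff("YGGFM"))  # -> fbbfe
      ```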

    2. After filtering the sequences, and merging functionally overlapping or related classes, the final database consisted of 3990 BPs divided into nine different functional groups (Table 1).

      It would be nice to see how many peptides were dropped at each stage of filtering. I'm also curious, since you dropped so much data, whether the model would be more generalizable if that data were included somehow.

    3. https://github.com/BizzoTL/CICERON/

      would you be willing to add a license to the repository so terms of re-use of your work are clear?

    4. 70:20:10

      Sorry if I missed this, but can you report the total size of each group?

    5. HugginFace Transformers

      typo, I think :)

    6. “antihypertensive”, “ACE-inhibitory” and “Renin-inhibitory” as Antihypertensive; “DPP-IV inhibitors” and “alpha-glucosidase inhibitors” as Antidiabetic; “antimicrobial”, “antifungal”, “antibacterial” and “anticancer” as Antimicrobial; “antithrombotic”, “CaMKII Inhibitor” as Cardiovascular with positive effects on vascular circulation; “Antiamnestic”, “anxiolytic-like”, “AChE inhibitors”, “PEP-inhibitory” and “neuropeptides” as Neuropeptides.

      Curious if you tried without these groupings -- i.e., how dissimilar are some of the peptides that were placed into combined groups, and could the model have done well without these groupings?

    7. otherwise, they were excluded from the analysis.

      Similarly, how often does this occur?

    8. Peptides with identical sequences but different functional class assignments were removed to avoid introducing potential biases in the classifier’s training.

      How often does this occur?

    9. The final result, CICERON, consists of nine different binary classifiers capable of identifying the products of microbial fermentation-derived BPs.

      Did you try this as a multi-classification problem and end up with a better performance with binary classifiers? Or was the underlying model you used limited to binary classification?

    10. Given the importance of BPs, there have been several attempts to create in-silico approaches to perform a preliminary assignment of the potential functional properties and facilitate the subsequent discovery and testing process in vivo [19–24]. These methods rely on several databases where peptides from various experiments have been collected and classified according to the BPs functional classes. Using the sequence properties of the peptides, such as amino acid composition, or the presence of sequence patterns of interest, peptides can be assigned to a functional class depending on the type of classifier used.

      I'm curious if this task is different than determining whether an amino acid sequence of 2-50 aas is a bioactive peptide, or if one must first know that the sequence is a peptide to then apply these tools to categorize the peptide sequence into a functional class.

    11. Classification of bIoaCtive pEptides fRom micrObial fermeNtation

      One question that this name, and the abstract in general, left me with is whether this method is extensible beyond microbial fermentation peptides. I will continue to read to find out, but I'm wondering if another sentence might be added to clarify this.

      I'm also curious if microbial fermentation peptides include all known classes of peptides, or if there are some functional classifications that might not be labelled by CICERON because CICERON has not seen them before. Again I will continue reading to hopefully find out!

  7. Dec 2023
    1. In the future, we will work with the bioinformatics community to adopt standard pipelines to handle FHR-containing FASTA files. This will involve adding logic to existing FASTA software libraries to handle comments.

      Have you started work around this? Have you had any community buy-in? Do you have realistic timelines and goals around achieving this?

    2. Unfortunately, this is not always the case. Although some modern FASTA-consuming tools recognise and ignore semicolon-based FASTA comments, most do not. Fortunately, it is trivially easy to strip comments out of a FASTA file by removing lines that begin with semicolons. Users of FHR-enabled FASTA files may need to add this preprocessing step to their nucleic acid analysis pipelines before passing the file to downstream tools.

      This is a massive drawback and essentially makes the addition of all of this provenance info moot at worst and irrelevant to many FASTA files at best.

      Would it be possible to design a metadata standard that didn't rely on a header that would make most tools unable to process the data? Could it be extensible to other types of FASTAs?

    3. )

      double parentheses typo

    4. There is no formal way to add additional information to the sequence-level header line.

      Can you expand on what you mean by this?

    5. materials the

      missing comma

    6. legacy features

      Will new tools be compatible with these legacy features? I'll be curious as I continue reading the paper whether you have tried using a FASTA with this header with popular tools (BWA, seqkit, seqtk, samtools, etc)

    7. Several organism-focused genome data portals, such as AgBase (18), FlyBase (19), SoyBase (20), wFleaBase (21), WormBase (22), VectorBase (23), Ensembl (24), and others (25), publish annotations that are not found in the NCBI Assembly database. In some cases, these annotations and associated genomes cannot be submitted due to data ownership conflicts. These genome browsers and data repositories are often associated with a larger consortium that is working to answer questions of interest to the relevant scientific communities. Examples of such consortiums are the i5k (26, 27) Workspace (28), a collaborative effort to annotate arthropod genomes, and the Alliance of Genome Resources (The Alliance) (29) a centralised resource Model Organism resource.

      Do you need buy-in from each of these communities for your metadata standard to be a success? How do you plan to get that buy-in?

    8. Reducing discrepancies between genome references for the “same” organism can be aided by improving our ability to include crucial metadata about the origins of and means by which each genome reference is created in-line with the sequence data itself.

      Does NCBI fundamentally not allow for these metadata fields, or are they just not filled in by the users who upload the data? While creating a new metadata standard (as presented here) could in theory solve some of these issues, compliance by those who upload genomes will always be an issue no matter what standard is used. I think researchers default to not including information when they are unsure about that information, or unsure of themselves at the time of upload, which has historically been a rather stressful process.

    9. ,

      typo :)

    10. Differences can arise when a reference genome is replicated across platforms or devices (e.g. renaming of files or contigs, removal of contigs that fail to meet some criteria such as minimum length, the removal and addition of metadata, etc.) leading to a gradual divergence of reference genome files and their metadata (i.e., the genome data and metadata divergence problem, divergence problems are described by Haslhofer 2010 (13)).

      These are really great examples!

    11. all provenance information must come from external sources and be linked to the file name or checksum.

      I think NCBI handles much of this by putting the information directly in the FASTA header for each contig. The accession given to each contig creates a link, and I think it does act as a small tracker of provenance.

    12. Schoof 2003 and Niu 2022

      These citations follow a different format

    13. transcription

      Would you be willing to use a synonym here? After reading the abstract and first paragraph, I'm searching for clues that this metadata standard might also apply to other types of sequencing data (other FASTAs with e.g. assembled transcripts, amino acid sequences, genes, etc), and seeing transcription here and not referring to the central dogma is a little distracting

  8. Nov 2023
    1. pathways

      How did you deal with shared KOs between pathways when doing pathway-level analysis?
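
      To make the concern concrete, a toy example with invented KO assignments -- if a KO belongs to two pathways, does its abundance count toward both?

      ```python
      pathways = {
          "pathway_A": {"K00001", "K00002", "K00016"},
          "pathway_B": {"K00016", "K00101"},
      }
      print(pathways["pathway_A"] & pathways["pathway_B"])  # {'K00016'}
      ```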

    2. After all these filtered steps, we have 1747 high-quality data remaining for the downstream analysis, including 547 healthy samples, 274 type 2 diabetes samples, and 926 samples related to inflammatory bowel disease.

      Many of these sequences contain detectable human sequences. I would be curious for you to run the human genome against your databases and see what functional profile is returned. It would let users know whether they need to do host filtering before applying this approach. If they didn't need to take that step (or any other QC), that would be a huge time savings.

    3. -p protein, k=7, k=11, k=15, abund, scaled=1000

      How did you come up with these parameters, and how do you know they are the best to use? How would you advise users to choose between k-mer sizes for their own applications?
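
      To make the k-mer size question concrete, a toy illustration (sequences invented): longer protein k-mers are more specific, so similarity between two closely related sequences falls off as k grows.

      ```python
      def kmers(seq, k):
          return {seq[i:i + k] for i in range(len(seq) - k + 1)}

      def jaccard(a, b, k):
          ka, kb = kmers(a, k), kmers(b, k)
          return len(ka & kb) / len(ka | kb)

      seq1 = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"
      seq2 = "MSTNPKPQRKAKRNTNRRPQDVKFPGG"  # single substitution
      for k in (7, 11, 15):
          print(k, round(jaccard(seq1, seq2, k), 2))
      ```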

    4. We used BBMap [9] to simulate a metagenome from 1000 randomly selected genomes from all 4498 bacterial genomes present in the KEGG database.

      Reiterating the point from above, how does your approach break down with increasing evolutionary divergence from the reference, and how is that different from other tools? Soil might be a good ecosystem to test drive this in, and I think the CAMISIM tool allows you to introduce mutations from a reference in a known ratio/identity etc.

    5. The number of KOs (a total of only 25K) is much smaller than the number of genes, and the number of k-mers in a KO is much larger than that of a single gene. Considering these factors, we designed our pipeline to invoke sourmash gather with a list of all KOs in the KEGG database, and then to output a list of KOs that ‘cover’ all observed k-mers in a given metagenome.

      I did some work similar to this with the pfam database a couple years ago: https://github.com/taylorreiter/2021-pfam-shared-kmers

      I'm curious if you did any sort of analysis to see if there is shared k-mer content between orthologous groups, or if high shared content (as is observed in pfam) would limit the ability of this approach to be generalized to other databases.

    6. tinier

      shorter

    7. Next, we analyze the distinct functions among different conditions (Type 2 Diabetes, T2D; Healthy, HHS; and Inflammatory Bowel Disease, IBD). We conducted a LEfSe analyses [58] to unveil the key functional units/pathways that underlie the distinctions between the condition T2D vs. HHS and IBD vs. HHS.

      Can you do these same analyses with a tool like HUMAnN2 or something else that is typically used to do functional profiling and compare the results? Can you show that you capture more functional units than other tools, or is your method only faster? Would you need additional databases beyond just KEGG to make the comparison fair, and is that possible with the approach you have outlined here?

      I think the 2019 HMP IBD paper has a supplemental figure where they have KOs for each sample. It would be interesting to compare against those results for those samples to see if you get the same or different results (superset, subset, etc.).

    8. Using the functional profiles as input, we computed the pairwise FunUniFrac distances for T2D vs. HHS and performed MDS on the resulting pairwise distance matrices for visualization

      Is the code for this also in the linked GitHub repo? I couldn't find it, but I think it's an interesting application. It would be nice if something similar could be implemented for sourmash taxonomy results.

    9. pairwise distances between KOs obtained using sourmash sketch

      pairwise distances between KOs obtained by comparing sourmash sketches, right?

    10. sourmash clearly is the better choice when high-coverage samples are available.

      I think this is too strong a statement for the results presented. What about divergence between the metagenome and what's in the database? While using an amino acid k-mer will overcome some of this, I would expect DIAMOND to better capture the functional potential of a metagenome when the genomes are not in reference databases (I haven't explicitly done this test, though, so I don't know).

    11. We also found that KofamScan has exceptionally high resource requirements, and yet did not show promising performance.

      Again, I think this comparison is unfair since you aren't using assembled genomes.

    12. On the other hand, the use of lightweight sketches allows sourmash to avoid alignment altogether, and identify the list of all present KOs more accurately, using fewer computational resources.

      This is not always a benefit. The "alignments" output by DIAMOND can be super useful if the user wants to go back and do a targeted alignment of a specific gene of interest.

    13. We used two different k-mer sizes when running sourmash. In these experiments, we used a single active thread to run the sourmash gather program, and 64 threads to run DIAMOND to generate these results. The computational resources (total CPU time and memory) to generate these results are shown in Figure 2 (c and d).

      What about wall time? DIAMOND can be threaded, which is a huge plus, while sourmash gather cannot.

    14. From our simulation experiments, we found that KofamScan fails to scale to metagenomes with millions of reads (taking more than seven days to complete on a simulated metagenome with 1M reads) – making it an impractical choice for this task. Nevertheless, because KofamScan was developed so closely with the KEGG database, we present the comparison in this manuscript.

      This doesn't make a lot of sense as an application though, right? KofamScan is designed to run on ORFs predicted from assembled genomes, not on metagenome reads?

    15. The pipeline is freely available and can be accessed here: https://github.com/KoslickiLab/funprofiler

      I noticed that this repo doesn't have any unit tests and that the python script only contains 58 lines of code. Would it be possible to include this approach directly in sourmash?

    16. The primary use of alignment-based algorithms makes these a poor practical choice in terms of scalability

      Even more than this, many of these algorithms are limited to the setting of assembled (meta)genomes, and there are a substantial number of studies showing that short-read assembly often fails for metagenomes, especially those from complex communities. If your method can work directly on short reads, I think that is a huge strength that is worth highlighting.

      (I believe DIAMOND-based approaches will also work quite well on short reads, but many of the others do not. While I have used DIAMOND to search metagenomes against small databases [see the Serratus RdRp paper for inspiration here], I'm not sure how well it would scale to whole metagenomes against all of e.g. KEGG.)

    17. KOs

      Aren't they called KEGG Orthologs, which is abbreviated to KOs?

    18. These more popular alignment-based tools also lack the use of orthology relationships of the genes.

      This statement isn't clear to me. It seems like the KOALA and KofamScan algorithms do consider orthology; can you expand this statement to make it clear what it means?

    19. continue to turn to sketching-based methods, which are often faster and more lightweight; and theoretical guarantees of the sketching algorithms ensure their high accuracy.

      Can you provide citations for this point, both before and after the semicolon?

    20. east common ancestor

      last or lowest common ancestor?

    1. Application of this algorithm to single-cell data from other biological species can increase our understanding of biodiversity

      I have some questions about this statement. 1. Do you know if there is enough non-human single-cell data to do a similar study on a different organism? Perhaps mouse might have enough, for example? 2. Do you imagine this approach could be used for cross-species analysis, for example one in which a mouse is compared to a human, or do you think it is limited to within-species analysis?

    2. The gPRINT algorithm accomplishes this by reordering the genes expressed in each cell according to the human reference genome sequence HG38 (refer to the "Methods" section) and plotting the gene expression levels to generate its unique "gene print" (Figure S1). Building on the principles of deep learning applied in voice recognition, the algorithm treats the positional information of gene open expression as temporal information in a sound wave. Each gene interval is treated as a frame segment in a sound wave, and a one-dimensional neural network is used to learn from a specific reference dataset and automatically predict cell identities in the query dataset.

      This is a very clever manipulation of the input data that allows it to be analyzed in a new way. As someone who is relatively new to this field, it would be very helpful if, either in this section or the introduction, you could provide references for any approach within sequencing data that does something similar. If there is no similar approach to date, that would also be helpful to highlight. I think the idea of using an embedding is fairly common (e.g. word2vec), but it would be helpful to know the boundaries of innovation for this particular approach.

    1. Functional potential profiles were derived from good quality read 1 sequences using SUPER-FOCUS (33) software, linked to the Diamond sequence aligner (v0.9.19; 90) and version 2 100% identity-clustered reference database (100_v2; https://github.com/metageni/SUPER-FOCUS/issues/66). Where subjects/samples were represented by multiple sequence files, the combined SUPER-FOCUS outputs were normalized so that the total functional relative abundances summed to 100% in each subject/sample

      I'm really excited by the approach profiled in this paper. I think it's a very clever use of chemical reactions and stoichiometry. However, I'm concerned about the pre-processing step mentioned here. Do you have a sense of how lossy SUPER-FOCUS is, especially for soil microbiomes? Typically, metagenome analysis of soils can lose up to 80% of information due to the system complexity and the amount we have yet to observe. Depending on the fraction of functional information lost, how do you expect that to impact the results presented in this study?

  9. Aug 2023
    1. Coupled with the ability to annotate novel loci, the increased sensitivity of the de novo pipeline for detecting likely biologically meaningful differential expression might make it preferable over the reference genome-based approaches for studies aimed at broadly characterizing variation in the magnitudes of expression differences and biological processes. However, the reference-based ‘new Tuxedo’ pipeline might be more appropriate when a more conservative approach is warranted, such as for the detection and identification of candidate genes.

      Did you explore the option of merging the StringTie and Trinity transcriptomes? Tools like orthofuser or evidentialgene could be used for this. I would be curious how well this might work. Can you think of any reason this would be a bad idea?

    2. Assembly

      It would be super helpful in this section if you could compare the transcriptome assembly sequences themselves -- similarity and containment. I think this would help build intuition for the difference in transcriptome content between the different methods. A tool like sourmash can be used to estimate these metrics (sourmash sketch, then sourmash compare; sourmash plot can visualize the result). I'm curious if the Trinity assembly has lower mapping because it has fragmented transcripts which are resolved with the use of the genome.
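
      Concretely, something like this (a sketch only -- file names are placeholders):

      ```python
      import subprocess

      # Sketch each assembly, then compare with containment and plot.
      subprocess.run("sourmash sketch dna -p k=31,scaled=1000"
                     " trinity.fa stringtie.fa", shell=True, check=True)
      subprocess.run("sourmash compare trinity.fa.sig stringtie.fa.sig"
                     " --containment -o cmp", shell=True, check=True)
      subprocess.run("sourmash plot cmp", shell=True, check=True)
      ```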

    3. .

      small typo -- period should be a comma :)

    1. As an rRNA reference, we provide the rRNA database from SortMeRNA, a tool commonly used to filter rRNA from metatranscriptomic data. The database contains representative rRNA sequences from the Rfam and SILVA databases (see https://github.com/biocore/sortmerna/blob/master/data/rRNA_databases/README.txt).

      Can this be turned off? I can imagine for e.g. metagenomics one might not want these removed.

    2. Homo sapiens (Ensembl release 99), Mus musculus (Ensembl release 99), Gallus gallus (Ensembl release 99), Escherichia coli (Ensembl release 45), Chlorocebus sabeus (NCBI GCF_000409795.2), and Columba livia (NCBI GCF_000337935.1)

      Just curious if you've thought of benchmarking with T2T assemblies. I think it could be useful and cool (although perhaps beyond the scope of this preprint) to see how much contam changes based on the completeness of the contam reference.

      I'm also curious if these genomes needed to be masked at all (repeats, or ribosomal or other highly conserved sequences) so that off-target mapping doesn't occur. https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/37175-introducing-removehuman-human-contaminant-removal

    3. we also offer a kmer-based option with bbduk

      Can you change some of the early language to reflect this? The bbduk method is very different from mapping -- especially depending on the k-mer size that you end up using. It also will have very different output files (no BAM!), which is great, as BAMs are huge and a lot of people like to avoid them.

    4. Figure 1.

      This figure is a bit blurry, would it be possible to update to a higher resolution image? Additionally, this overview is much more useful than the one currently in the GitHub readme. Would you be willing/able to swap out the README figure and place this one in instead?

    1. In this manuscript, the authors present shournal, a tool to help with tracing shell commands that have been run on Linux computers. shournal sits in a space between iterative computational experiment and codifying those steps in a workflow. I'm excited by the concept of radical repeatability that lightweight tools like shournal could usher in.

      I was unable to install shournal from the instructions on the GitHub page, so this review does not cover feedback on the tool itself. I was eager to try out the snakemake integration and was sad not to be able to. I tried to install on an AWS EC2 instance (Ubuntu, t2.micro, using the latest release of shournal).

      From a high level usability and adoption perspective, I think two things currently decrease the likelihood of shournal's broad adoption. First, the fact that there is no Mac or Windows distribution decreases shournal's audience. Second, the fact that shournal may be ineffective on HPCs further limits the audience (both via shournalk and via the event history not tracing over multiple machines). These limitations do not decrease its conceptual addition to the field, but will decrease the likelihood of adoption.

      Culturally, I think there are pros and cons to shournal. On the pro side, I think having more tools in the reproducibility arsenal is a positive thing. Shournal meets scientists where they're at as they determine the best scripts to run on their data. However, I worry that reliance on shournal could lead to sub-par documentation for computational experiments. If researchers are in the habit of recording their commands along with notes, reliance on shournal may change this process, removing helpful metadata from command recordings. It is difficult to know how a tool like shournal could change the overall working habits of e.g. bioinformaticians, but it would be interesting to conduct a study on how adoption of shournal improves or detracts from reproducibility and documentation. (To be clear, I am definitely NOT suggesting that that be done as part of this paper! But I think shournal could encourage a sea change in computing documentation, so it would be interesting from a metascience perspective to understand the benefits and drawbacks of those changes, and then how shournal could eventually be modified to reduce the drawbacks.)

      One of the limitations suggested in the supplement is that, "provenance of binary executables is not tracked." Would it be possible to parse the help messages of binary executables or look at the stdout for version numbers or other tells of the software? This is far more inelegant than shournalk's current approach, but I wanted to supply it as a brainstorming idea in case the authors find it useful to iterate from. Alternatively, could the checksum of the binary executable be tracked?
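
      On the checksum idea, even something this simple could work (a sketch, not shournal's actual API):

      ```python
      import hashlib
      import shutil

      def executable_checksum(command):
          """Resolve a command on PATH and hash the binary it points to."""
          path = shutil.which(command)
          if path is None:
              return None
          with open(path, "rb") as fh:
              return hashlib.sha256(fh.read()).hexdigest()

      print(executable_checksum("ls"))
      ```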

      Lastly, I left comments inline on the manuscript itself, but I also wanted to note that the first paragraph of the supplement provides important background knowledge that I think would be better served in the introduction of the paper if there is space to include it.

    2. https://github.com/tycho-kirchner/ shournal

      I think this URL has a typo in it

    3. Second, tracing of file actions is limited to the comparatively rare close operation and lets the traced process return quickly by delegating further provenance collection to another thread

      Clever -- so something like ls or cd would be totally ignored, but any program that actually looks at the data of a file will register, right?

    4. Typical workflows

      This is slightly confusing wording for me -- does this mean typical shournal runs? For me, workflow is conflated with workflow engines, which I typically assume have a fairly substantial overhead (e.g. a snakemake workflow with a DAG of over 1M processes can take ~16 GB of RAM to run)

    5. run permanently

      What does this mean? It seems like shournal still needs to be activated at each shell session. Are you recommending that this be achieved by adding activation to the bash profile/bashrc, or is this only a statement that shournal was engineered to have a low enough overhead that it could run indefinitely, even simultaneously with processes that demand high CPU, I/O, and/or RAM?

    6. conceptually more extensive design goals

      Do you have space to expand on this concept, even with another half sentence or sentence, on how the goals of shournal differ from the previously mentioned tools? I'm a huge nextflow & snakemake user and a big proponent of repeatable and reproducible computation, but I don't have the background to contextualize this comment which I think draws away from the potential impact of shournal in this space.

    7. Ruiz, Richard

      I think this citation may be broken

    8. making later re-execution easier

      by virtue of being recorded so a scientist can go back and re-trace their steps, or through some other mechanism?

    9. (d) shournal’s tracing performance in various scenarios as relative runtime overhead. Boxes for both, kernel module- (KMOD) and fanotify backend are displayed. For comparison, our measured tracing overhead of Burrito, SPADE and the ptrace-based strace is shown as well.

      Is the unit for the y axis seconds?

    10. even if the original files have been modified or deleted

      For clarification, this is only if the original scripts or configuration files have been deleted, not if the data files (like a FASTQ or something) have been removed?

    1. https://github.com/yge15/Cancer_Microbiome_Reanalyzed

      Can you make the code you used in this preprint available here as well?

    2. Luo et al. (20), Zhu et al. (21), F. Chen et al. (22), Narunsky-Haziza et al. (23), C. Chen et al. (24), Lim et al. (25), Bentham et al. (26), Y. Kim et al. (27), Y.Xu et al. (28), and Y. Li et al. (29)

      Were you time-limited in examining these, or is it more difficult to tell for these than for the above cases whether the results represent true biology?

    3. Note that we do not know precisely where Poore et al. went wrong in applying the normalization code

      It would be helpful to know: 1. Were all of the data and code available to try to exactly repeat what Poore et al. did? If not, what ingredients are missing? 2. If you are able to exactly repeat what Poore et al. ran, do you get the exact same results? If not, is it because they didn't report e.g. a random seed value, or does it seem like the code that is reported isn't what was actually used?

    4. Note that even with two rounds of alignment against the human genome, many of the reads in each sample were still classified as human by the Kraken program using our database.

      Given that human contamination is a leading issue here, I think it could be interesting to show what mapping against a human pangenome accomplishes in terms of reducing the number of human sequences in unmapped reads. Similarly, a program like BBMap's bbduk.sh could be used to remove human sequences. It's similar to Kraken in that it detects matches based on k-mers. While neither of these things is necessary to prove the points in this preprint, I think this preprint might receive a lot of attention and it could be a gift to the community to demonstrate the most effective ways to reduce human reads in a sample when that is the goal. This would also have applications to the metagenomics field for host-associated samples, where removal of human reads from e.g. gut microbiomes should be performed before data deposition.

  10. Jul 2023
    1. Thanks for the preprint! It's very cool that enough public data exists to do this sort of analysis. I have some concerns about the "Imputed Metagenomics" section. I think the success of the picrust approach is dependent on how well the species in the microbiome are represented in reference genomes. I have some doubts that corn microbiome species would be well represented in public databases, especially for ecosystem-specific genes in the pangenome. I think it would be a huge value-added and validation if you were able to show that picrust accurately predicts the function of corn metagenomes. You could do this by identifying a publicly available paired data set for each of your sample types (soil, rhizosphere, roots, and leaves) and comparing functional annotation of the shotgun metagenome against picrust on the 16S. If such data don't exist, you could use a tool like groupM to pull out 16S sequences from a metagenome and analyze those with picrust, and then compare them to the functional annotation of the metagenome. Without this type of analysis, I think the "Imputed Metagenomics" section is overstated.

    1. Thanks for the lovely paper! I have some comments on the section, "Binning of Prokaryotic Genomes and Bin Refinement." First, I'm curious if you tried to decontaminate any bins. I think this could be an interesting step, whether they initially had contamination greater than or less than 5%. I think GUNC can be used for this. I've also contributed to a software called charcoal that could work for this. Second, I have concerns about the bin refinement you performed. I think assemblers (especially short-read assemblers, so this may be less relevant here) typically break contiguous sequences and produce fragments when there is either incomplete sequencing or strain variation that causes a bifurcation in the underlying graph. Could your refinement technique be merging contigs together that would never be combined in nature? It could be interesting to investigate the underlying assembly graph for some of these merges, either in the short or long reads. I see you think about this in the paragraph, "This bin refinement method allows the merging of contigs..." I think it might be worth additional investigation to ensure this is not happening, as the methods you use here could be precedent-setting for long read metagenomes (especially those sequenced without accompanying Hi-C sequencing data).

      Thanks for this effort, I really enjoyed your preprint. I loved the note at the beginning about the length of the preprint :)

    1. I'm very excited by mTMAT. In the past, I've struggled to find robust statistical techniques to analyze longitudinal microbiome data, so I'm very excited to have another method in my toolkit. After reading your preprint, I have a few questions about the versatility of mTMAT. 1. Would it be possible to use mTMAT on shotgun metagenomic data? I'm envisioning longitudinal samples for which species or strain counts have been inferred. 2. How would mTMAT behave in the face of large microbial transitions over time? For example, what about microbiomes where the composition fluctuates diurnally? Or where a large change happens over the course of a year (like the onset of inflammatory bowel disease)? 3. The language at the beginning of the paper emphasizes association with disease state. I'm curious if mTMAT is flexible for application to any variable set. For example, could I apply this tool to wine microbiomes? I assume yes from the way the pregnancy data set is presented but wanted to check.

  11. Jun 2023
    1. Venn diagrams

      zDB looks really cool, I'm excited to try it! Have you thought about including an UpSet plot instead of a Venn diagram? I think it could help make intersections, especially between many genomes, more clear.

  12. May 2023
    1. Materials and Methods

      I've always used bbduk.sh to remove host reads. I've used the database linked below, and introduced in the seqanswers post below. One of the benefits of this method is that it separates FASTQ reads into different sets without requiring a bam, so it saves on hard disk space. The method below also uses a masked reference so as not to remove true microbial reads that have homology to the human genome. My two comments are: 1. Could you benchmark with bbduk as well? I think this could be a real contribution in this space. 2. Could you mask the T2T assembly like the bbduk author did? I've really appreciated not accidentally removing plant/fungal etc sequences from my metagenomes.

      Seqanswers introduction to method: http://seqanswers.com/forums/archive/index.php/t-42552.html Database link: https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?usp=sharing
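
      A minimal sketch of the bbduk call I have in mind (file names are placeholders; the masked reference is the file from the database link above):

      ```python
      import subprocess

      # Placeholder paths; ref= points at the masked human reference.
      subprocess.run(
          [
              "bbduk.sh",
              "in=sample_R1.fastq.gz", "in2=sample_R2.fastq.gz",
              "ref=masked_human_reference.fa.gz",
              "out=nonhuman_R1.fastq.gz", "out2=nonhuman_R2.fastq.gz",  # no k-mer match
              "outm=human_R1.fastq.gz", "outm2=human_R2.fastq.gz",      # k-mer match
              "k=31",      # k-mer length used for matching
              "hdist=1",   # allow one mismatch per k-mer
          ],
          check=True,
      )
      ```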

    2. Establishing the per-read gold standard. By each method, a read was given a label as whether it was of host origin (for kraken2, a TaxID assignment to any Chordata species was considered as a host label), but the true label of the read is unknown. Therefore, we compared all the labels given to each read to establish a de facto “ground truth”, or gold standard, by imposing the following criteria. A read was assigned a consensus label if concordant results were found for all methods. When the results were discordant, the reads were subjected to further examination. We used BLAST search results as the discriminating standard to resolve the discrepancy, because BLAST is an expensive yet very sensitive and more robust algorithm.26 Since the number of reads with discordant labels were too large to be all aligned by BLAST, we narrowed down the discrepancy by allowing a more tolerant standard for assigning consensus labels. If at least an alignment-based method and at least a k-mer-based method labelled a read as derived from host, we gave the host label to this read as the gold standard. The remaining reads with discordant labels were queried against the NCBI nr/nt database with BLAST (blast+ v2.12.0). If a hit to Chordata sequences were found with sufficiently high alignment quality (identity 90, coverage 90 for short reads and identity 70 for long reads), the read was considered truly of host origin, and otherwise non-host.

      I think it might be worthwhile to create a simulated set of reads from the human genome and from some microbes, such as E. coli or others with high quality genomes where it can be confirmed that there is no human contamination therein. Especially for ribosomes and other sequences with homology, it could be very difficult to establish a ground truth. The inquiries you did with real data are super important as well, and well designed given that real data is messy, but I think simulated data would be a strong contribution here.
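
      As a sketch, the simplest version is drawing labelled, error-free substrings from each reference; a simulator like wgsim or ART would add realistic error profiles on top of this (all names here are hypothetical):

      ```python
      import random

      def simulate_reads(genome: str, n_reads: int, read_len: int, label: str):
          """Draw n_reads random error-free substrings from a genome string,
          tagging each with its origin so the true label is unambiguous."""
          reads = []
          for i in range(n_reads):
              start = random.randrange(len(genome) - read_len)
              reads.append((f"{label}_{i}", genome[start:start + read_len]))
          return reads

      # e.g., with human_seq and ecoli_seq loaded from FASTA beforehand:
      # truth = (simulate_reads(human_seq, 10_000, 150, "human")
      #          + simulate_reads(ecoli_seq, 10_000, 150, "ecoli"))
      ```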

    1. For all 10,487 data sets generated and collected for this study, reads were assembled de novo into contigs using MEGAHIT v1.2.8 45 with default parameters

      It would be really interesting to see the alignment rates -- e.g. what fraction of each sample assembled, and if this varies by biome. This would give us some sort of idea if there were other viral reads left on the table due to non-assembly

    2. That the 180 RNA viral superclades identified represented RNA-based organisms was verified by multiple lines of evidence.

      Did you do any sort of contamination screen here to see if any of your hits were off target or had homology to other sequences? Either against BLAST nt/nr or against metagenomes or something?

    3. Independently to the deep-learning approach, we applied a more conventional approach (i.e., “ClstrSearch”) that clustered all proteins based on their sequence homology and then used BLAST or HMM models to identify any resemblance to viral RdRPs or non-RdRP proteins.

      Did you do validation here? We've recently done something similar and noticed that we have to filter our diamond BLAST-equivalent results to 90% identity, or else we get a ton of off target hits.

    4. The latter approach is distinguished from previous BLAST or HMM based approaches because it queries on protein clusters (i.e., alignments) instead of individual sequences, which greatly reduced both the false positive and negative rates of virus identification.

      Clever. Reminds me of NCBI's new clustered nr database.

    5. The major AI algorithm used here (i.e., “LucaProt”) is a deep learning, transformer-based model established based on sequence and structural features of 5,979 well-characterized RdRPs and 229,434 non-RdRPs. LucaProt had high accuracy (0.03% false positives) and specificity (0.20% false negatives) on the test data set (Fig. 1b, Extended Data Fig. 4).

      Nice! I have two questions about this. 1. Are there any problems that could arise in training because this training set is so unbalanced? 2. How do your input RdRPs compare to those used in Serratus?

  13. Apr 2023
    1. through a Bioconda recipe

      I saw that noarch was specified on conda, but when I tried to install it via conda on an M1 mac, I encountered issues whether running on arm64 or rosetta.

    2. Meanwhile, in the same benchmark, alignment-free methods appeared to contain a genuine and strong phage-host signal for a broader range of phages, but more complex to parse as the highest scoring host was often (>50% of the time) yielding an incorrect prediction at the species, genus, and family level

      I think this sentence has a missing word

    3. The main exception to this pattern was the unexpectedly high number of host predictions to the Bacteroides genus for marine

      This is interesting. I'm curious if this could also stem from contamination in the databases, as mentioned in the next sentence. Is there a way to systematically evaluate this? (e.g. potential for kit contamination in isolates vs assembly/binning contamination in MAGs)

    4. Random Forest Classifiers were built using the TensorFlow Decision Forests v0.2.1 [64] package within the Keras 2.7.0 python library [65], with parameters optimized with the Optuna v2.5.0 framework [66]. Parameters to be optimized included maximum tree depth (between 4 and 32), minimum number of examples in a node (between 2 and 10) and number of trees (between 100 and 1,000). A total of 100 trials were performed, each was evaluated on the test dataset, the 5 classifiers with the highest accuracy were selected as the best candidates, and the candidate with the highest recall at 5% FDR was then selected as the final iPHoP-RF classifier

      If I'm reading this correctly, this design wouldn't allow you to estimate overfitting. Have you brainstormed any ways to make a validation set for this model?
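
      For example, a three-way split, with hypothetical stand-in data, would let you tune hyperparameters on a validation set and keep the test set for a single final estimate of overfitting:

      ```python
      import numpy as np
      from sklearn.model_selection import train_test_split

      # Hypothetical stand-ins for the phage-host feature matrix and labels.
      X = np.random.rand(1000, 20)
      y = np.random.randint(0, 2, size=1000)

      # Hold out a test set first, then carve a validation set out of the
      # remainder; run the tuning trials against (X_val, y_val) only, and
      # touch (X_test, y_test) once at the end.
      X_trainval, X_test, y_trainval, y_test = train_test_split(
          X, y, test_size=0.2, random_state=0)
      X_train, X_val, y_train, y_val = train_test_split(
          X_trainval, y_trainval, test_size=0.25, random_state=0)
      ```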

    5. Based on a large database of metagenome-derived virus genomes

      Would it be possible to add context into the abstract about whether this is a new database, a combination of existing data bases, or just an existing database? I think that would be useful information to have up front

    6. As host references, we opted to use all genomes included in the GTDB database34, supplemented by additional publicly available genomes from the IMG isolate database35 and the GEM catalog36.

      How does this set of genomes compare to e.g. all bacteria and archaea in GenBank? Was there a reason for excluding GenBank?

    1. To validate the results from the KrakenUniq pre-screening step and further eliminate potential false-positive microbial identifications, aMeta performs an alignment with the Lowest Common Ancestor (LCA) algorithm implemented in MALT [20]. Alternatively, aMeta users can also select Bowtie2 for a faster and more memory-efficient analysis but lacking LCA alignments, see Supplementary Information S2. While being more suitable than Bowtie2 for metagenomic profiling, MALT is very resource demanding. In practice, only reference databases of limited size can be afforded when performing analysis with MALT, which might potentially compromise the accuracy of microbial detection. For more details see Supplementary Information S3. In consequence, we aim at linking the unique capacity of KrakenUniq to work with large databases with the advantages of MALT for validation of results via an LCA-alignment. For this purpose, aMeta automatically builds a project-specific MALT database, based on a filtered list of microbial species identified by KrakenUniq. In other words, the combination of microbes across the samples remaining after depth and breadth of coverage filtering of the KrakenUniq outputs is used to build a MALT database which allows the running of LCA-based MALT alignments using realistic computational resources. We found that this design provides two to six times less computer memory (RAM) usage compared to traditional ways of building and using MALT databases, see Supplementary Figure 6.

      The genome-grist tool does something similar but is built around sourmash gather and BWA for mapping: https://github.com/dib-lab/genome-grist

      I'm curious what the advantage of mapping with an LCA algorithm is here. We've found LCA methods to lead to higher false positives for genome identification (see here: https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It would be really helpful to have this explained a bit more.
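
      For comparison, a minimal sketch of the gather-based route (database and file names are placeholders):

      ```python
      import subprocess

      # Sketch the sample, then assign its k-mer content to reference
      # genomes by minimum set cover rather than per-read LCA.
      subprocess.run(
          "sourmash sketch dna -p k=31,scaled=1000 sample.fastq.gz -o sample.sig",
          shell=True, check=True)
      subprocess.run(
          "sourmash gather sample.sig reference-db.zip -o gather_results.csv",
          shell=True, check=True)
      ```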

    2. Figure 2 schematically demonstrates why detecting microbial organisms solely based on depth of coverage (or simply coverage), which is largely equivalent to the number of mapped reads, might lead to false-positive identifications. Suppose we have a toy reference genome of length 4 ∗ L and 4 reads of length L mapping to the reference genome. When a microbe is truly detected, the reads should map evenly across the reference genome, see Figure 2B. In contrast, in case of misaligned reads, i.e. when reads originating from species A map to the reference genome of species B, it is common to observe “piles” of reads aligned to a few conserved regions of the reference genome, which is the case in Figure 2A (see also Supplementary Figure 1 for a real data example, where reads from unknown microbial organisms are forced to map to Yersinia pestis reference genome alone).

      This is a really clear explanation!

    3. and selects reads of length above 30 bp which have a good taxonomic specificit
    4. e statistically robust enough

      What statistics does this refer to?

    5. When comparing different KrakenUniq databases, we found that database size played an important role for robust microbial identification, see Supplementary Information S1. Specifically, small databases tended to have higher false-positive and false-negative rates for two reasons. First, microbes present in a sample whose reference genomes were not included in the KrakenUniq database could obviously not be identified, hence the high rate of false-negatives of smaller databases. Second, microbes in the database that were genetically similar to the ones in a sample, appeared to be more often erroneously identified, which contributed to the high rate of false-positives of smaller databases. For more details see Supplementary Information S1.

      This matches intuition, but is super useful to have it written out so clearly.

    6. Another popular general purpose aDNA pipeline, nf-core / eager [32], implements HOPS as an ancient microbiome profiling module within the pipeline, therefore we do not specifically compare our workflow with nf-core / eager but concentrate on differences between aMeta and HOPS.

      Would it be possible to contribute parts of your pipeline back to the nf-core eager pipeline? I think it could give it more visibility and be more reproducible than your snakemake pipeline, given the use of docker containers etc. Your approach seems very nifty and it would be great to make it more available! Parts of this might be simplified by work being done on the nf-core taxprofiler workflow, which I think is building modules for krakenuniq

    7. The workflow is publicly available at https://github.com/NBISweden/aMeta.

      I took a look at your repository and it was very easy to navigate and follow! I have a couple of suggestions though. Would you be able to pin versions of software tools recorded in your envs/*.yaml files? This will make your pipeline more reproducible.

      It would also be helpful if you could parameterize your output directory. A lot of clusters separate write directories from run directories, so this would help with portability.

      Lastly, I'm curious if k-mer trimming would improve your results. khmer's trim-low-abund.py can trim samples with variable coverage using the -V flag if you're interested in exploring that further!
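
      A minimal sketch of the invocation I mean (the cutoffs and memory cap are illustrative, not recommendations):

      ```python
      import subprocess

      # -V  variable-coverage mode: don't trim reads from high-coverage regions
      # -Z  coverage level above which trimming is applied
      # -C  abundance cutoff below which k-mers are considered errors
      # -M  memory cap for the k-mer counting table
      subprocess.run(
          "trim-low-abund.py -V -Z 10 -C 3 -M 8e9 sample.fastq.gz",
          shell=True, check=True)
      ```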

    1. https://github.com/deepomicslab/GutMeta_analysis_scripts.

      I'm so excited your code is open source! Your website is so beautiful I assumed the code would be hidden. Very excited to explore everything there.

    2. https://GutMeta.deepomics.org

      Would it be possible to make it so that test data sets have pre-calculated results? I wanted to demo your web service, but the results took a very long time to generate even though I used the Demo Fileset CRC_health.100

      Also, is it possible for the query task status page to auto-refresh the status? It stated "running" until I refreshed and saw that it was finished.

  14. www.biorxiv.org
    1. he environment

      Is this the readily-measurable environment? Could there be things that are predictive of the environment that weren't measured but impact evolutionary trajectories?

    2. identical

      Are the environmental regimes truly identical, or highly similar? The use of identical here makes me think of highly controlled laboratory settings instead of different locations

    3. Deposition of sequencing data is also detailed in Belser et al.2022

      When I follow the links detailed in Belser et al, I can only find coral amplicon sequencing and aerobiomes, plus two other targeted sequencing experiments. Would you be willing to provide the exact accessions for this data in this paper? It would make the data much more findable!

    4. 32 sites

      In the last paragraph of the introduction, the authors state that samples were taken from 33 sites. Is it 33 or 32?

    5. We aligned Illumina-generated 150-bp paired-end metagenomic reads sequenced from each colony to the predicted coding sequences of the Pocillopora meandrina and Porites lobata host reference genomes using Burrows–Wheeler Transform Aligner (BWA-mem, v0.7.15) with the default settings (H. Li and Durbin 2009). A read was considered a host contig if its sequence aligned with ≥ 95% sequence identity and with ≥ 50% of the sequence aligned. Host-mapped reads were sorted and filtered to remove sequences which contained > 75% of low-complexity bases and < 30% high-complexity bases using SAMtools v1.10.2 (Heng Li et al. 2009) with the resultant bam files visualized using the Integrated Genomics Viewer (Robinson et al. 2011). The reference genomes were indexed with picardtools’ v2.6.0 (https://broadinstitute.github.io/picard/) CreateSequenceDictionary before performing local realignment around small insertions and deletions (indels) using GATK’s RealignerTargetCreator and IndelRealigner to reduce false positive variant identification and represent indels more parsimoniously.

      It would be super helpful if you could report the fraction of each metagenome that aligned to each coral species in the results or as a supplement

    6. assembly produced 302,299 transcripts, which were then clustered into 40,560 unigenes (i.e. uniquely assembled transcripts) with N50 of 1,053 bp. Based on BUSCO, the assembled transcriptome was highly complete with 85.6% of the ortholog genes from the Eukaryota database being present with low fragmentation (3.6%), missing (10.8%), and duplication (16.7%) metrics.

      if this was based on metatranscriptomes, what percent of the contigs were from non-eukaryotic species? What percent were from non-coral species?

    7. produced SSH

      Could you use a synonym here? This acronym makes it difficult to dive directly into the discussion and interpret the most salient points

    8. FIGURE LEGENDS

      Many of your figures have both red and green in the palette, would you be willing to change to a color-blind friendly palette to make your figures more accessible?

    9. To begin to answer these questions, we analyzed samples from the Tara Pacific Expedition (Planes et al. 2019), that ran from 2016 to 2018 and completed an 18,000 km longitudinal transect of the Pacific Ocean, sampling three widespread corals – Pocillopora meandrina, Porites lobata, and Millepora cf. platyphylla – across 33 sites from 11 islands. We conducted ultra-deep metagenomic sequencing of 269 coral colonies in combination with morphological analyses and climate data to determine the standing genetic diversity and population structure of the coral species under study, identify several cryptic species, and reveal disparate patterns of environmentally linked genomic loci under selection.

      This data set is truly such a gift to the scientific community. I love how this paragraph gives a large scale overview of all of the data collected.

    10. https://doi.org/10.5281/zenodo.7229405

      This Zenodo record is under restricted access; would you be willing to open access now that the preprint is posted?

    11. Our dataset comprising samples of multiple species of Pocillopora and Porites across the same basin-wide range has enabled us to demonstrate contrasting evolutionary histories in these corals, despite exposure to the same past thermal, and arguably environmental, regimes.

      Is this because these corals were sampled side-by-side? If that's the case, I think stating that more explicitly in the abstract, intro, and here would be really helpful signposting to the readers (and would negate my initial comment on "identical" in the abstract)

    1. 3.2 Factors affecting eukaryotic abundance in DWDS metagenomes

      I'm not sure if this is helpful, but especially if you end up with specific genomes that you want to look for, you could try using sourmash branchwater: https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1. If you have a eukaryotic genome you're interested in, you could sketch it (sourmash sketch) and then use the branchwater tool to search most metagenomes in the SRA to see which ones have high containment with the genome you searched. You could then use the SRA metadata tables to filter to wastewater samples and then dig in more to the biogeography of those.
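
      Generating the query signature is a single command; a minimal sketch (the genome file name is a placeholder, and I believe branchwater indexes at k=21/scaled=1000, so sketch to match):

      ```python
      import subprocess

      # Sketch the eukaryotic genome of interest; the resulting .sig file
      # is what gets submitted to the branchwater search.
      subprocess.run(
          "sourmash sketch dna -p k=21,scaled=1000 my_genome.fa -o my_genome.sig",
          shell=True, check=True)
      ```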

    2. The majority of the sequenced data in metagenomic assemblies from complex environmental samples are typically contained in short contigs (e.g., < 5 kbp), especially in case of complex communities with low abundance organisms17,75,76

      This would be really helpful context to have in the introduction, since it would inform why you chose to structure the methods (short kb contigs) the way you did.

    3. k-mer signature differences

      Would you be willing to briefly describe the size of k-mer used for this? I could imagine very different results for k-mer size of 4 (tetranucleotide abundances) vs. 21 or 31 (which are generally genus or species specific)
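
      To illustrate why this matters, here's a toy comparison of k-mer sharing at k=4 versus k=21: two unrelated sequences overlap heavily in the tiny 4-mer space (only 256 possibilities) but almost nowhere at k=21, so small k measures composition while large k measures shared identity.

      ```python
      from collections import Counter

      def kmer_profile(seq: str, k: int) -> Counter:
          """Count all length-k substrings of a sequence."""
          return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

      def shared_fraction(a: str, b: str, k: int) -> float:
          """Fraction of distinct k-mers of `a` that also occur in `b`."""
          ka, kb = set(kmer_profile(a, k)), set(kmer_profile(b, k))
          return len(ka & kb) / max(len(ka), 1)

      # For two random 10 kb sequences, shared_fraction(a, b, 4) is ~1.0
      # while shared_fraction(a, b, 21) is ~0.0.
      ```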

    1. Data availability: Raw data are available under NCBI BioProject IDs PRJNA891910 (Adult CAMAs, MiSeq and NovaSeq raw data, MAGs), PRJNA891898 (Larva CAMAs, MiSeq raw data), PRJNA891892 (Whole larva microbiome, MiSeq raw data).

      These are such exciting data sets! I tried searching for them on the SRA and the ENA, but the accessions could not be found. Would you be willing to make your data public now that you have released your preprint?

    1. Some detected rRNA could represent contamination or microbiome composition, as has also been reported by a recent microbial analysis of human single cells (Mahmoudabadi, Tabula Sapiens Consortium, and Quake, n.d.) (Supplement).

      I have only worked with two single cell data sets so please forgive me if my comment is naive, but given that you suggest that NOMAD+ works on raw single cell reads and the proclivity of raw sequencing data to contain microbial contamination, do you think it would be beneficial to recommend that users screen their single cell data for microbial contamination prior to running NOMAD+? One rapid way to do this would be with the sourmash gather algorithm (https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). Again, I'm not sure how contamination in single cell compares to that of bulk RNA seq, but this is an interesting manuscript that profiles what is in the leftover fractions of transcriptomes that don't map to the human genome: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1403-7

    1. )

      missing punctuation

    2. Bacteria acquired from plants genes that primarily encode for secreted proteins that metabolize carbohydrates, thereby enabling bacteria to grow on plant-derived sugars.

      This sentence was a little hard to understand, I found myself reading it a few times to make sure I grasped the meaning...but it's an important point! I wonder if it can be rephrased? Maybe removing "from" at the beginning of the sentence would help?

    3. Using the current taxonomic information we cannot tell if this HGT pattern is the result of gene transfer to one microbial phylum followed by an additional transfer to another microbial phylum, or independent gene transfer event to multiple phyla.

      Is there any information that does exist that would allow you to have a better understanding of this? For example, I'm guessing a lot of the microbes you looked at have other isolate genome sequences in public databases; could you potentially do a pangenome analysis to help understand the trends you identified? (Not suggesting you do this for this paper! But I think this is interesting to brainstorm around and might warrant an additional sentence on potential future methods someone could use to answer this type of question)

    4. However, we did not identify bacterial-Arabidopsis homologous protein pairs with amino acid identity above 76%, rejecting the possibility of hypothetical DNA contamination which would lead to highly similar protein sequences

      This is a really helpful statement and something I was really curious about -- contamination is always a super hard thing to separate from HGT. I wonder if it would be easy to filter this analysis to only long read genomes, and to look at the identity of nearby proteins to the candidate HGTs. Barring misassemblies/chimeric contigs, it might be easier to trace the HGT events in longer reads. Probably not necessary here, just brainstorming about alternate signals that could really solidify that these candidates are HGT and not contam!

    5. A.

      I found myself a little confused by the box in the second row on the left hand side. Is this an outline of results as well? Or did you determine which 582 plant associated proteins to screen? Or is this 582 bacteria?

    6. Interestingly, this level of sequence similarity is shared between the Arabidopsis DET2 and its homologues from rice and barley. Importantly, the proteins from other eukaryotic plant pathogens such as Phytophthora and Pythium share weaker identity (maximum 41% identity) to the Arabidopsis protein than the Leifsonia-Arabidopsis DET2 similarity.

      This is a little hard to interpret as written. Does this suggest that Arabidopsis DET2 was the acceptor of Leifsonia DET2, or that the transfer may have occurred at the LCA between Arabidopsis and rice and barley?

    7. B.

      Additional details about the meaning of the labels would be helpful. I was confused especially by "3. HGT between eukaryotes and PA bacteria"

    8. doman

      typo?

    9. These results also support our current results that nearly all transfers into Arabidopsis are ancient and are shared by monocots and dicots. We could not detect a bacterial gene that was transferred directly to Brassicaceae and is absent from other dicots.

      Very helpful statement! I think this also says something interesting about your methods. For "ancient" HGT events, it doesn't matter that much what plant genome you look at to identify them

    10. was“unknown”

      typo?

    11. We analyzed the number of introns found in Arabidopsis genes transferred from PA bacteria assuming that they would contain less introns than the average plant gene. However, we found no statistical difference between the number of introns of horizontally transferred genes and all other Arabidopsis genes (Supplementary Figure 16).

      Very cool/clever!

    12. To summarize this analysis, in most cases the taxonomy of the bacterial donor or acceptor is unknown, and Actinobacteria and Proteobacteria relatively rarely donate or accept genes from plants.

      It could be interesting to complement this type of analysis by looking at GC content or tetranucleotide frequency in these genes across genomes, and how that compares to the host genome background. How similar are these genes to the rest of the genes in a given bacterium's genome?
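
      A minimal sketch of the GC version of this comparison (sequences as plain strings; the window size is arbitrary):

      ```python
      def gc_content(seq: str) -> float:
          seq = seq.upper()
          return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

      def gc_background(genome: str, window: int = 5000) -> list:
          """GC of non-overlapping windows, as the genome-wide background."""
          return [gc_content(genome[i:i + window])
                  for i in range(0, len(genome) - window + 1, window)]

      # A candidate HGT gene whose GC falls far outside the distribution of
      # the host genome's windows looks like a recent acquisition; ancient,
      # ameliorated transfers will have drifted toward the background.
      ```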

    1. --phred33

      This is fairly stringent phred score trimming. I'm curious why you used such a high threshold. I'd also be curious to see if dropping the phred score to ~10 increased viral genome recall in these samples

    2. Metagenome data set

      Although this would be a separate effort from what you have reported here, I think it could be fascinating to expand this type of inquiry beyond the GEOTRACES data set. Using a tool like sourmash branchwater, you could use the viral genome database you describe above and search for metagenomes on the sequence read archive that contain those genomes. It could be a very nice complement to see what sort of global diversity/biogeography there is for these types of viruses. https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1

    3. Subsampling reads and calculating diversity

      I'm curious if you've ever explored alpha diversity estimation metrics that don't require subsampling. The StatDivLab out of UW has developed some really lovely software for doing this type of thing. DivNet in particular might work as a drop in replacement to subsampling, allowing the full amount of data captured in each sequencing run to be used while accounting for unequal sequencing depth: https://github.com/adw96/DivNet

    1. . For instance, among the 313 non-redundant HGT trees for Homo sapiens, Pan troglodytes was found in 312 of them, therefore the HGT-appearance number NHP between Homo sapiens and Pan troglodytes was 312.

      This is still fairly confusing and I'm not sure what it means

    2. Widespread of HGTs among eukaryot

      Did you do anything to deal with contamination? Contamination is fairly widespread, even in refseq genomes, and might lead to unexpected results.

    3. Functional annotation for genes overlapping with HGTs (see Methods) revealed some significantly enriched Gene Ontology terms (GO terms) (Bonferroni<0.05) for protein-coding genes from mouse, fruit fly and nematode as well as non-coding genes from yeast. (Table S11). The significant GO terms for nematode were “hemidesmosome, intermediate filament”, while the significant GO term for mouse was “protein kinase A binding”. HGTs in fruit fly that overlapped with coding genes were enriched for “ATP binding, lipid particle, microtubule associated complex”, etc. HGTs in yeast overlapped with non-coding genes enriched for “retrotransposon nucleocapsid, transposition, RNA-mediated, cytosolic large ribosomal subunit”, etc.

      Shouldn't this be a part of the results section?

    4. novel

      Can you explain what you mean by novel? Not found by other studies? Only found in one genome?

    5. HGTs were clustered using the cd-hit-est program (version 4.6.6)[43] with minimum nucleotide identity set at 80

      This might not be low enough to detect orthology if the HGT event is ancient. One recent paper showed similarity drops as low as ~40% https://doi.org/10.1101/2022.08.25.505314

    6. We further evaluated the pipeline with a genome containing simulated HGT regions. Since our HGT identification pipeline has two main steps, sequence composition-based filtering step and genome comparison step. The evaluation was done for the two steps (Figure S3, Table S1). While top 1% fragments were input to the pipeline, 20.6% correct results would be identified after sequence composition-based filtering and 14.3% correct results identified after genome comparison. When the percentage of fragments input was up to 50%, 83.4% and 77.7% correct results were identified after two steps respectively. It can be seen that the precision of prediction was higher than 60% for all cases. This indicated that we may have underestimated the number of HGTs (low recall rate) but majority of the identified HGTs were highly reliable.

      This paragraph was a bit confusing to follow but I think I got the gist of it after a few passes through! I'm curious if you thought about controlling for natural variation in 4mer frequency throughout the genome, as some other methods have found that this helps reduce off target predictions (reviewed in https://doi.org/10.1371/journal.pcbi.1004095). It may not be necessary since you do a second step after the initial screen, but I was just curious if that was something you thought about putting in place, and if so, why you decided against it

    7. non-redundan

      Would you be willing to provide a clearer definition of non-redundant here? Does this mean there are no paralogs of the gene? Or that the HGT only occurred in one model organism? Or in only one genome of all of the 824+13 that you investigated?

    8. . The copy number of each HGT was determined from the number of merged HGT copies

      Are all of these long read genomes? If not, will this be an unreliable estimate?

    9. 1000-bp segments with 200-bp

      How did you assess these numbers? In metagenome binning, 1 kb isn't large enough to get confident estimates of tetranucleotide frequency; you often need > 2500 bp.
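
      A quick toy simulation shows the effect: estimate the frequency of one tetramer in 1,000 bp versus 2,500 bp windows of the same random sequence and compare the spread of the estimates.

      ```python
      import random
      from collections import Counter

      def tetra_freqs(seq: str) -> dict:
          counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
          total = sum(counts.values())
          return {kmer: n / total for kmer, n in counts.items()}

      random.seed(0)
      genome = "".join(random.choice("ACGT") for _ in range(1_000_000))

      for window in (1000, 2500):
          # Frequency of one representative tetramer in many windows;
          # shorter windows give noisier estimates of the same value.
          freqs = [tetra_freqs(genome[i:i + window]).get("ACGT", 0.0)
                   for i in range(0, 500_000, window)]
          mean = sum(freqs) / len(freqs)
          var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
          print(window, round(mean, 6), round(var, 10))
      ```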

    10. er “-e 1e-5”

      E-value can change with the size of the database; how did you account for this?
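
      For a worked illustration: BLAST's Karlin-Altschul statistics give E = K*m*n*exp(-lambda*S), which is linear in database size n, so the same alignment that scores E = 1e-5 against one database scores E = 1e-4 against one ten times larger (K and lambda below are placeholder constants, not values for any particular scoring matrix):

      ```python
      import math

      def evalue(score: float, query_len: int, db_len: float,
                 K: float = 0.041, lam: float = 0.267) -> float:
          """Karlin-Altschul E-value with placeholder K and lambda."""
          return K * query_len * db_len * math.exp(-lam * score)

      e_small = evalue(score=80, query_len=300, db_len=1e9)
      e_large = evalue(score=80, query_len=300, db_len=1e10)
      print(e_large / e_small)  # 10.0: E scales linearly with database size
      ```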

    11. 1bp

      Is 1 base pair a large enough overlap to be biologically important?

    12. 4 4 21, 24

      Does this mean the 4 you found are different from the ones found in 21 and 24? If so, how do you account for missing the ones found in 21 and 24?

    13. Euclidean distanc

      Why did you use Euclidean distance?

    1. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere

      It would be super helpful here to point out where these protein sequences come from -- NCBI nr, mmseqs sets, etc.

    2. we identified the ability to cluster this vast protein sequence diversity space as a key factor currently limiting the association of sequences across large sets of divergent species

      Can you add a few more details on how you identified this? why are tools in mmseqs2 not sufficient here? what innovation is needed to overcome whatever barriers exist?

    3. species

      Is each of the 1.8 million genomes from a separate species, or will some be different strains of the same species?