230 Matching Annotations
  1. Mar 2023
    1. A striking potential metabolic complementarity to emerge from our annotations is the capacity of many frequent lichen bacteria to code for cofactors needed by one of the dominant eukaryotic symbionts

      I'm interpreting, up to this point, that functional annotation and pathway exploration were only performed for the bacterial genomes and not the fungal/algal MAGs. Was this because of the difficulty of performing ORF prediction/functional annotation without corresponding RNA-seq data, or is this planned for the future? It would be interesting to see whether the corresponding fungi have transporters for those cofactors.

    2. Four bacterial families dominate lichen metagenomes

      It would be interesting to follow up on core groups such as Lichenihabitans to see whether the same or different strains occur across these samples, and whether the hotspots of diversity differ relative to the lichen type.

    3. Fig. 2.

      It would help provide additional context for these genomes if additional layers showing completeness, redundancy, genome size, etc. were added to the tree, so genome quality is easy to compare across the tree. This can be done with iTOL or EMPRESS as added metadata layers.

  2. Dec 2022
    1. We mapped the presence/absence of merB and hgc genes onto the Tree of Life

      If I interpreted the methods correctly, this tree only includes ribosomal proteins from genomes that have hgcAB, merB, or both. This isn't exactly overlaying hgcAB/merB presence/absence onto the "tree of life", because to accurately portray these relationships you would also want to include genomes that have neither of these operons, for example by overlaying the information you have here onto the genomes in Hug et al. 2016.

    2. Our study reveals an ancient origin for microbial mercury methylation, evolving from LUCA to radiate extensively throughout the tree of life both vertically, albeit with extensive loss, and to a lesser extent horizontally.

      I think that to make a statement like this you would need more extensive analyses, quantitatively calculating gene-transfer rates and using tree-dating methods such as in https://journals.asm.org/doi/10.1128/mBio.00644-17

    3. Figure 3.

      Are these trees rooted with either cdh outgroups or fused hgcAB? I see the symbol for fused hgcAB but in Gionfriddo et al. 2020 fused sequences are usually used to root the tree for accurate topology inference

    4. Nevertheless, several hgcA+ genomes did not carry neighbouring hgcB genes, including all Nitrospina and a few Deltaproteobactiera and Firmicutes, potentially because of gene loss during evolution or incomplete transfer events (i.e., only hgcA genes were acquired during the HGT events).

      I wanted to clarify something from the methods: were just the hgcAB proteins pulled down from UniProt, or the entire genome sequences for these hgcAB+ representatives? If you did have the entire genomes, did you check, for the cases where hgcB was missing, whether hgcA fell close to the end of a contig? I think Peterson et al. 2020 ES&T had a couple of cases where hgcA was at the end of a contig.

    5. A few putative HGT events could be inferred from the larger clade of the HgcA tree e.g., Marinimicrobia-HgcA clustered with Euryarchaeota-HgcA in the archaeal cluster,

      Was the inference made by position in the tree or by analyzing the pairwise sequence identity of the proteins from these Archaea/Marinimicrobia? I am curious because in McDaniel et al. 2020 mSystems we also found only a few potentially clear cases of HGT, but did so through pairwise sequence analysis, for example for a case of Deltaproteobacteria/Acidobacteria/Verrucomicrobia/Actinobacteria in permafrost.
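      As a concrete illustration of the pairwise-identity check I mean, here is a minimal sketch; the aligned sequences and their labels below are made up for illustration, not real HgcA data:

```python
# Toy pairwise percent identity between two aligned protein sequences.
# Sequences are hypothetical placeholders, not actual HgcA homologs.
def percent_identity(a, b):
    """Identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

marinimicrobia_hgca = "MEVNKCT-WCAGC"  # made-up aligned fragment
euryarchaeota_hgca  = "MEVNRCTAWCGGC"  # made-up aligned fragment
print(round(percent_identity(marinimicrobia_hgca, euryarchaeota_hgca), 1))
```

      Unusually high identity between distant lineages, relative to identity within each lineage, is the signal that supports an HGT interpretation beyond tree position alone.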

    6. To investigate the evolutionary history of HgcA, we further enlarged the sample size by retrieving HgcA homologs in UniProt Reference Proteomes database v2022_03 at 75% cutoff (RP75). Two other datasets, including one containing 700 representative prokaryotic proteomes constructed by Moody et al. (2022) and another containing several novel hgc-carriers published by Lin et al. (2021), were retrieved and incorporated into the RP75 dataset. Totally 169 HgcA sequences were collected after removing redundancies

      I might have missed something, but it appears that you included hgcAB sequences that are either in the PF03599 protein family or from the Lin et al. MAGs. Are the HgcA protein sequences from the large curation efforts of, for example, McDaniel et al. 2020, Capo et al. 2022, and Gionfriddo et al. 2020 integrated into this UniProt release? It would seem easier in this case to pull directly from the Capo et al. database, since those are curated sequences with metadata to link back to, unless I'm missing how UniProt accessions incorporate data from MAGs.

    1. Methodological details, additional figures, and tables are provided in Supplemental Materials.

      Looking further in the SI, I think there is some confusion about what genome coverage refers to, as it's also flip-flopped in the main text. Coverage is how many times a position is covered with reads, so 20X coverage means the position is covered by 20 overlapping reads; this is also referred to as depth. The calculation I see in the SI table for "genome coverage", and sometimes referred to throughout the text, is actually breadth: the fraction of the genome covered by at least one read, which falls between 0 and 1. This is described in the inStrain paper: https://www.nature.com/articles/s41587-020-00797-0. I'm not sure if the authors are getting these coverage/breadth calculations from coverM or inStrain, but it's a little confusing which the paper is referring to, and this is an important distinction when using genomes that were assembled outside of the samples in question.
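      To make the distinction concrete, a toy sketch (the per-position depths below are hypothetical; in practice they would come from a tool like samtools, coverM, or inStrain):

```python
# Toy illustration of coverage (depth) vs. breadth for a hypothetical
# 10 bp "genome"; each entry is the read depth at one position.
per_base_depth = [0, 0, 5, 22, 18, 20, 0, 31, 25, 0]

genome_len = len(per_base_depth)
# Coverage/depth: average number of reads overlapping each position ("20X").
coverage = sum(per_base_depth) / genome_len
# Breadth: fraction of positions covered by at least one read (0 to 1).
breadth = sum(1 for d in per_base_depth if d > 0) / genome_len

print(f"coverage = {coverage:.1f}X, breadth = {breadth:.2f}")
```

      A genome can have high coverage but low breadth when reads pile up on a few conserved regions, which is exactly the failure mode to rule out when mapping to externally assembled genomes.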

    2. Brocadia (n=2) and Nitrospira (n=3) MAGs recovered from this study (Table SI-3) were

      I think the table describing these 5 MAGs should be a main table (while keeping the SI table describing the reference genomes), modified to include the GTDB taxonomy, % GC, length in Mbp (or make the units clear), and the number of contigs reported as a whole number. You might also want to include in this table the relative-abundance calculation per sample for each genome.

    3. Supporting information

      I didn't see a section describing data availability for the metagenome or MAGs assembled in this study; will the data be made publicly available in the SRA/GenBank?

    4. Comammox and Nitrosomonas relative abundances were about 0.90 ± 0.8 RPKM and 0.40 ± 0.05 RPKM, respectively (Figure 5C). This differs from our prior work, where comammox and Nitrosomonas relative abundances were 22 ± 6.26 and 21.04 ± 6.17 RPKM, respectively (Figure 5B). Thus, it is very likely that the low abundance of comammox bacteria and Nitrosomonas affected the assembly and binning process, which did not allow for the reconstruction of these genomes even though they are still present in the system.

      I'm confused about which mapping stats for which MAGs you are referring to in making this statement: is it the relative abundance of the MAGs assembled from the prior study that is low, and are you therefore inferring that this is why you couldn't assemble comammox MAGs from this study?

    5. Further, the genome coverage of previously assembled comammox (JAMMSM_CMX_1) and Nitrosomonas (JAMMSM_AOB_1) MAGs were 80.6 ± 9.8 and 72.3 ± 1.0%, respectively

      So I think I'm answering my previous question here: the prior assemblies have coverage of approximately 80X and 70X, and you required that they have at least 50% breadth? I think this could be clarified further by reporting the actual breadth these genomes have when the reads from these samples are mapped back to them. For full-scale WWTPs I've seen reads map back to MAGs retrieved from different samples with breadth as high as the 90%+ range.

    6. Therefore, the relative abundance of all nitrifying groups was calculated from a set of dereplicated MAGs recovered from both studies (Table SI-3).

      I think this could potentially be an inaccurate way to do this if you don't have the coverage and breadth statistics mentioned in a prior comment to make sure these populations are actually "present" in the sample. For example, in Crits-Christoph et al., mapping reads from soil samples to MAGs required at least 50% of the genome to be covered at 5X, i.e., a breadth of 0.5. I can't tell from this statement whether you are requiring 50X coverage or 50% breadth at some specific coverage; because you refer to the 50% as coverage but explain it with the definition of breadth, it's a little confusing.
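      The kind of presence filter I have in mind could be sketched like this (thresholds follow the Crits-Christoph-style criterion mentioned above; the depth values are hypothetical):

```python
# Sketch of a "presence" filter: a population counts as present only if
# at least min_breadth of its genome is covered at >= min_depth reads.
def present(per_base_depth, min_depth=5, min_breadth=0.5):
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth) >= min_breadth

# 6 of 10 hypothetical positions are covered at >=5X, so this passes.
print(present([6, 7, 5, 9, 0, 0, 1, 8, 5, 2]))
```

      Applying a filter like this before computing relative abundance would guard against counting populations that are not actually in the sample.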

    7. The decrease in the Nitrospira abundance could be the reason why several of the previously assembled MAGs could not be assembled in the current study despite the fact that 5 out of 7 of the previously assembled Nitrospira MAGs had 90% of their genomes covered using reads from this study

      Again, I don't think this is the only possible reason. You could try to answer this with a coassembly, even though it will increase complexity and sometimes make things more fragmented; from the coassembly you could then pull out just the putative comammox bins of interest and ignore everything else. The other possibility is that, although you observed low strain diversity in the previous study (which I haven't read), there could be higher strain diversity in these samples, which would also lead to difficulties in assembly.

    8. Nitrospira and Brocadia MAGs represented 6.53 ± 0.34 % and 6.25 ± 1.33% of total reads in the sample

      It might be good to also include statistics on the % of reads mapping back to the entire metagenomic assembly, to give context for how complete your recovery effort was.

    9. but at very low abundances and thus their genomes were not successfully reconstructed.

      I'm not sure this means that the potential comammox bacteria/AOB were at low abundance and that this is why they didn't assemble. It could be that there was higher strain diversity in these samples than in those from which the previous MAGs were assembled, and the contig you aligned with high percent identity is simply highly conserved or has low diversity. You could instead check whether the contig carrying amoA ended up in a low-quality bin, and calculate nucleotide diversity on those contigs to see whether high diversity could be the reason they didn't assemble well.
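      For the nucleotide-diversity check I'm suggesting, a minimal per-site sketch (the allele counts below are made up, not real pileup data; real values would come from a tool like inStrain):

```python
# Per-site nucleotide diversity (pi) from read allele counts at each position.
def site_pi(counts):
    """Expected heterozygosity at one position, with small-sample correction."""
    n = sum(counts.values())
    if n < 2:
        return 0.0
    # 1 minus the sum of squared allele frequencies, scaled by n/(n-1)
    return (n / (n - 1)) * (1 - sum((c / n) ** 2 for c in counts.values()))

# Three hypothetical positions on one contig: monomorphic, 50/50 SNP, rare SNP.
pileup = [{"A": 20}, {"A": 10, "G": 10}, {"C": 19, "T": 1}]
contig_pi = sum(site_pi(p) for p in pileup) / len(pileup)
print(round(contig_pi, 3))
```

      A contig with elevated pi relative to the rest of the assembly would support the strain-diversity explanation for the assembly failure.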

    10. mapping all sample reads

      I think I'm confused by how many samples there are: the DNA-extraction methods above make it seem that six pieces of media are homogenized into one sample and only one sample is sequenced, whereas here there is a reference to mapping reads from multiple samples.

    11. and the past study9

      I'm not familiar with this past study; were its MAGs retrieved from the same WWTP? Was a certain mapping threshold, such as coverage or breadth, used to ensure that a similar population represented by that genome is actually present in the sample? (Breadth = how much of the genome is actually covered. For example, with a breadth of 90% and coverage of 20X, 90% of the genome is covered; but with very high coverage and low breadth, you could be mapping to something highly conserved rather than to that specific population.)

    12. Each phylogenomic tree was constructed using ITOL v2.1.7

      I wasn't aware that iTOL could construct phylogenetic trees; I've only used it as a tree-viewing program. There should be mention of which program was used to construct the tree from the MUSCLE alignment (FastTree or RAxML, for example) and the parameters used for the tree-building program.

    13. Biomass attached to six pieces of media collected from the aeration tank were scrapped using a sterile scalpel and homogenized using a sterile loop.

      I might just be misunderstanding how the apparatus or biofilm is structured, but is it fine to homogenize biomass from six pieces of media in this way? Is it expected that these different pieces should be fairly similar, or would heterogeneity impact downstream analyses?

    14. To compare comammox and anammox ammonia oxidation rates with those reported in literature, abundance adjusted rates (μmol N/mg protein-h) were calculated by dividing the average ammonia consumption rate (mg-N/g TS-h) obtained from aerobic or anaerobic ammonia oxidation batch assays by the portion of total metagenomic reads mapping to comammox or anammox bacteria metagenome assembled genomes (see below) as their approximate contribution to total solids measured and then using the conversion factor 1.9 mg dry weight/mg protein25.

      This adjusted-abundance calculation, based on metagenomic reads mapping back to anammox/comammox MAGs, seems highly dependent on how contiguous your assembly is and on whether the assembled MAGs capture the actual population responsible for this activity. I'm therefore concerned about whether this is the most accurate way to make this rate calculation, and whether there is a better way to do it, either through lineage-specific qPCR primers or an activity-based assay.
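      As I read the quoted sentence, the arithmetic would be roughly the following; this is my reconstruction of the unit conversions, not the authors' code, and the input numbers are hypothetical:

```python
# Sketch of the abundance-adjusted rate calculation as I read the quoted text.
N_G_PER_MOL = 14.007          # molar mass of nitrogen
MG_DW_PER_MG_PROTEIN = 1.9    # conversion factor cited in the text

def adjusted_rate(bulk_rate_mg_n_per_g_ts_h, read_fraction):
    """umol N / mg protein / h, attributing bulk activity to one lineage.

    bulk_rate: ammonia consumption from the batch assay, mg-N / g TS / h
    read_fraction: fraction of metagenomic reads mapping to the lineage's
                   MAGs, used as a proxy for its share of total solids.
    """
    # mg-N per g of lineage dry weight per hour
    lineage_rate = bulk_rate_mg_n_per_g_ts_h / read_fraction
    # g dry weight -> mg protein (1 g DW = 1000 mg DW; 1.9 mg DW per mg protein)
    rate_per_mg_protein = lineage_rate * MG_DW_PER_MG_PROTEIN / 1000.0
    # mg-N -> umol N
    return rate_per_mg_protein * 1000.0 / N_G_PER_MOL

# e.g. a bulk rate of 2.0 mg-N/g TS-h with 5% of reads mapping to the MAGs
print(round(adjusted_rate(2.0, 0.05), 2))
```

      Written out this way, the read_fraction denominator is exactly the quantity that depends on MAG recovery: if the MAGs miss part of the active population, the fraction is underestimated and the adjusted rate is inflated.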

    1. We tested the MuDoGeR pipeline using 598 metagenome libraries

      I was expecting more expanded results on the breakdown of MAG lineage recovery relative to the biome each metagenome came from. Additionally, it might be good to expand on why these metagenomes specifically were chosen: was it because they had a certain depth of sequencing, or because they came from certain biomes of interest? It might be good to selectively choose metagenomes in which you would expect eukaryotes to be in high abundance, such as certain fermented foods, for comparison to these other environments.

    2. Biodiversity analysis with MuDoGeR

      Is there final dereplication and checking of contigs between the different lineages to make sure the same contig didn't end up in multiple bins of different lineages?

    3. MuDoGeR is open-source software available

      I appreciate the very extensive documentation and examples for running the pipeline. I think the documentation would be better structured as a docs site such as Read the Docs or MkDocs, since this is such a long and extensive README. Oftentimes when scrolling through, the page freezes for me because there are several graphics, and it's a long README without a table of contents to guide the user.

    4. MuDoGeR v1.0 at a glance

      One thing I am unclear about is how the pipeline or individual modules handle a single sample failing during a run: will it halt the entire pipeline or module? For example, if the RAM calculation ends up being incorrect and the assembly program runs out of memory for a single sample, will this cause the pipeline to end? Is there some --resume functionality so you don't have to restart a pipeline from the beginning if there is a problem halfway through a module?

    5. on paired-end short-sequence reads generated by ILLUMINA machines, but future updates will include tools to work with data from long-read sequencing.

      Related to my earlier comment, adding support for long reads will be much easier if the underlying infrastructure is a workflow manager such as Snakemake or Nextflow. Additionally, even though there is an initial learning curve for these tools, communities such as nf-core already have many pre-made, community-sourced modules to implement into workflows (https://nf-co.re/modules), which would cut down on the time it takes to add new features to the pipeline.

    6. t was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files

      I think that certain infrastructure improvements could be made to make this more user-friendly, stable, and adhere to best software engineering practices through implementing tests, better versioning of individual software packages that are included, etc. These are a few problems and potential solutions I see:

      1) This pipeline requires managing very large conda environments, which can get out of hand very quickly, in addition to potential difficulties with installation and solving environments. If the authors would like to stay with conda environments, a quick fix for slow environment solving and installation would be to use mamba to put these environments together.

      2) Since the pipeline is written as a series of bash/R/Python scripts depending on conda environments, it is somewhat fragile and hard to guarantee working on most infrastructures, or even on the intended infrastructure. Even if the installation process is made smoother, there is still the problem of verifying which versions of tools were used in the pipeline. There is a way to export conda environments and versions, but it's not a perfect solution. I think an involved pipeline like this would greatly benefit from being executed with a workflow manager such as Snakemake or Nextflow, my personal preference being Nextflow. Although Snakemake is easier to learn and can use conda environments more easily, it is difficult to ensure those pipelines work on diverse platforms. Nextflow can also use conda environments, but the preference is for Docker or Singularity images, which solves some of the issues with keeping track of versions. Additionally, Nextflow has testing and CI capability built in, making it easier to ensure that future updates still function and work as expected. Finally, Nextflow has been tested on various platforms, from HPC schedulers and local environments to cloud providers.

      3) Related to the issue above, I don't see how this pipeline can be run in a high-throughput way, because it isn't written as a DAG like Snakemake/Nextflow pipelines are. My understanding is that you would have to run all of the samples together in more of a "for loop" fashion, which doesn't take advantage of HPC or cloud resources one might have. The only way somebody could use this in the cloud is with a single EC2 instance, which isn't very cost- or time-efficient. Making the pipeline truly high-throughput, so samples can be run in parallel for certain tasks and then aggregated together, requires DAG infrastructure.

    7. MuDoGeR was divided into five modules

      I really appreciate that the pipeline was split into different modules, encouraging the user to manually check their data and outputs at various steps, and that you can run it from various points instead of running the entire thing.