230 Matching Annotations
  1. Last 7 days
    1. There has been growing caution around biological foundation models due to potential biosecurity threats such as generating novel pathogenic viruses or guiding gain-of-function viral mutations

      It might be good to mention this rationale/justification and thought process earlier in the preprint so people understand why the code isn't available right now.

    2. Figure 6.

      Not sure if done on purpose, but this figure shows up really light and it's difficult to see some details

    3. Abstract

      When the code is available, it would be good to link it in the abstract since this is a cool tool people will want to use!

  2. Jul 2024
    1. Advanced statistical R algorithms are invoked through a dedicated R installation

      From the figure below, it looks like these statistical tests are readily available, or would be easy to write functions for, in Python. Purely for installation and maintenance reasons, it's difficult to maintain code that depends on two different languages.
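
      As a minimal sketch of the point (toy data; assuming scipy and R are both installed), the same rank-sum test is a one-liner in either language, so the pure-Python route would avoid the second-language dependency entirely:

        # Wilcoxon rank-sum / Mann-Whitney U test: Python (scipy) vs. R (stats)
        python -c "from scipy.stats import mannwhitneyu; print(mannwhitneyu([1.1, 2.3, 3.0], [4.2, 5.1, 6.7]))"
        Rscript -e "print(wilcox.test(c(1.1, 2.3, 3.0), c(4.2, 5.1, 6.7)))"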

    2. It provides access to established functions written in both Python and R for statistical testing and data transformation.

      From the way this preprint is written, it sounds like this piece of software is a Python package, but you need both Python and R installed for it to work? It would probably be best either to write the entire package in one language, or to keep the Python and R pieces separate so each can be installed with pip or from CRAN/devtools depending on the language. Otherwise, keeping up with dependencies for both languages down the line could be difficult.

    1. Code and data will be made available with the publication of this manuscript.

      It would be extremely helpful to provide this at the time of preprinting, especially considering the volume of analysis done in this preprint.

    2. Specifically with a method termed SCoPE-MS, where sample multiplexing via isobaric tags enables the combined measurement of 14-16 single-cells in a single mass spectrometry (MS) run.

      For readers that may not be familiar with the scRNA field, how does this compare to how many single cells can be measured in a single run for that method?

  3. Jun 2024
    1. Accession numbers for all genomes can be found in Supplemental Table 1.

      I know this information is relatively easy to get with the accession numbers, but it would be nice if this supplementary table also included, alongside the accessions, the species names and genome/proteome quality stats for easy access.

    1. Drosophila melanogaster 53 and D. erecta 54, 55, with 0.07 Dashing 45 similarity score and 0.08 Mash-distance 46.

      Can you add a sentence putting these scores into context - for example, what are the scores for human vs. chimpanzee?
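
      For instance (a hypothetical sketch - the genome FASTA paths are placeholders, and Mash sketches the inputs on the fly with default parameters), a human-chimpanzee reference point could be generated with:

        # column 3 of the output is the Mash distance, comparable to the 0.08 quoted above
        mash dist human_GRCh38.fna.gz chimp_panTro6.fna.gz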

    2. We measured genomic distance using Dashing 45 and Mash 46.

      Ah ok this type of information is what I was looking for earlier.

    3. To demonstrate LiftOn’s effectiveness at mapping annotation between distinct but closely related species, we mapped human genes onto Pan troglodytes (chimpanzee). Finally, we illustrate that LiftOn works on more distantly related species by mapping annotation from Drosophila melanogaster to Drosophila erecta and from Mus musculus to Rattus norvegicus.

      Reading this, I wonder if you have guidelines for how distantly related species can be for this method to still work? For outside readers, it might be useful to provide some metric of phylogenetic distance or DNA similarity within which this is expected to work well.

  4. May 2024
    1. Figure 2.

      Part of my confusion with this figure might be that the x-axis for time in hours is different in panel A, and you have to look closely at the x-axes to see the differences when polyphosphate is added.

    2. Mutational studies allowed us to validate K43 and K45 as the major polyP interaction sites in α-Syn as replacing them with alanine residues was sufficient to abolish polyP binding, and, as a direct consequence, prevented polyP to i) accelerate fibril formation, ii) stabilize α-Syn fibrils, iii) alter fibril morphology, and iv) mitigate α- Syn cytotoxicity.

      I think this is a great summary of the results; however, I found myself having to look back through the results several times to build the model in my head. Maybe expand this into multiple sentences explaining the effects of the mutations along with the presence/absence of polyphosphate, so the model is easier to understand.

    3. It is noteworthy to point out that the two critical lysine residues K43 and K45 as well as H50, whose side-chains constitute the positive cluster associated with the mystery density binding in α-Syn, are part of a local hotspot known to harbor several mutations that elicit early onset familial Parkinson’s Disease.

      Ah, fascinating! This gets at my earlier question

    4. While the mutation of H50 showed little effect on the t1/2 (∼ 40 h), substitution of either K43 or K45 with alanine reduced the t1/2 by about 2-fold (t1/2 ∼ 20h). When present in combination (i.e., α-SynK43A,K45A), we observed an even more drastic acceleration in fibril formation as reflected in a t1/2 of 8.3 hours. These results provided first evidence that this cluster of positively charged amino acids in wild-type α-Syn contributes to the slow rate of in vitro fibril formation.

      This is a fascinating result! I am wondering if you could look in datasets like the UK Biobank to see whether there are variant differences connected to these diseases - for example, whether there are populations with or without these lysine residues - and relate that to risk for neurodegenerative disease.

    5. lysine residues, suggesting that these residues are crucial for polyP binding.

      I might have missed this, or don't understand the conformational change between the monomer and the polymorph, but how are these residues critical for polyphosphate binding in the polymorph when this binding doesn't occur in the monomer form?

    6. These results were consistent with our experimental data, which failed to show any significant interaction between polyP and purified α-Syn monomers

      this is fascinating

    7. Introduction

      This is a really well written introduction, and I was able to understand most of it without having a deep molecular background myself.

  5. Apr 2024
    1. All data is also available on the project Github at https://github.com/mims-harvard/SPECTRA and on Harvard Dataverse at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN.

      It's great that all your data is available! It would be helpful to provide a LICENSE in the repo so others know the terms of reuse, plus improved documentation on exactly how to use SPECTRA for different cases - for example, some of the rationale for the SP decisions is here in the discussion and could inform examples in the repo as well.

    2. We define a spectral property (SP) as a MSP expected to affect model generalizability for a specific task (e.g. 3D protein structure for protein binding prediction). The definition of the spectral property is task-specific and, together with the molecular sequence dataset and model, are the only inputs to SPECTRA

      I think this should be earlier in the introduction

    3. Main

      Overall this is a really well written introduction that can be understood by a general audience! I learned a lot and also looking forward to digging into some of the cited references.

    4. a spectral property definition

      I think I'm confused on what this is supposed to be even after having finished reading this paragraph

    5. generating a spectral performance curve (SPC). We propose the area under this curve (AUSPC)

      It's pretty early in the paper, and it's already acronym-heavy. Some of these terms, like spectral performance curve and area under the curve, might not need abbreviations, since the reader has to think back to what they stand for each time, and there are already MB and SB.

    6. metadata-based (MB) or similarity-based (SB)

      Just a small note - in the abstract SB is referred to as "sequence-similarity based" and here just similarity based, would be good to be consistent

  6. Mar 2024
    1. Data and Resource Availability

      This is great that the plasmids are available through Addgene and the scripts/data on Github! I wonder if there is a general protocol that could be posted on Protocols.io for example to help make this method more approachable for others to try?

    2. The design framework presented here enables a scalable method for establishing key initial footholds in genetic tractability in non-model microbes, and is extensible for a growing library of genetic parts and for complex hierarchical assemblies. Further development and broad application of this pan-microbial toolkit will accelerate our ability to study and engineer diverse microbes.

      It might also be nice to discuss how extensible this toolkit currently is to non-model bacteria in phyla other than Pseudomonadota - or even to other species within this phylum that weren't tested. Giving some rationale for why these species were chosen to begin with would be useful, as well as how adaptable/usable the given tools are for diverse bacteria.

    3. 6 non-model microbes in pooled screens.

      It might be nice to name the strains used here in the introduction so the reader has context earlier on for how taxonomically broadly the toolkit works. Context for why these 6 strains were chosen for developing this toolkit would also be helpful - are they of biotechnological significance and in high demand for better tools?

  7. Feb 2024
    1. The Uniprot IDs of the proteins and the predicted binding sites by PointSite have been deposited at https://github.com/molu851-luo/Reverse-docking-benchmark.

      It's great that this has been made available! It would be awesome if you could add documentation in the github repo about how others might be able to use this resource either using the workflows/pipelines you describe in the paper or with other modified pipelines, since it isn't really clear from the outset how to use this data for different purposes.

    2. those lacking well-defined ligand binding pockets, were removed

      how was this determined?

    3. Introduction

      This was a great overall summary of the approaches in the field and current challenges/caveats!

    1. We’ve developed a user-friendly Jupyter Notebook, accessible via Google Colaboratory, designed for training customized prediction models using protein sequences and experimental data provided by users in FASTA format.

      I don't see a link to where this is available?

    2. To predict the kinetic parameter kcat/KM for enzymes, we followed the variant prediction workflow proposed in the ESM repository (“examples/sup_variant_prediction.ipynb”).

      I think even if you followed a publicly available notebook, it would be great to have your own code publicly available as well

    3. Our work demonstrated that homology search combined with pLLMs can detect and prioritize highly catalytically active therapeutic enzymes even when only little labelled training data is available

      I really enjoyed the brevity of how the work was communicated! I also enjoyed how this is a case of taking computational predictions, narrowing them down, and making experimental validations/comparisons. I am wondering if your searches could be enhanced through structure-based comparisons/clusters as well as sequence-based? I think most of the hits were bacterial, but I could imagine there are probably other organisms or at least distantly related bacteria that could have enhanced KYNase properties as well.

    4. As a general trend, our predictor predicts higher kcat/KM for sequences from bacteria than from eukaryotes.

      This is interesting and I think readers might want to know more about this observation and the implications - is this expanded upon somewhere else or in the Methods?

    5. We implemented an easy-to-use webserver (see Methods and Data availability) for researchers to conduct similar analyses using their own measures.

      I think highlighting the webserver directly in the text and not just in the Methods section, and maybe also mentioning this development in the abstract would be really useful to readers and bring more visibility to this useful tool!

    6. Trained on an 80/20 split of 159 experimentally measured sequences

      I may have missed something, but is this based on experimental measures that were performed as part of this study or were they already available?

    7. w

      Small comment, but a typo here

  8. Jan 2024
    1. Poviding

      typo

    2. include additional bioinformatics tasks, in order to obtain a more comprehensive understanding of the strengths and weaknesses of LLMs in this field.

      It would be great to do an analysis of how the different models handle conversions between common bioinformatics file formats - whether they can work from a one-sentence request to convert between formats, or whether you have to spell out exactly what the file should look like. This seems like a common bioinformatics task that one inevitably has to deal with, and one of the more tedious ones.
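
      As a concrete instance of the kind of conversion task I mean (assuming seqkit as one standard tool; file names are placeholders), this is what an LLM would ideally produce from a one-sentence request:

        # FASTQ -> FASTA, dropping the quality scores
        seqkit fq2fa reads.fastq.gz -o reads.fasta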

    3. When GPT-4 received feedback that its response was incorrect, it exhibited the tendency to modify its subsequent response, even for initially correct answers. This behavior could potentially be problematic for users without comprehensive domain knowledge.

      Ah cool this is what I was wondering about! How often for this paper was each model provided the feedback that it was wrong?

    4. For this challenge, we provided the top 10 most cited bioinformatics papers to the 3 LLMs, and asked them to generate a summary.

      Since the top 10 most cited bioinformatics papers would probably have quite a few summaries either in subsequent papers or news/perspective articles that these LLMs could have been trained on, could you also include newer bioinformatics papers to see how well each model attempts to summarize them?

    5. with 10 runs of asking the model the same question in the same search window, and 10 runs using a new search window

      Only an anecdotal note, but since these LLM chatbots like ChatGPT seem to improve over the course of a conversation - when a model couldn't get a particular question correct even 10 times in a row, did you evaluate pointing out the problem it was having and clarifying the question? I know this might muddy the analysis and benchmark, but it could be an interesting analysis to provide: which models improve the most when prompted a little, versus those that still don't arrive at the correct answer.

  9. Dec 2023
    1. In addition, Ca. Dechloromonas phosporitropha were lack of pst, phoU, phoB and phoR genes in the Pho regulon, which is consistent with our hypothesis that the Pho regulation may not work properly in PAOs.

      Interesting that Dechloromonas is missing this

    2. their encoding proteins (i.e., PhoU, PhoU homologue and PPK2) may have incompatible phosphate activation/inactivation thresholds.

      This is an interesting hypothesis - are you able to maybe follow up with structural analyses looking at the Alphafold predictions for these proteins and docking simulations? Or compare to PPK2 in other bacteria known to use PPK1 and PPK2 for polyphosphate accumulation (such as P. aeruginosa) and see if the model holds there as well perhaps?

    3. In addition, three distant phoU homologs (NOF05_17860, NOF05_09930, NOF05_09935) were found in Ca. Accumulibacter genomes which are also horizontally acquired core genes. Distant homologs are pairs of proteins which have similar structures and functions but low gene sequence similarity (Monzon et al., 2022). The homolog phoU

      This is interesting, but a little confusing here. Is the main phoU copy near pit, or is one of the distant homologs near it? If not, where are the distant homologs and how were they confirmed to be structural homologs, or were they annotated that way by KEGG?

    4. construct

      constructed

    5. including the acetate kinase gene. These 42 gene families may not play a key role in the evolution of non-PAO to PAO due to their different transcription behaviors in SCUT-2 and UW1.

      Perhaps - though I think there are also demonstrated differences in acetate uptake kinetic rates between the clades/species.

    6. Figure 6.

      Panel C of this figure is a little unintuitive since the columns are ordered by the clustering of gene expression patterns and not ordered by time point, whereas panel B is ordered by timepoint

    7. Cluster 2 showed a pattern of increased transcription throughout the anaerobic period, peaking after oxygen exposure. The phosphate transport system substrate binding protein (pstS, NOF05_04305) and the laterally derived polyphosphate kinase 2 gene (ppk2, NOF05_17285) showed Cluster 2 transcription pattern.

      This is interesting, and maybe the opposite of what I would expect. Since P is released during the anaerobic period, and PHA consumed in the aerobic period is used to form polyP, I would expect ppk2 and the transporter to be highest in the aerobic period. Although I think this could relate to our results here: https://www.nature.com/articles/s43705-022-00189-2, where we found differentiated expression patterns in PstSABC, highest at either the beginning or the end of the aerobic period - so for expression to be high at the beginning of the aerobic period, it would have to increase during the anaerobic period.

    8. A further analysis of another 21 available Propionivibrio genomes further confirmed that ppk2 and phoU are differential genes between Ca. Accumulibacter and Propionivibrio.

      Ah I see you compared to other Propionivibrio genomes here

    9. The pan PAO genome was compared to the Ca. Propionivibrio aalborgensis (a closely related GAO, Albertsen et al., 2016) genome to identify differential genes (defined as core genes present in the pan PAO genome but absent in the Ca. Propionivibrio aalborgensis genome).

      Is this the only closely related non-PAO used for comparison? I think this is an HQ genome, but there could be problems with making PAO-specific inferences from a comparison to just one non-PAO.

    10. a Pho dysregulation hypothesis is proposed to explain the mechanism of EBPR. It states that the PhoU acquired by HGT fails in regulating the high-affinity phosphate transport (Pst) system. To avoid phosphate poisoning, the laterally acquired PPK2 is employed to condense excess phosphate into polyphosphate.

      This is interesting! Excited to read more and dive into this hypothesis! My gut reaction is to ask whether you have looked at other model organisms for polyphosphate accumulation, such as E. coli, Pseudomonas aeruginosa, and Neisseria gonorrhoeae, to see if they fit this model. I could imagine P. aeruginosa might, since PPK2 is known to enhance polyP formation in the virulence and biofilm formation of this pathogen.

    1. The datasets can also be

      Right now I think only the metagenomes are available in the SRA, and I don't see the genomes in GenBank... it would be great to upload those there as well if possible!

    2. The program coverM (v0.6.1) (https://github.com/wwood/CoverM) was used to obtain the relative abundance of reads mapped onto each MAG with the “coverm genome” command.

      Were BAM files passed to coverM, or was minimap2 used to map the reads? I would also test how complete your assemblies are by mapping the reads back to the assemblies (PacBio or Illumina only) and reporting those stats. If that's somewhere and I missed it, sorry!
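
      For reference, these are the two invocation modes I'm asking about (a sketch with placeholder file names; flags as I understand them from the CoverM docs):

        # mapping handled internally (minimap2 by default) from raw reads
        coverm genome --coupled sample_R1.fq.gz sample_R2.fq.gz --genome-fasta-directory mags/ -m relative_abundance
        # or reusing pre-made alignments
        coverm genome --bam-files sample.bam --genome-fasta-directory mags/ -m relative_abundance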

    3. The shortened PacBio reads and the cleaned short-reads from Illumina libraries were competitively mapped

      Hmm, interesting - I might have instead done this mapping with only PacBio (fragmented) or only Illumina, since there could be "identical" reads in the two sets, which I think would be counted as double the abundance.

    4. Geneious Prime version 2022.0.2 (https://www.geneious.com) was used for the extraction of rRNA gene sequences,

      So this was used to ID and pull out the rRNA operons? Any way to explain further how this works?

    5. Phylogenetic analyses and relative abundance

      I might have missed this - what were the methods for assessing quality (CheckM?) and identifying ribosomal genes? I'm guessing Prokka/infernal?

    6. All assembled contigs generated from both PacBio and Illuminia sequencing technology were binned using metaBAT2 (45).

      Also, just to keep in mind if someone wants to follow up: there are now binning tools specific to long-read assemblies, and sometimes even manual binning with mmgenome2 can work if the assemblies aren't too fragmented.

    7. assembled using metaSPAdes (v3.15.2) (44), and mapped to the final assembly using BBMap (v38.86)

      I'm guessing this was a coassembly of the two Illumina samples?

    8. and assembled using metaFlye (v2.9) (40). PacBio assemblies were polished with racon (v1.4.13)

      I think with this generation of PacBio reads, even metagenomic ones, the recommended pipeline would be assembly with hifiasm (for which I believe there's a metagenomic version as well), then using those contigs for downstream processes. HiFi reads are considered high enough quality not to need these polishing steps, and polishing can actually introduce indels accidentally, especially in repeat-rich areas like rRNAs - see here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009802, although that was mostly applied to Nanopore.
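
      A rough sketch of the HiFi-native route I mean (tool and output names as I recall them from the hifiasm-meta repo; thread count and paths are placeholders):

        hifiasm_meta -t 32 -o asm hifi_reads.fastq.gz
        # primary contigs are emitted as GFA; a common conversion to FASTA:
        awk '/^S/{print ">"$2"\n"$3}' asm.p_ctg.gfa > asm.p_ctg.fasta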

    9. PacBio reads were filtered using BBtools (v38.87/38.88)

      I'm not sure I've seen filtering steps applied to PacBio reads prior to assembly - which program in BBTools was used, and why? For filtering out short reads?
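
      If the goal was dropping short reads, one BBTools route would be something like this (a guess at the intended step; the length cutoff is a placeholder):

        reformat.sh in=pacbio_reads.fq.gz out=filtered_reads.fq.gz minlength=1000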

    10. For long-read sequencing, libraries were prepared by shearing genomic DNA to either 3 kb or 6-10 kb,

      Ah ok I see the answer here to my question above, it was a little confusing before

    11. that “Ca. A. necessarius” and “Ca. A. propinquus” accounted for greater than 40% of the “Ca. Accumulibacter” assemblage,

      I know you're focused on Accumulibacter, but I'm curious: 1) how much of the total relative abundance does Accumulibacter make up? Which leads to 2) were there other lineages that were quite abundant besides Accumulibacter? Since you have good HQ Accumulibacter MAGs from PacBio, I would expect that with a little more effort you would get some good flanking genomes, and most good-quality flanking genomes come from bioreactors or the Danish WWTP study, so this could be a good resource for the community.

    12. high quality (HQ, greater than 90% completeness, less than 5% contamination) according to MiMAG standards (28). In addition, they have two fully assembled copies of the rRNA operon, which facilitates additional analysis of this novel cluster and the proposal for a new species epithet (see below).

      I think HQ by MiMAG standards is above 90% completeness, below 5% redundancy, presence of all 3 rRNAs, and at least 18 (I might be wrong on this number) tRNAs. By "two fully assembled copies of the rRNA operon," do you mean there are two of each rRNA gene? Are they fragmented at all (just curious)?

    13. That is, UW14 belonged to “Ca. A. meliphilus,” UW15 to “Ca. A. delftensis,” UW16 and UW24 to “Ca. A. propinquus,” UW17 to “Ca. A. contiguus,” and UW19, UW28, and UW29 to “Ca. A. necessarius” (Table 2).

      Interesting that the genomes from these pilot-scale plants aren't from the species we find enriched in bioreactors, if I remember the species names correctly - the IA and IIA genomes are usually what pops up in bioreactors, but they aren't abundant here.

    14. Among the MAGs assembled from these metagenomes were 16 MAGs taxonomically classified, according to the GTDB-Tk Lineage classification, as belonging to the Accumulibacter lineage.

      This is also worded in a slightly confusing way.

    15. 15 6-10kb PacBio, 7 3kb PacBio

      The way this is written is a little confusing - are there 22 long-read metagenomes in total, 15 with a 6-10 kb read-length range and the other 7 with 3 kb?

    16. Under these cyclic anaerobic-aerobic conditions, net P removal from the bulk liquid is achieved

      This sentence almost seems to fit better at the end of the previous paragraph; perhaps the second sentence of this paragraph is your topic sentence?

    17. species for which we propose the new species designation “Ca. Accumulibacter jenkinsii”

      This is so exciting, and what a great way to honor Jenkins!

  10. Nov 2023
    1. Collectively, our analysis supports the view that some GAO species harbour unusual structural variants of the glgB enzyme:

      Could you use a tool like Foldseek to compare the structural similarity of these proteins further (https://search.foldseek.com/search)? Or take these input proteins and either fetch the PDB files or fold them with AlphaFold/ESMFold, and cast a wider net of comparisons beyond AS organisms? The ProteinCartography workflow could help with these additional analyses and insights: https://research.arcadiascience.com/pub/resource-protein-cartography/release/6 (GitHub repo: https://github.com/Arcadia-Science/ProteinCartography).
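
      As a sketch of the wider net I mean (database identifier as listed by the foldseek databases command; directory names are placeholders):

        # download a pre-built structure database, then search all glgB models against it
        foldseek databases Alphafold/Swiss-Prot afdb_swissprot tmp/
        foldseek easy-search glgB_models/ afdb_swissprot hits.m8 tmp/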

    2. connected in a clique

      Maybe cluster is the better term here?

    3. Figure 1:

      Not sure if you can modify the preprint submission to rotate this figure?

    4. Or in other words, is the GAO phenotype largely driven by superior metabolic capacity of the component proteins in glycogen storage pathways (the structural hypothesis) or is it a consequence of optimal regulation of those pathways in the specific environments where the phenotype is observed? (the regulatory hypothesis).

      This is also a really interesting way to frame this question - my gut instinct says it's probably some combination of both the genetic/structural variation and regulatory mechanisms. Creating approaches/frameworks to combine all this information together for traits of interest will be really crucial!!

    5. a logical question to ask is whether the glycogen storage machinery of GAO species exhibit unusual or enhanced metabolic properties compared to those found in non-GAO species?

      This is such a fascinating question, and very similar to the question we posed related to polyphosphate accumulation: https://research.arcadiascience.com/pub/result-ppk1-homology/release/1. It's exciting to see other groups thinking about protein structural similarity and traits in this way!

    6. McDaniel and colleagues

      Very small note - Joris van Steenbrugge and I were co-first authors so the way I refer to this is "McDaniel and van Steenbrugge and colleagues..." to give the correct credit here, even if it makes the sentence a bit longer

    7. statistical properties

      Describing this as "statistical properties" is possibly confusing. I think what you mean is a comparative approach connecting protein features to trait information, but there may be a better way of explaining this.

  11. Oct 2023
    1. We utilized our high-quality in-house peptide dataset

      Is this dataset publicly available through the supplement or Zenodo, for example? It's probably a really useful resource to others and important for the reproducibility of your model. If you put it on Zenodo, you can attach a DOI to the dataset so others can cite it if they use it for other purposes.

  12. Sep 2023
    1. and primers specific to the 16S (V3-V4) region

      What primers? Overall, this section is missing a lot of methodology, details, and references for how the library prep, sequencing, and analysis were carried out. I don't see any supplement attached to the preprint, so I assume all the methods and details should be in the main text, unless I missed something.

    2. All data, code, and materials used in the analysis will be available to any researcher for purposes of reproducing or extending the analysis.

      I think at a bare minimum the code and raw data files should be uploaded somewhere publicly accessible, given the likely high interest in this piece of work. I can only speak for microbial data since that is my background, but 16S amplicons should be uploaded to the NCBI SRA, and the commands or code used for the 16S analysis can easily be put on GitHub.

    3. Taxonomic Units for analyses including diversity, taxonomy, and differential analyses

      What database was used to assign taxonomy to reads? I also don't see methods for the results claim that sequences from the intestine vs. brain were 100% matched. Were they matched by just taxonomy, or by OTU grouping?

    4. Divisive Amplicon Denoising Algorithm

      This sounds like either QIIME or DADA2 was used?

    5. Sequence analyses were performed using the NovaSeq platform

      I don't think you analyzed the sequences using NovaSeq, since that's just a type of Illumina sequencing instrument. Did you use a program like QIIME, Mothur, or DADA2, for example?

    6. We found that the culture can detect as low as 5 CFU bacteria.

      Were similar limits of detection calculated for the other organs? Also, there is a range in the discussion listed as detecting 1-300 CFUs - is this possible given this limit of detection?

    7. Further, we observed these phenotypes in young mice (∼8-15 weeks old), long before the characteristic disease-related changes that occur in some of the mouse models employed here. Therefore, the data suggest that commensal bacterial translocation to the brain is an early event and could even be an initiating trigger for microglial changes associated with neuroinflammation and neural protein aggregate formation, leading to certain neurodegenerative and neurodevelopmental diseases

      Could this instead be explained by the intestinal barrier not being as strong in young mice? I guess that isn't the case, since the same level of permeability wasn't seen in the wild-type background. What might happen in slightly older mice with induced intestinal permeability? This also relates to my question about how quickly things seem to progress from intestinal permeability to bacterial translocation/neuroinflammation. I wonder how quickly this leads to the observed disease phenotype?

    8. that is distinct from an acute, fulminant brain infection.

      Could you provide the reader with context on the approximate CFU levels that are associated with brain infections?

    9. Further, the bacteria detected in the brain and the vagus nerve in these strains were 100% matched with those that were detected in the feces and ileum.

      I think this answers my question above, but how were they 100% matched? By 16S rRNA gene sequence identity, since I only see a description of 16S sequencing? They could be the same at species/genus resolution, but since the 16S rRNA gene has poor resolution (sometimes even at the genus level), there could be a lot of species- or strain-level differences.

    10. We subsequently reversed their diet back to normal rodent chow and tested their phenotypes at days 14

      Throughout the results, I'm surprised at the short timelines over which bacteria either translocate to the brain or effects are reversed. Do you think that perhaps the deterioration of the gut lining leading to gut leakiness is a slow process, but the subsequent translocation and neuroinflammatory effects are quite quick? I know this isn't a direct hypothesis in this paper, but it's still surprising to me that these effects are seen so quickly once gut leakiness is established.

    11. Taken together, these data show that bacteria can translocate to the brain of multiple genetic types of mice including wild-type mice, that numerous types of bacteria can translocate to the brain even simultaneously, and that in all cases studied thus far, the bacteria detected in the brain are also found in the intestine.

      Very interesting! Same question as above - could you perform shotgun sequencing of the bacteria retrieved from either the intestine/ileum vs the brain and compare if they are the same populations for these different bacterial species that are translocating to the brain, not just S. xylosus?

    12. Figure 2.

      Similar to my small comment above about colors, it would be great if the colors for control and Paigen diet were consistent throughout the figures to help the reader, and if color-blind-friendly colors were used.

    13. the composition of the microbiome (i.e., diet and antibiotic use) analogously changes the bacteria that localize to the brain

      Very interesting!

    14. It was unclear how S. xylosus translocated to the brain and why this localization was specific and not also observed in systemic organs

      It's interesting that there are more CFUs of S. xylosus in the feces, ileum, vagus nerve, and brain of Paigen-fed mice vs. controls. Was whole-genome sequencing done to compare the S. xylosus populations in feces vs. the brain, to see if strain differences contribute to success in brain translocation?

    15. Figure 1.

      A small nitpick, but the contrasting colors of green and red won't be color-blind friendly

    1. The skin abscess infection

      Do you have any insights into whether the same anti-infective properties would be seen in a different disease model? For example, an infection that is not on the skin (which is easy to access) but internal, so that you would have to orally administer the AMP to the mouse?

    2. Five leadAMPs from different sources

      How were the AMPs chosen for screening? You discovered thousands of AMPs - how were specific ones prioritized for testing in the mouse model? Based on the in vitro assays, or ease of synthesis?

    3. All the c_AMPs predicted here can be accessed at https://ampsphere.big-data-biology.org/. Users can retrieve the peptide sequences, ORFs, and predicted biochemical properties of each c_AMP (e.g., molecular weight, isoelectric point, and charge at pH 7.0). We also provide the distribution across geographical regions, habitats, and microbial species for each c_AMP.

      Awesome resource!

    4. The large number of singletons suggests that most c_AMPs originated from processes other than diversification within families, which is the opposite of the supposed origin of full-length proteins, in which singleton families are rare

      This is an interesting observation and implication!

    5. To further assess the gene predictions,

      It would maybe help the reader if these were summarized in the results and not just referenced in the corresponding methods.

    6. Analogously to Sberro et al.36, we used a modified

      Again, since this is a pivotal part of your analysis, how it works should be explained in more detail, not just referenced.

    7. was inferred as previously described96,

      I think this is an important detail; it shouldn't just be mentioned as "previously described" but explained in more detail.

    8. ProGenomes2 database

      From looking at this paper, it looks like the current release of this database only includes high-quality genomes from isolates, not MAGs? This is possibly a limitation, since you screened metagenomes but not MAGs, and you can now easily find compendia of MAGs curated from GTDB/IMG.

    9. with Illumina instrument

      Why only Illumina metagenomes? Because of the error rates associated with metagenomes produced with Nanopore, for example? I don't think you would have this issue with PacBio HiFi datasets, though I'm also unsure how many of those datasets existed in early 2020.

    10. Accession numbers are listed in the supplementary tables

      Which supplementary table? There are many and this would be useful to the reader

    11. predict and catalog the entire global microbiome

      This sentence seems incomplete - to predict and catalog AMPs in global microbiomes? The abstract also focuses on animal-associated microbiomes; did you focus on those, or include environmental microbiomes as well?

    12. Recently, proteome mining approaches have been developed toidentify antimicrobials in extinct organisms

      This sentence feels a little abrupt given the previous one, especially since it isn't expanded upon - what is the significance of AMPs in ancient organisms?

  13. Aug 2023
    1. We compared MetaCerberus to DRAM, InterProScan, and PROKKA for the time used per genome, RAM utilization, and disk space used across 100 randomly selected bacterial genomes within GTDB

      Were more complex metagenomes tested for performance, such as inputting raw reads? It would be good to give the user an expectation of resources/time for raw reads from communities of varying complexity and read depth.

    2. map-based heatmaps

      It would be great if one of the example heatmaps was shown here in the manuscript for demonstration, or in a longer tutorial in the wiki of the GitHub repo, for example. Although I'm not sure whether this is output as part of the HTML dashboard?

    3. A sample dashboard visualization

      I'm assuming this output is an interactive HTML since the supplementary figure looks like a screenshot?

    4. PacBio, fastp

      Reason for trimming PacBio data? Usually PacBio Hifi data is high-quality enough that this isn't necessary

    5. Porechop

      Any plans to use Porechop_ABI instead (https://github.com/bonsai-team/Porechop_ABI)? Porechop is no longer actively maintained.
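
      For what it's worth, it looks like a near drop-in replacement (invocation per the Porechop_ABI README as I recall it; worth double-checking the flag):

        porechop_abi --ab_initio -i nanopore_reads.fastq.gz -o trimmed_reads.fastq.gz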

    6. Databases for MetaCerberus

      From scanning the documentation in the GitHub README, it's quite far down before you learn the databases are on OSF, and there aren't instructions on whether the databases need to be placed in a specific folder or pointed to when running the command. Does this happen with the --setup command run after installing?

    7. mamba create -n metacerberus -c bioconda -c conda-forge metacerberus

      Awesome having this right at the beginning. I opened a GitHub issue, but wanted to point out that I hit an error trying to install with this command.

    8. (pORFs)

      By this point in the introduction there are already quite a few abbreviations that probably aren't needed, such as pORFs and massively parallel sequencing. There are already a lot of abbreviations for software names and terms like MAGs, so see whether some unnecessary ones can be cut.

    9. PROKKA

      A small nit but Prokka isn't all capitalized

  14. Jul 2023
    1. Improved reconstruction of circularised phage and plasmid genomes

      This is just an extra thing, but it would be interesting to see how well this tool performs at recovering genomes from eukaryotic lineages since short-read methods produce very fragmented assemblies. Some of the metagenomes in this list are from communities with eukaryotes, such as the cheese samples: https://github.com/PacificBiosciences/pb-metagenomics-tools/blob/master/docs/PacBio-Data.md

    2. We grouped MAGs into three conventional categories based on the CheckM results: ‘near-complete’ if its completeness is ≥ 90% and its contamination is ≤ 5%, ‘high-quality’ if completeness ≥ 70% and contamination ≤ 10%, ‘medium quality’ if completeness ≥ 50% and contamination ≤ 10%.

      Did you also take into consideration the numbers of rRNAs/tRNAs in these categories, as in MIMAG/MISAG (https://www.nature.com/articles/nbt.3893)?
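
      If it helps, collecting the MIMAG-relevant counts per MAG is quick with standard tools (a sketch; file names are placeholders and output parsing is left rough):

        barrnap mag.fasta > mag_rrna.gff           # 5S/16S/23S rRNA predictions to GFF
        tRNAscan-SE -B -o mag_trna.txt mag.fasta   # -B = bacterial model; MIMAG high quality expects >=18 distinct tRNAs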

    3. Abstract

      It would probably help to bring visibility to the tool if the link to the github repository was in the abstract

  15. May 2023
    1. at our Github repository.

      link?

    2. Data and Code Availability

      It would be great if you could make the polished assemblies or assembled contigs analyzed in this study available since it takes quite a bit of work to get to that point

    3. querying only the best Foldseek hits, which are filtered for an e-value greater than 1e-10,

      Did you take other filtering criteria into account, such as TM-score? Or analyze how the e-value cutoff corresponded to TM-score?
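
      Foldseek can report the alignment TM-score alongside the e-value, which would make that comparison straightforward - something like (output column names per the Foldseek docs; paths are placeholders):

        foldseek easy-search query_models/ targetDB hits.tsv tmp/ --format-output "query,target,evalue,alntmscore"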

    4. After the initial assembly, additional assemblies were yielded using a secondary assembly pipeline. Briefly, reads for a given sample were aligned to uncircularized contigs obtained from the same sample with Minimap2 v2.24 (21) and were binned using MetaBAT2 v2.12.1

      So this was done prior to polishing?

    5. Near perfect TM scores within most clusters show that the same putative best structural homolog was often seen in samples widely separated by time,

      Here the Tm score is used to compare protein structures seen in multiple samples?

    6. FIG 7.

      I am not sure I understand this figure or why PFAM annotations were used here as well

    7. This result allowed us to identify a number of functions and pathways present in putative bacteriophage ACCs in this sample,

      Were the thresholds for the structural approach defined as a protein having any Foldseek structural hit with a functional annotation? Or did a threshold need to be met to consider a protein a good hit - such as a TM-score?

    8. structural homology vs. the entire universe of known and predicted protein structures using Foldseek (9).

      Was this using the Foldseek server? Or what databases did you compare against to consider for functional information?

    9. For this, we collected circular MAGs of > 1 Mbp

      Could you be throwing out candidate phyla that have small genomes but are likely circular with this filter? For example, I think Patescibacteria genomes are smaller than 1 Mbp and usually fall somewhere around 80% complete with CheckM, but end up as circular contigs.
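
      For what it's worth, the filter itself is cheap to relax and re-run - e.g., with seqkit (placeholder file names; 500 kbp chosen only as an illustrative lower cutoff):

        seqkit seq -m 500000 circular_contigs.fasta > circular_ge500kb.fasta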

    10. INHERIT package (11) which assigns scores based on the inferred likelihood of being bacteriophage; of these, 227 bacteriophage were predicted.

      Did you try other viral/phage prediction software, such as VIBRANT?
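
      A cross-check would be a single command - e.g., for VIBRANT (invocation per its README; thread count and paths are placeholders):

        VIBRANT_run.py -i circular_contigs.fasta -t 16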

    11. total of 227 putative bacteriophage.

      how were the putative bacteriophage sequences identified at this point? By size?

    12. sub-genomic ACCs

      There's already quite a few acronyms to keep track of (GAC, NA etc.) and I think this could be referred to as circular contigs throughout

    13. wetand computational-lab

      typo

    14. The biofilm initially aids in the remediation of wastewater from NAs, but ultimately overgrowth fouls the GAC beds necessitating frequent and regular exchange. As such, the samples collected in this study presents a unique opportunity to investigate and annotate NA degrading bacteria as it serves as a natural experiment.

      From the title of the manuscript and the abstract/importance sections, I'm somewhat confused about whether this manuscript focuses on structure-based annotation of bacteriophage sequences only, or of bacteria as well.

    15. Although many tools and techniques

      For clarity, I would start the introduction with this paragraph since the manuscript focuses on annotating sequences by structure and not so much about long-read sequencing technology itself

    16. In this study, we present wet and dry lab techniques which allowed us to generate 5432 high quality sub-genomic sized metagenomic circular contigs from 10 samples of microbial communities. This unique ecological system exists in an environment enriched with naphthenic acid (NA), which is a major toxic byproduct in crude oil refining and the major carbon source to this community. Annotation by sequence homology alone was insufficient to characterize the community,

      Are these sentences referring to circular contigs proposed to be phage, or to circular contigs in general? From the title I infer that you are only focused on phage sequences, but these sentences make it seem as if you are trying to annotate everything through a structural approach.

  16. Apr 2023
    1. In these assays, we found that pBI143 was indeed transferred from the donor to the recipient strains at a frequency of 5 x 10-7 and 3 x 10-6 transconjugants per recipient, respectively (Supplementary Fig. 2).

      wow, this is a really exciting result, and also exciting to see that you were able to get a positive hit from something that originated from a hypothesis from metagenomic data and validate it with cultured isolates - well done!

    2. Sequencing depth did not explain this observation, as pBI143 was highly covered (i.e., >50X)

      I am wondering if, in addition to coverage, you calculated breadth - as in, how much of the plasmid was covered at 50X for whichever version of pBI143 was detected? I've found this to be an important statistic to report for elements that might be highly conserved or similar to other sequences.
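
      CoverM, for example, can report breadth directly alongside depth (a sketch; the covered_fraction method name is per its documentation, and file names are placeholders):

        coverm contig --bam-files sample.bam -m mean covered_fraction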

    3. Our findings reveal the astonishing success of pBI143 in the human gut, where it occurs in up to 92% of individuals in industrialized countries with copy numbers 14 times higher on average than crAssphage, the most abundant phage in the human gut. We also demonstrate the potential of pBI143 as a cost-effective biomarker to assess the extent of stress that microbes experience in the human gut, and as a sensitive means to quantify the level of human fecal contamination in environmental samples.

      This paragraph summarizing the main results is written so well, and in such an exciting way. I totally wasn't expecting the last sentence previewing a way to detect levels of fecal contamination in environmental samples, and I'm excited to read those results!

    1. Still only 11% of the 1000 HQ MAGs examined by the authors encoded homologs to known exopolysaccharide gene clusters,

      I didn't see mention in the methods that the genomes from Singleton et al. were included, which is why I added that suggestion... so I think I'm confused about which reference genomes were used for these surveys - unless you are referring to the Singleton et al. authors detecting EPS genes in only 11% of the MAGs?

    2. In addition to triggering granulation and EBPR, selection for “Ca. Accumulibacter”

      What would also be fascinating: if you still have frozen biomass from these samples, you could perform metagenomic sequencing to see which specific clades/species of Accumulibacter you enriched for.

    3. The degree of enrichment of PAOs of up to as high as 83% of the amplicon sequencing read

      If you did the 16S analysis with DADA2, it would be interesting to know whether the 83% of reads assigned to Accumulibacter comprised a single ASV or several - implying either a clonal or a diverse Accumulibacter population, as explored in McDaniel et al. 2022: https://www.biorxiv.org/content/10.1101/2022.10.01.510452v1.full

    4. mycolic acid (long fatty acid), colanic acid, capsular heptoses, alginate, trehalose and rhamnose containing glycans (exopolysaccharides), sialic acid and CMP-N-acetylneuraminate (alpha-keto acid sugars), and N-linked glycosylation (binding of glycans to amino acids to form amino sugars or glycoproteins).

      I might have missed this, but it would be good to provide a table describing the accessions, such as the KO numbers used to annotate these pathways, to help with reproducibility.

    5. Clustered heatmap of functional gene categories potentially related to EPS metabolisms

      I'm also not sure I understand the different color schemes here since there are different shades of purple for different cluster categories in addition to the heatmap colors and highlighting certain taxa

    6. Figure 7.

      I know this is a large figure with a lot of text, but even zoomed in on my desktop much of the text is blurry. For the main figure, I wonder if there is a better way to visualize this - either by focusing on a few significant pathways, or by using the Anvi'o pangenomics tool for the genomes and pathways so that the phylogenetic organization is part of the same "heatmap" plot of the clusters.

    7. From this heatmap overview, “Ca. Accumulibacter” and “Ca. Competibacter” lineages form a relatively homogenous cluster of EPS genomic signatures (lineage clusters 4+5).

      It would be fascinating to see these results updated with more Accumulibacter references beyond those that were summarized at the time of the 2019 paper referenced to obtain these genomes

    8. EPS biosynthetic pathways

      Are EPS biosynthetic pathways usually clustered together? If so, this could be another reason to include genomes from Singleton et al. 2021 since they generated high-quality genomes from full-scale WWTPs and searching among long-read assembled genomes could improve retrieval of biosynthetic clusters

    9. genomes of the flanking and reference populations

      I might have missed it, but was metagenomic sequencing done for the bioreactors operated in this study? Or by "flanking populations" do you mean the other activated sludge/granular sludge genomes you collected from other studies?

    10. namely the putative PAO Dechloromonas

      I think "putative" is fine and accurate here, but interestingly, Petriglieri et al. 2021 showed Dechloromonas species that are experimentally confirmed PAOs: https://www.nature.com/articles/s41396-021-01029-2

    11. following the bioinformatics procedure summarized

      It's probably still a good idea to list and cite the software and version you used - e.g., whether it was QIIME, Mothur, or the DADA2 package, and which version - so it's directly stated within this publication.

    12. Figure 3.

      I'm not sure whether you used the plot_heatmap function from ampvis2 for this figure, but I think the default behavior is to plot by "sqrt" rather than in this continuous fashion, since most individual cells will be close to 0. With this extreme pattern of a few very abundant lineages and many rare ones, it can be difficult to visualize, so it might be preferable to change the scaling of the coloring.

    13. Genomes from granular sludge microorganisms were imported to compare the genetic signatures of “Ca. Accumulibacter” and “Ca. Competibacter” into the broader context of the microbial ecosystem of BNR granular sludge

      Were these genomes assembled from samples from full-scale granular sludge systems or from other lab-scale granular sludge bioreactors? It might also be interesting to pull other traditional activated sludge genomes (the most representative catalog is from Singleton et al. 2021: https://www.nature.com/articles/s41467-021-22203-2) to see whether there are overall differences in EPS production between traditional activated sludge genomes and those from granular systems.

    14. A representative set of 19 genomes of “Ca. Accumulibacter” was selected out of the more than 30 metagenome-assembled genomes (MAGs) deposited in public repositories which were recovered from a previous study (Rubio-Rincon et al., 2019)

      Since that study, quite a few papers have produced additional reference genomes for Accumulibacter - mostly summarized in Petriglieri et al. 2022 (https://journals.asm.org/doi/full/10.1128/msystems.00016-22), where the phylogeny and nomenclature of Accumulibacter have been updated beyond the ppk1 nomenclature system. Adding genomes from this set might strengthen your analysis, since quite a few of these references were generated with Nanopore long reads and are higher quality, including some from full-scale Danish WWTPs.

    15. “Ca. Accumulibacter” was first targeted using the PAOmix set of probes PAO462, PAO651, and PAO846 (Crocetti et al., 2000). The detection of PAOs was refined by only using the PAO651 probe, since other PAO462 and PAO846 hybridize to other closely related lineages (Albertsen et al., 2016). The PAO clades I and II were targeted by the probes Acc-1-444 and Acc-2-444, respectively (Flowers et al., 2009; Welles et al., 2015).

      These experiments may have been done prior to this publication, but there are now updated sets of Accumulibacter probes that account for some of these probes targeting non-Accumulibacter (such as Propionivibrio) and overestimating abundance: https://journals.asm.org/doi/full/10.1128/msystems.00016-22

    1. It was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files

      I think certain infrastructure improvements could make this more user-friendly and stable, and help it adhere to software engineering best practices - implementing tests, better versioning of the individual software packages included, etc. These are a few problems and potential solutions I see:

      1) This pipeline requires managing very large conda environments, which can get out of hand very quickly, in addition to potential difficulties with installation and solving environments. If the authors would like to stay with conda environments, a quick fix for slow environment solving and installation would be using mamba to build these environments.

      2) Since the pipeline is written as a series of bash/R/Python scripts depending on conda environments, it is somewhat fragile and hard to guarantee to work on most infrastructures, or even the intended one. Even if the actual installation process is made smoother, there is still the problem of verifying which versions of tools were used in the pipeline. There is a way to export conda environments and versions, but it's not a perfect solution. I think an involved pipeline like this would greatly benefit from being executed with a workflow manager such as Snakemake or Nextflow, my personal opinion being that it should be implemented in Nextflow (see the sketch after this list). Although Snakemake is easier to learn and integrates conda environments more easily, it's difficult to ensure those pipelines will work on diverse platforms. Nextflow can also use conda environments, but there is a preference for Docker or Singularity images, which solves some of the issues with tracking versions. Additionally, Nextflow has testing and CI capability built in, so ensuring future updates still work as expected is easier. Finally, Nextflow has been tested on various platforms - from HPC schedulers and local environments to cloud providers.

      3) Related to the issue above, I don't see how this pipeline can be run in a high-throughput way, because it isn't written as a DAG like Snakemake/Nextflow pipelines are. My understanding is that you would have to run all of the samples together in more of a "for loop" fashion, so it doesn't take advantage of the HPC or cloud resources one might have. The only way somebody could use this in the cloud is on a single EC2 instance, which isn't very cost- or time-efficient. Making the pipeline truly high-throughput, so samples can be run in parallel for certain tasks and then aggregated, requires DAG infrastructure.
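
      To make points 2 and 3 concrete, this is roughly what the user-facing side of a workflow-manager port could look like (a hypothetical sketch - "mudoger-nf" and the --reads parameter are invented for illustration; -profile, -resume, and -with-report are standard Nextflow options):

        # per-sample parallelism, pinned containers, and restart-after-failure, all at invocation time
        nextflow run mudoger-nf -profile docker -resume -with-report --reads 'samples/*_{1,2}.fastq.gz'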

    2. on paired-end short-sequence reads generated by ILLUMINA machines, but future updates will include tools to work with data from long-read sequencing.

      Related to my earlier comment, adding support for long reads will be much easier if the underlying infrastructure is a workflow manager such as Snakemake or Nextflow. And even though there is an initial learning curve for these tools, communities such as nf-core already have many pre-made, community-sourced modules to drop into workflows (https://nf-co.re/modules), which would cut down the time it takes to add new features to the pipeline.

    3. We tested the MuDoGeR pipeline using 598 metagenome libraries

      I was expecting more expanded results on the breakdown of MAG lineage recovery by the biome the metagenome came from. It might also be good to expand on why these metagenomes specifically were chosen - was it because they had a certain sequencing depth, or came from biomes of interest? It could be good to selectively choose metagenomes in which you would expect eukaryotes in high abundance, such as certain fermented foods, for comparison to these other environments.

    4. MuDoGeR v1.0 at a glance

      One thing I am unclear about is how the pipeline or its modules handle a single sample failing during a run - will it halt the entire pipeline or module? For example, if the RAM calculation ends up being incorrect and the assembly program runs out of memory for a single sample, will this cause the pipeline to end? Is there some --resume functionality so you don't have to restart a pipeline from the beginning if there is a problem halfway through a module? (Both major workflow managers ship this behavior; see the sketch below.)
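
      For reference, the flags below are real; the pipeline name is a placeholder:

      ```bash
      # Nextflow caches every completed task, so a failed run restarts at the failure
      nextflow run your-org/pipeline -resume

      # Snakemake: keep processing independent samples past a failure, then rerun
      # anything left incomplete on the next invocation
      snakemake --keep-going --rerun-incomplete
      ```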

    5. MuDoGeR was divided into five modules

      I really appreciate that the pipeline was split into different modules, so it encourages the user to manually check their data and outputs at various steps, and so you can run it from various points instead of running the entire thing.

    6. MuDoGeR is open-source software available

      I appreciate the very extensive documentation and examples for running the pipeline. I think the documentation would be better structured as a docs site such as readthedocs or mkdocs, since this is such a long and extensive README. Oftentimes while scrolling, the page will freeze for me because there are several graphics, and it's a long README without a table of contents to guide the user.

    7. Biodiversity analysis with MuDoGeR

      Is there a final dereplication and check of contigs between the different lineages to make sure the same contig didn't end up in multiple bins of different lineages?

    1. and the past study9

      I'm not familiar with this past study - were the MAGs from it retrieved from the same WWTP? Was a certain mapping threshold, such as coverage or breadth, used to ensure that a similar population represented by that genome is actually present in the sample? (Breadth = how much of the genome is actually covered. For example, if you have a breadth of 90% and coverage of 20X, then 90% of the genome is covered; but if you have really high coverage with low breadth, you could just be mapping to something highly conserved rather than that specific population.) A thresholded check along these lines is sketched below.
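
      As one concrete way to apply such a threshold, coverM (a real tool; the BAM and genome paths here are hypothetical) reports breadth as covered_fraction and can discard genomes below a cutoff:

      ```bash
      # Per-MAG mean depth and breadth; genomes under 50% breadth are zeroed out
      coverm genome \
          -b reads_vs_mags_sorted.bam \
          --genome-fasta-directory mags/ \
          -m mean covered_fraction \
          --min-covered-fraction 0.5
      ```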

    2. Each phylogenomic tree was constructed using ITOL v2.1.7

      I wasn't aware that iTOL could construct a phylogenetic tree; I've only used it as a tree-viewing program. There should be mention of what program was used to construct the tree from the MUSCLE alignment (FastTree or RAxML, for example) and the parameters used for the tree-building program.

    3. To compare comammox and anammox ammonia oxidation rates with those reported in literature, abundance adjusted rates (μmol N/mg protein-h) were calculated by dividing the average ammonia consumption rate (mg-N/g TS-h) obtained from aerobic or anaerobic ammonia oxidation batch assays by the portion of total metagenomic reads mapping to comammox or anammox bacteria metagenome assembled genomes (see below) as their approximate contribution to total solids measured and then using the conversion factor 1.9 mg dry weight/mg protein25.

      This adjusted abundance calculation, based on metagenomic reads mapping back to anammox/comammox MAGs, seems highly dependent on how contiguous your assembly is and on whether the assembled MAG actually captures the population responsible for this activity. Therefore I'm worried whether this is the best or most accurate way to make this rate calculation, and whether there is a better way to do it, either through lineage-specific qPCR primers or an activity-based assay. (My reading of the calculation is written out below.)
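
      A minimal write-up of the calculation as I read it, with f_reads = the fraction of total metagenomic reads mapping to the comammox or anammox MAGs; the mg-N-to-µmol-N conversion (1 mg N = 1000/14 µmol N) is implied by the stated units but not spelled out in the text:

      ```latex
      \mathrm{rate}_{\mathrm{adj}}
        = \frac{\mathrm{rate}_{\mathrm{batch}}}{f_{\mathrm{reads}}}
          \times 1.9\ \tfrac{\mathrm{mg\ dw}}{\mathrm{mg\ protein}}
          \times \tfrac{1000}{14}\ \tfrac{\mu\mathrm{mol\ N}}{\mathrm{mg\ N}}
      % f_reads sits in the denominator, so any assembly/binning bias in the
      % MAG read recruitment directly rescales the reported rate.
      ```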

    4. The decrease in the Nitrospira abundance could be the reason why several of the previously assembled MAGs could not be assembled in the current study despite the fact that 5 out of 7 of the previously assembled Nitrospira MAGs had 90% of their genomes covered using reads from this study

      Again, I don't think this is the only possible reason. You could try to answer this with a coassembly, even though that increases complexity and can sometimes make things more fragmented; from the coassembly you could then just pull out the putative comammox bins of interest and ignore everything else. The other possibility is that, although you observed low strain diversity in the previous study (which I haven't read), there could be higher strain diversity in these samples, which would also lead to difficulties in assembly.

    5. but at very low abundances and thus their genomes were not successfully reconstructed.

      I'm not sure this means that the potential comammox bacteria/AOB were at low abundance and that's why they didn't assemble. It could be that there was higher strain diversity in these samples than in those from which the previous MAGs were assembled, and the contig you aligned with high percent identity is just highly conserved or has low diversity. You could instead check whether the contig with amoA ended up in a low-quality bin, and calculate nucleotide diversity on those contigs to see whether other contigs have high diversity, which could be a reason why they didn't assemble well (see the sketch below).
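
      A sketch of how per-contig nucleotide diversity could be pulled with inStrain (a real tool; the file names are placeholders):

      ```bash
      # Map reads back to the assembly first, then profile microdiversity per scaffold
      inStrain profile reads_vs_assembly_sorted.bam assembly.fasta -o instrain_out
      # per-contig nucleotide diversity is reported in the scaffold-level output table
      ```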

    6. Biomass attached to six pieces of media collected from the aeration tank were scrapped using a sterile scalpel and homogenized using a sterile loop.

      I might just be misunderstanding how the apparatus or biofilm is structured, but is it fine to homogenize biomass from six pieces of media in this way? Is it expected that these different pieces should be pretty similar, or could heterogeneity impact downstream analysis?

    7. mapping all sample reads

      I think I'm confused about how many samples there are - the DNA extraction methods above make it seem that biomass from six pieces of media is homogenized into one sample, and only one sample is sequenced, whereas here there is reference to multiple samples that reads are mapped from.

    8. Therefore, the relative abundance of all nitrifying groups was calculated from a set of dereplicated MAGs recovered from both studies (Table SI-3).

      I think this could potentially be an inaccurate way to do this if you don't have the coverage and breadth statistics mentioned in a prior comment to make sure these populations are actually "present" in the sample. For example, in Crits-Christoph et al., mapping reads from soil samples to MAGs required at least 50% of the genome to be covered at 5X, so the breadth there is 0.5. I can't tell from this statement whether you are requiring 50X coverage or 50% breadth at some specific coverage; because you refer to the 50% as coverage but describe it with the definition of breadth, it's a little confusing.

    9. Nitrospira and Brocadia MAGs represented 6.53 ± 0.34 % and 6.25 ± 1.33% of total reads in the sample

      It might be good to also include stats on the % of reads mapping back to the entire metagenomic assembly, to give context for how complete your recovery effort was (this falls straight out of standard mapping QC; see below).
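
      For example, with samtools (real command; the BAM name is a placeholder):

      ```bash
      # Overall fraction of reads recruited by the assembly
      samtools flagstat reads_vs_assembly_sorted.bam
      # the "mapped (XX.XX%)" line gives the recruitment rate to report
      ```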

    10. Further, the genome coverage of previously assembled comammox (JAMMSM_CMX_1) and Nitrosomonas (JAMMSM_AOB_1) MAGs were 80.6 ± 9.8 and 72.3 ± 1.0%, respectively

      So I think I'm answering my previous question here: the prior assemblies have coverage of approximately 80X and 70X, and you required that they have at least 50% breadth? I think this could be clarified further, reporting the actual breadth these genomes have when reads from these samples are mapped back to them. For full-scale WWTPs, I've seen reads map back to MAGs retrieved from different samples with breadth well into the 90%+ range.

    11. Methodological details, additional figures, and tables are provided in Supplemental Materials.

      Looking further at the SI, I think there is some confusion about what "genome coverage" refers to, as it's also flip-flopped in the main text. Coverage is how many times a position is covered with reads, so 20X coverage means each position is covered on average by 20 overlapping reads; this is also referred to as depth. The calculation I see in the SI table for "genome coverage", and sometimes referred to throughout the text, is actually breadth, which is the fraction of the genome that is covered and should be between 0 and 1. This is described in the inStrain paper: https://www.nature.com/articles/s41587-020-00797-0. I'm not sure whether the authors are getting these coverage/breadth calculations from coverM or inStrain, but it's a little confusing which is being referred to in the paper, and that's an important distinction when using genomes that were assembled outside of the samples in question. (Both metrics can be computed directly, as in the sketch below.)
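
      To make the two definitions concrete, both can be computed from a sorted, indexed BAM with samtools (real commands; the BAM name is a placeholder):

      ```bash
      # Breadth = fraction of positions with depth > 0; depth = mean per-position coverage
      samtools depth -a reads_vs_mag_sorted.bam | awk '
          { sum += $3; n++; if ($3 > 0) covered++ }
          END { printf "breadth = %.3f, mean depth = %.1fx\n", covered/n, sum/n }'
      ```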

    12. Supporting information

      I didn't see a section describing data availability for the metagenome or MAGs assembled in this study - will the data be made publicly available in the SRA/GenBank?

    13. Comammox and Nitrosomonas relative abundances were about 0.90 ± 0.8 RPKM and 0.40 ± 0.05 RPKM, respectively (Figure 5C). This differs from our prior work, where comammox and Nitrosomonas relative abundances were 22 ± 6.26 and 21.04 ± 6.17 RPKM, respectively (Figure 5B). Thus, it is very likely that the low abundance of comammox bacteria and Nitrosomonas affected the assembly and binning process, which did not allow for the reconstruction of these genomes even though they are still present in the system.

      I'm confused about which mapping stats to which MAGs you are referring to in order to come to this statement - is it the relative abundance of the MAGs assembled from the prior study that is low, and you are therefore inferring that's why you couldn't assemble comammox MAGs from this study?

    14. Brocadia (n=2) and Nitrospira (n=3) MAGs recovered from this study (Table SI-3) were

      I think the table describing these 5 MAGs should be a main table (still keeping the SI table describing the reference genomes), modified to include the GTDB taxonomy, % GC, and length in Mbp (or make the units clear), with the number of contigs reported as a whole number rather than rounded. You might also want to include in this table the relative abundance calculation per sample for each genome.

    1. A few putative HGT events could be inferred from the larger clade of the HgcA tree e.g., Marinimicrobia-HgcA clustered with Euryarchaeota-HgcA in the archaeal cluster,

      Was the inference made from position in the tree, or by analyzing the pairwise sequence identity of the proteins from these Archaea/Marinimicrobia? I am curious because in McDaniel et al. 2020 mSystems we also found only a few potentially clear cases of HGT, but did so through pairwise sequence analysis - for example, for a case involving Deltaproteobacteria/Acidobacteria/Verrucomicrobia/Actinobacteria in permafrost (a quick check along those lines is sketched below).
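
      A minimal version of that pairwise check with BLAST+ (real flags; the FASTA file names are hypothetical):

      ```bash
      # Percent identity between the putative donor/recipient HgcA proteins
      blastp -query marinimicrobia_hgcA.faa -subject euryarchaeota_hgcA.faa \
             -outfmt '6 qseqid sseqid pident length evalue'
      ```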

    2. Figure 3.

      Are these trees rooted with either cdh outgroups or fused hgcAB? I see the symbol for fused hgcAB, but in Gionfriddo et al. 2020 fused sequences are typically used to root the tree for accurate topology inference.

    3. Nevertheless, several hgcA+ genomes did not carry neighbouring hgcB genes, including all Nitrospina and a few Deltaproteobactiera and Firmicutes, potentially because of gene loss during evolution or incomplete transfer events (i.e., only hgcA genes were acquired during the HGT events).

      I wanted to clarify something from the methods - were just the hgcAB proteins pulled down from UniProt, or the entire genome sequences for these hgcAB+ representatives? If you did have the entire genomes, then for the cases where hgcB was missing, did you check whether hgcA fell close to the end of a contig? I think Peterson et al. 2020 ES&T had a couple of cases where hgcA was at the end of a contig (a quick way to flag these is sketched below).
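
      A rough sketch of that check, assuming you can tabulate contig lengths and hgcA coordinates; both input files and the 2 kb cutoff are hypothetical:

      ```bash
      # contig_lengths.tsv: contig<TAB>length ; hgca_hits.tsv: contig<TAB>start<TAB>end
      awk 'NR==FNR { len[$1] = $2; next }
           ($2 < 2000) || (len[$1] - $3 < 2000) { print $1 "\thgcA within 2 kb of contig edge" }' \
          contig_lengths.tsv hgca_hits.tsv
      ```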

    4. To investigate the evolutionary history of HgcA, we further enlarged the sample size by retrieving HgcA homologs in UniProt Reference Proteomes database v2022_03 at 75% cutoff (RP75). Two other datasets, including one containing 700 representative prokaryotic proteomes constructed by Moody et al. (2022) and another containing several novel hgc-carriers published by Lin et al. (2021), were retrieved and incorporated into the RP75 dataset. Totally 169 HgcA sequences were collected after removing redundancies

      I might have missed something, but it appears that you have included hgcAB sequences that are either in the PF03599 protein family or from the Lin et al. MAGs. Are the HgcA protein sequences from the large curation efforts of McDaniel et al. 2020, Capo et al. 2022, and Gionfriddo et al. 2020, for example, integrated into this UniProt release? It would seem easier in this case to pull directly from the Capo et al. database, since those are curated sequences with metadata to link back to - unless I'm missing how UniProt accessions work with incorporating data from MAGs.

    5. We mapped the presence/absence of merB and hgc genes onto the Tree of Life

      If I interpreted the methods correctly, this tree only includes ribosomal proteins from genomes that have hgcAB, merB, or both. This isn't exactly overlaying hgcAB/merB presence/absence onto the "tree of life", because to accurately portray these relationships you would also want to include genomes that have neither of these operons. To do this accurately you would want to overlay the information you have here on a backbone that also includes genomes such as those in Hug et al. 2016.

    6. Our study reveals an ancient origin for microbial mercury methylation, evolving from LUCA to radiate extensively throughout the tree of life both vertically, albeit with extensive loss, and to a lesser extent horizontally.

      I think to make a statement like this you would need more extensive analyses, quantitatively calculating gene transfer rates and using tree-dating methods such as in https://journals.asm.org/doi/10.1128/mBio.00644-17

    1. Fig. 2.

      It would help provide additional context for these genomes if additional layers were added to the tree showing completeness, redundancy, genome size, etc., so it's easy to compare genome quality across the tree. This can be done with iTOL or EMPRESS as added metadata layers (an example iTOL layer is sketched below).
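
      For instance, an iTOL simple-bar dataset file carrying completeness per genome could look like this (the header keywords are iTOL's template fields; the genome IDs and values are made up):

      ```bash
      # Write a minimal iTOL DATASET_SIMPLEBAR annotation file
      cat > completeness_layer.txt <<'EOF'
      DATASET_SIMPLEBAR
      SEPARATOR COMMA
      DATASET_LABEL,Completeness (%)
      COLOR,#4575b4
      DATA
      GenomeA,97.3
      GenomeB,78.1
      EOF
      # Upload completeness_layer.txt onto the tree in the iTOL web interface
      ```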

    2. Four bacterial families dominate lichen metagenomes

      It would be interesting to follow up, for groups that are core such as Lichenihabitans, whether they are the same or different strains across these samples, and whether there are differences in where those hotspots of diversity fall relative to the lichen type.

    3. A striking potential metabolic complementarity to emerge from our annotations is the capacity of many frequent lichen bacteria to code for cofactors needed by one of the dominant eukaryotic symbionts

      I'm interpreting that up to this point, functional annotation and pathway exploration were only performed for the bacterial genomes and not the fungal/algal MAGs? Was this because of the difficulty in performing ORF prediction/functional annotation without corresponding RNAseq data, or is it something planned for the future? It would be interesting to see whether the corresponding fungi have transporters for those cofactors.

    4. Supplementary Table 2).

      Something useful to add to this table, and a suggestion for added metadata in Figure 2, would be the # of contigs for each genome. I'm assuming most or all of these metagenomes were obtained from Illumina sequencing data, so I would presume that a lot of the eukaryotic MAGs are going to be pretty fragmented, and that's important information to include.

    5. Supplementary Table 1).

      I think it's important to include in this table the sequencing technology for each metagenome (e.g., Illumina HiSeq PE 2x150bp). Even though I could find this by clicking through the SRA accessions, it helps to have it directly here, especially because it seems a lot of the data comes from the 2019 UC Boulder study.

    1. a baseline for integrating different data types in “in natura” settings.

      I'm not sure this is entirely true, since this study uses a synthetic community of specifically chosen species with no strain diversity, so the dynamics are mostly limited to changes in species composition rather than the dynamics of similar strains, or other factors such as phages, microbial eukaryotes, etc. The largest "natural" multi-omics studies I can think of that integrated different data types are Woodcroft et al. from permafrost https://www.nature.com/articles/s41586-018-0338-1 and Herold et al. from wastewater: https://www.nature.com/articles/s41467-020-19006-2 (which includes metabolomics)

    2. we find that all omics methods with species resolution in their readouts are highly consistent in estimating relative species abundances across conditions.

      From the outset I find this quite surprising, because there are examples from low-complexity enrichment communities and complex naturally occurring communities where metagenomic/metatranscriptomic/metaproteomic data do not agree on which species are most abundant versus most "active". Is this because the synthetic community members were spiked in at the same abundance, or because these particular synthetic communities exhibit "stable" behavior over time? I think some of these results - the different multi-omics measurements being consistent with one another - may only be applicable to synthetic communities because of the inherent makeup of the community.

    3. For metaproteomics data, we estimated species abundance

      I don't think this is species abundance, but rather the relative activity of that species based on protein intensity. I also think there are difficulties with calculating species "abundance" this way because, if I remember correctly, redundant peptides (or "core genome" proteins) that can't be attributed back to a single genome have to be tossed out - so this calculation is based on protein intensities for unique peptides attributed back to a particular genome, correct?

    4. Our results show generally high consistency between omics data types in relative species abundance estimations, and underline that metaproteomics can, in principle, provide robust species abundance estimates, at least for synthetic microbial communities, albeit with lower sensitivity.

      I think I might be confused by the overall framing - is the take-home that these different methods should be consistent, so you could use any one alone to survey a community and know it would give you the same information? Or that, when they don't coincide, one method is the most reliable for the information that is needed? Some of these conclusions seem to only be applicable to stable, simple synthetic communities.

    1. To expand knowledge of Cyanobacteria viruses largely from terrestrial environments,

      One of the hypotheses postulated in the abstract was that Cyanobacteria from terrestrial environments have viruses that help them adapt to harsh environments. It would be cool to directly compare against cyano co-cultures or simple communities from aquatic or industrial settings - the latter of which probably offer "comfier" resources for cyanos - to see how those viral communities differ or overlap.

    2. We clustered these 814 viral sequences

      I think I'm interpreting this as: regardless of the CheckV quality results, all 814 viral sequences were used for downstream steps? I know that programs like VIBRANT and other phage-identification tools can give a lot of false positives, and curation is needed to toss those out, so I'm curious why the decision was made to keep all of them (a minimal quality filter is sketched below).
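
      For example, CheckV's quality_summary.tsv can be filtered on its checkv_quality column (a real output file; the quality tiers kept here are just one reasonable choice):

      ```bash
      # Keep only Complete / High-quality / Medium-quality predictions
      awk -F'\t' 'NR==1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") q = i; print; next }
                  $q ~ /Complete|High-quality|Medium-quality/' \
          quality_summary.tsv > votus_filtered.tsv
      ```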

    3. first used Prodigal v. 2.6.3 to predict open reading frames in vOTU representative genomes using the -p meta option

      If you wanted to do phage-specific gene calling and functional annotation you could use https://github.com/deprekate/PHANOTATE, which could also give interesting results if you compared the cyanobacterial viral composition of harsh vs. not-harsh environments (like photobioreactors). A minimal invocation is sketched below.
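
      Based on my reading of the PHANOTATE README (flags may differ by version; the input file name is a placeholder):

      ```bash
      # Predict phage ORFs from a vOTU representative genome
      phanotate.py vOTU_representative.fasta -f fasta > phage_orfs.fna
      ```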
