5 Matching Annotations
  1. Last 7 days
    1. Structure-aware protein sequence alignment using contrastive learning

      I found this study very interesting and creative! Fine-tuning the embedding space to account for structural similarity via contrastive learning seems like a wonderful idea and the results are very impressive. Here are some of my thoughts about your paper, presented in no particular order. Please feel free to take or leave any of my suggestions.

      One advantage of CLAlign over structural aligners is that structures don't need to be calculated. However, the hardware requirements for CLAlign are probably non-trivial, since pLM embeddings still have to be computed, and the manuscript does not state them, so it's hard to know. Relatedly, no information is provided about CLAlign's speed. I think the manuscript should be expanded to include detailed runtime statistics and hardware requirements so that CLAlign can be properly benchmarked against the other tools.

      While Table 1 gives us an overall picture of the alignment quality, it would be nice to know the tool's strengths and weaknesses. How does it perform when sequences are distant homologs? Or when there are large length mismatches? Since embedding-based alignments are state-of-the-art, this kind of information would be broadly useful for readers.

      Figure 1 looks more like a draft than a finished figure, and without a caption it is difficult to interpret.

      The performance is very impressive, and it has me curious how much further the performance could be improved simply by increasing the epochs or training dataset. Visualizing the loss curve could help contextualize the performance and help readers understand the extent to which there is room for improvement.

      Small notes:

      • Throughout the manuscript, pLMs are referred to generically, without any specificity. But there are many different architectures (e.g. BERT-style, T5-style, autoregressive), and I found this lack of specificity confusing.

      • There are many grammatical mistakes. Consider passing the manuscript through a grammar checker.

      Final thoughts:

      Great work! I am curious to try CLAlign once it is made available.

  2. Jun 2024
    1. The right panel shows the cumulative TM-score plotted against runtime in seconds

      My apologies if I missed this, but I was expecting the Methods section to explain what hardware was used for the right panels. In particular, I was curious whether GTalign was run in CPU-only mode or whether GPUs were used. Perhaps these details could be added either as a subsection of the Methods or as a quick note in the Figure 1 caption.

    2. user-friendly nature

      I think GTalign could be made more user-friendly with simpler install instructions. In my opinion, installation difficulty is likely the largest barrier preventing its adoption by the scientific community. See this issue for details: https://github.com/minmarg/gtalign_alpha/issues/1

    3. Notably, the desktop-grade machine, housing a more recent and affordable GeForce RTX 4090 GPU, outpaced the server with three Tesla V100 GPU cards when running GTalign. The detailed runtimes for each GTalign parameterized variant on these diverse machines are presented in Table S5.

      This is very surprising. Is there a dataset size at which the server starts to eke out performance gains?

    4. In the middle panel, the alignments are sorted by their (TM-align-obtained) TM-score. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. The arrow denotes the largest difference in that number between GTalign (732,024) and Foldseek (13,371)

      The middle panel presents the data in a way I've never seen before, and I had quite a difficult time wrapping my head around it. I think my confusion boils down to two main concerns: (1) Why are the curves from the left panels repeated in the middle panels? and (2) I think it is incorrect to label the x-axis as "# top hits". I would have understood this plot right away if the curves were removed and the x-axis label were replaced with "# hits with TM-score ≥ 0.5".
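
      To make the suggestion concrete, here is a minimal sketch of the summary statistic I believe the panel is really about. The tool names and score distributions are entirely made up for illustration and are not the paper's data; the only point is that the quantity encoded by the vertical lines is a single count per tool:

      ```python
      import numpy as np

      # Hypothetical per-alignment TM-scores for two tools
      # (illustrative values only, NOT the benchmark's real data)
      rng = np.random.default_rng(0)
      scores = {
          "ToolA": rng.beta(5, 2, size=1000),  # skewed toward high TM-scores
          "ToolB": rng.beta(2, 5, size=1000),  # skewed toward low TM-scores
      }

      # The statistic the vertical lines encode: the number of alignments
      # whose (TM-align-obtained) TM-score meets the 0.5 threshold.
      hits = {tool: int(np.sum(s >= 0.5)) for tool, s in scores.items()}
      print(hits)
      ```

      Plotting `hits` directly (e.g. one value per tool) would convey the "# hits with TM-score ≥ 0.5" comparison at a glance, without the repeated curves.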