Reviewer #1 (Public review):
Summary:
In this manuscript, Taujale et al. describe an interdisciplinary approach to mine the human channelome and to discover orthologues across diverse organisms, culminating in the delineation of co-conserved patterns in an example ion channel: CALHM. The paper comes in two sections: one where 419 human ion channels and 48,000+ channels from diverse organisms are found through a multidisciplinary data mining approach, and a second where these data are used to find co-conserved sequences, whose functional significance is validated via experiments on CALHM1 and CALHM6. Overall, this is an intriguing data-first approach to better understand even understudied ion channels like CALHM6. However, more needs to be done to pull this story together into a single, coherent narrative.
Strengths:
This manuscript takes advantage of modern-day LLM tools to better mine the literature for ion channel sequences in humans and for orthologous ion channel sequences in other species. The authors explore the 'dark channelome' of understudied ion channels to better reveal what evolution can tell us about our own proteins, and in the final section of the paper they illustrate how this information can be used in experimental studies. Finally, they provide a wealth of information in the supplementary tables (in the form of Excel spreadsheets) for others to explore. Overall, this is a creative approach to a wide-reaching problem that can be applied to other families of proteins.
Weaknesses:
Overall, while a considerable amount of work has been done for this manuscript, the presentation, both in terms of writing and figures, leaves much to be desired. One can imagine a story that clearly describes the need for a better-curated sequence database of ion channels, and clearly describes how existing resources fall short, but here this is not very clearly illustrated.
One question that arises from the part of the manuscript that discusses the identification and classification of ion channels is whether the authors plan to make these sequences available to the wider public. For the 419 human sequences, a small database in which the sequences can be easily searched and downloaded would be desirable. There are a variety of acceptable formats for this (GitHub, figshare, Zenodo, or a university website), any of which would allow a wider community to access their hard work. The authors have included enough information in the supplementary tables that this could be done by a motivated reader, but providing such a resource would greatly expand the impact of this paper. The same question can be asked of the 48,000+ ion channels from diverse organisms. For these, there is the additional worry that some may not be properly sequenced genes. What checks have been done to confirm this? UniProt contains many unreviewed sequences, especially from single-celled organisms. Potentially, this is covered by the sentence in the Methods: "Finally, the results obtained from both the full-length and pore domains were retained as true orthologous relationships to remove extraneous hits." But this process could be discussed in more detail, clearly showing that gene duplicates and fragments have been excluded from the final set of ion channel orthologues. Related to this, does the analysis include or exclude isoforms?
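To make the request concrete, the kind of check meant here could be as simple as the following sketch. It is not the authors' pipeline: the `Record` structure, the `filter_orthologues` function, and the 70% length threshold are all hypothetical choices made for illustration. The only outside fact relied upon is that UniProt isoform-specific accessions carry a hyphenated suffix (for example `P12345-2`).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal view of one candidate orthologue."""
    accession: str  # e.g. "P12345", or "P12345-2" for an isoform-specific entry
    organism: str
    sequence: str


def filter_orthologues(records, ref_length, min_fraction=0.7):
    """Drop isoform entries, likely fragments, and exact duplicates.

    ref_length is the length of the human reference channel;
    min_fraction is an assumed, arbitrary cutoff for calling a fragment.
    Returns the kept records and a count of why the others were dropped.
    """
    kept = []
    dropped = Counter()
    seen = set()  # (organism, sequence) pairs already retained
    for rec in records:
        if "-" in rec.accession:
            dropped["isoform"] += 1
        elif len(rec.sequence) < min_fraction * ref_length:
            dropped["fragment"] += 1
        elif (rec.organism, rec.sequence) in seen:
            dropped["duplicate"] += 1
        else:
            seen.add((rec.organism, rec.sequence))
            kept.append(rec)
    return kept, dropped


# Toy input; the sequences are placeholders, not real channels.
example = [
    Record("A1", "mouse", "M" * 100),
    Record("A1-2", "mouse", "M" * 90),
    Record("B2", "mouse", "M" * 100),
    Record("C3", "fly", "M" * 40),
    Record("D4", "fly", "M" * 80),
]
kept, dropped = filter_orthologues(example, ref_length=100)
print([r.accession for r in kept], dict(dropped))
```

Reporting the per-reason counts from a filter of this kind in the Methods would directly answer the questions about fragments, duplicates, and isoforms.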
Another aspect of the identification and classification of ion channel genes that could be improved is the figures for this section. One is relatively used to seeing trees as shown in Figures 3 and 4, which show relationships between genes as distances or evolutionary relationships. The decision to show the families of ion channels in Figure 1 as pie charts within a UMAP embedding is intriguing but somewhat non-intuitive and difficult to understand. Illustrating these results with a standard tree-like visualization of the relationship of these channels to each other would be preferred.
One aspect of the pie-chart/UMAP visualization that works well is the highlighting of the 'dark' ion channels according to their status as designated by IDG, which highlights a strength of this whole paper. However, throughout the paper, this could be emphasized more as the key advantage of this approach, along with how this or similar approaches could be used for other families of proteins. Specifically, in the initial statement describing 'light' vs 'dark' channels, the importance of this distinction and the historical preference in science to study that which has already been studied could be discussed more, including references to other studies that take this kind of approach. An example of a relevant reference here is the Structural Genomics Consortium and its goal of solving structures of proteins whose functions may not be well characterized. Furthermore, this initial statement mentioning 'light channels' was initially confusing: does this mean light-sensing channels? As one reads on, this is clearly not the case, but for such an important central focus of this paper, these kinds of misunderstandings do not serve the authors well. Clarifying these motivations throughout the entire paper would strengthen it considerably.
Additionally, since the authors have generated this UMAP visualization, it would be interesting to understand how the human vs orthologue gene sets compare in this space. Furthermore, Figure 1, for just the human analysis, should say more clearly that this is an analysis of the human gene set and include more of the information in the text: 419 human ion channel sequences, 75 sequences previously unidentified, 4 major groups and 55 families, 62 outliers, etc. Clearer visualizations of these categories and numbers within the UMAP (and newly included tree) visualization would help guide the reader to better understand these results.
One of the most peculiar aspects of this paper is that it feels like two papers, one about better documenting the ion channel genes across species, and another with well-executed experiments on CALHM channels. One suggestion for how to link these two sections together better is to show that previous methods to analyze conserved residues in CALHM were significantly lacking. What results would that give? Why was this not enough? Were there just not enough identified CALHM orthologues to give strong signals in conservation analysis?
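As an illustration of what a "previous" or less-involved baseline might look like, per-column Shannon entropy over a multiple sequence alignment is one common conservation score. The sketch below is offered as an assumption about a reasonable comparison, not as the method used in the manuscript; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps.

    Returns None when the column contains only gaps.
    Lower entropy means stronger conservation.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return None
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Entropy for every column of an alignment of equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment of four placeholder sequences, three columns each.
toy_alignment = ["MKV", "MRV", "MKV", "M-V"]
profile = conservation_profile(toy_alignment)
print(profile)
```

Running such a baseline on only the previously known CALHM orthologues, and then on the expanded set, would show whether the larger dataset is what makes the co-conserved positions detectable.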
Some of the analysis pipeline is unclear. Specifically, the RAG analysis seems critical, but it is unclear how it works: is it built on top of the GPT framework, and does it recursively query the answers to prompts? Some example prompts would be useful to understand this. In addition, the existence of 76 auxiliary non-pore-containing 'ion channel' genes in this analysis is a little confusing, as part of the pipeline appears to look for pore-lining residues. How many of these are picked up in the larger orthologue search? Is it harder to check that these are indeed ion channel genes? A further discussion of the choice to include these auxiliary sequences would be relevant. This could simply be a discussion of the prior literature that has made the same choice.
Overall, this manuscript is a valuable contribution to the field, but it requires a few main things to make it truly useful. Namely, how has this approach really improved the ability to identify conserved residues over a less-involved approach? A better description of their methods and results is required in the first section of the paper, as well as some cosmetic improvements.