    1. eLife Assessment

      This revised manuscript retreats from the original claim of establishing a causal link between cardiolipin deficiency and the progression from steatotic liver disease to steatohepatitis and instead advances a more limited mechanistic conclusion: that cardiolipin deficiency perturbs electron transport and promotes electron leak from the mitochondrial respiratory chain. The experimental evidence supporting this revised claim is now solid, and the potential for increased electron leak to contribute to liver pathophysiology is demonstrated. However, absent evidence that cardiolipin deficiency is causally upstream of disease progression, the overall significance of the work remains limited. While the study provides a convincing analysis of mitochondrial bioenergetics, the narrowing of its central claim diminishes its impact relative to that proposed in the original submission.

    2. Joint Public Review:

      Cardiolipin is a key lipid constituent of mitochondrial membranes. Perturbation of its abundance is thus poised to affect broad aspects of mitochondrial function. Given the important role of mitochondria, it is not surprising that cardiolipin deficiency would have pervasive effects on cell physiology.

      The original version of this paper advanced the idea that cardiolipin deficiency, and the attendant mitochondrial dysfunction, plays a causative role in the progression of fatty liver (a common feature in the human population) to a more pathogenic inflammatory state known as steatohepatitis. Given the prevalence of this form of liver disease in the human population, this claim for discovery was deemed sufficiently interesting to merit peer review at eLife.

      Peer review reaffirmed the importance of the claim but also revealed important limitations in the experimental support provided. Specifically, the study lacked experimental interventions that uncouple the correlation between progression in a mouse model and changes in cardiolipin abundance, which are needed to test the causal relationship. The review process also recognised the utility of other aspects of the paper, namely the evidence implicating cardiolipin deficiency in altered properties of the mitochondrial membrane, its contribution to an electron leak and the potential for these features to contribute to pathology.

      The revised version of the manuscript now focuses on the importance of cardiolipin sufficiency to mitochondrial integrity and contains various improvements to the data supporting this aspect. At the same time the revised paper retreats from the most interesting claim of a causal role for cardiolipin deficiency in disease progression. We are left with a more convincing but less significant paper.

    3. Author response:

      The following is the authors’ response to the original reviews.

      As the reviewers noted, the strongest evidence we provide is for the mechanistic link between hepatic cardiolipin deficiency and electron leak from the electron transport chain. This narrative is supported by our assessment of site-specific electron leak as well as by reconstitution of exogenous cardiolipin, delivered in small unilamellar vesicles, into CL-deficient mitochondria. On the other hand, as pointed out by Reviewer 2, the mechanistic link between cardiolipin and MASLD/MASH is less robust. At this moment, we have not experimentally demonstrated that the MASLD/MASH induced by CLS deletion can be rescued by replacement of mitochondrial CL in vivo. Taken together, our current narrative forms an incomplete loop between CL deficiency, electron leak, and MASLD/MASH. Nevertheless, as indicated by all the reviewers, this manuscript highlights a previously undescribed role that CL potentially plays in MASH pathology, particularly given the data showing that human MASH coincides with a reduction in liver mitochondrial CL. We focused this revision primarily on additional descriptive experiments in CLS-LKO mice that were requested by the reviewers. Although it is not a component of the current manuscript, we have recently developed mice with hepatocyte-specific CLS overexpression and have begun experiments to test the causality of CL deficiency in MASLD/MASH, which we hope to complete in a few years. We are hopeful that the MASLD/MASH research community will still find the evidence on CL contained in this manuscript plausible, and that it provides critical information for our understanding of the mechanisms of MASH pathogenesis.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Brothwell and colleagues describes a central role for hepatic cardiolipin deficiency in MASH. The authors identify cardiolipin as a mediator of two long-standing problems in the field: how dysregulated lipid metabolism relates to altered mitochondrial metabolism during MASLD, and what the innate changes are in the steatotic liver that cause the increased respiration. The authors identified reduced liver cardiolipin in humans with MASH and in a variety of mouse models with MASH. When they knocked out hepatic cardiolipin synthesis, mice developed steatosis and inflammation. These mice also recapitulated the elevated hepatic oxidative metabolism and oxidative stress found in obese humans with MASLD. Some of the in vivo functional data related to glucose homeostasis and substrate metabolism could be stronger, and interpretation of the in vitro flux data needs some clarification, but in both cases, the data are not essential to the main conclusions of the manuscript. Overall, the study offers compelling evidence that cardiolipin is reduced in MASLD and that impaired cardiolipin synthesis is sufficient to recapitulate many features of MASLD.

      We thank Reviewer 1 for the positive feedback emphasizing the novel and important findings in our manuscript.

      Strengths:

      The main strengths of the study are:

      (1) The identification of reduced cardiolipin levels in the liver of humans with MASLD and in a variety of mouse models of MASLD.

      (2) The finding that loss of cardiolipin synthesis recapitulates steatosis and inflammation in MASH.

      (3) The finding that loss of cardiolipin increases mitochondrial respiration, ROS production, and fat oxidation (in a separate hepatocyte cell line), which again recapitulates several previous studies in obese humans with MASLD.

      (4) Evidence, though less definitive, that cardiolipin deficiency promotes electron leak by disrupting respiratory supercomplexes and preventing CoQ reduction.

      Weaknesses:

      (1) Figure 3A-D tries to make the point that liver CLS KO causes defects in substrate handling in vivo, based on glucose and pyruvate tolerance tests. The KO mice have a blunted response to a glucose tolerance test, but the pyruvate tolerance test showed very little (almost no) effect on glucose levels in either WT or LKO mice. The small blunting of the response in the LKO is impossible to interpret (if it's real), since the ability to clear glucose is also increased, and no tracers were used. It might be useful to monitor pyruvate and lactate levels during the experiment. However, this reviewer doesn't think the data is essential to prove the authors' main points.

      Thank you for pointing this out. We have now revised our manuscript to correctly reflect our findings on GTT and PTT. In our initial submission, we failed to clearly articulate that CLS deletion appeared to increase systemic glucose handling, which is the opposite of what one might expect in liver with steatosis. We agree that additional experiments would be helpful to better understand the systemic substrate handling in the CLS-LKO mice. As the reviewer indicates, we decided to focus this particular manuscript on intracellular and mitochondrial metabolism because of cardiolipin’s known localization to mitochondria, and the central role that this organelle plays in the pathogenesis of MASLD.

      (2) After presenting convincing evidence that respiration is elevated in isolated mitochondria from CLS KO liver, the authors follow up the findings by investigating whether 13C-palmitate and 13C-glucose oxidation are altered by CLS knockdown in murine Hepa1-6 cells (Figure 4).

      A few comments are worth mentioning about Figure 4:

      (a) It is not clear why the authors chose to use a hepatoma cell line rather than primary hepatocytes from LKO mice. The latter would be more convincing, since there could be important differences in metabolism between hepatoma cells and hepatocytes (e.g., preference for fatty acids vs glucose). Nevertheless, I think the approach is sufficient to test the general effect of loss of CLS on substrate metabolism.

      We appreciate the sentiment and agree that primary hepatocytes would have been a better model. We simply do not have prior expertise in culturing primary hepatocytes and do not have the system working. We completely agree that it’s important to discuss the limitations of Hepa1-6 cells as a hepatoma cell line, and we now discuss this in our manuscript.

      (b) The authors use the M+2 enrichments of TCA cycle intermediates to infer rates of oxidation of [U-13C] palmitate or [U-13C] glucose. It is important to note that this kind of data reports fractional carbon sources (i.e., substrate preference) rather than rates of oxidation. For example, data from the 13C-palmitate experiment indicates that the CLS KD cells increase the fractional contribution from 13C palmitate (compared to glucose, for example) to the TCA cycle, but the actual rate of palmitate oxidation is not implicit in the data. However, it is reasonable to suggest that, in combination with the increased rates of O2 consumption observed in isolated mitochondria, this data supports increased fat oxidation.

      We agree with the reviewer that the nuances are important: M+2 enrichments from [U-13C] palmitate or [U-13C] glucose reflect the fractional contributions of labeled substrates to the TCA cycle rather than rates of oxidation. We have now revised the text to clarify that the data represent carbon incorporation patterns.
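The distinction raised in comment (b) can be illustrated with a minimal sketch. It assumes isotopologue signals keyed by mass shift (0 for M+0, 2 for M+2); the function names and the citrate values are hypothetical and are not data from the manuscript, and natural-abundance correction is omitted for brevity. The sketch only shows that a fractional enrichment is normalized to the pool total, so it carries no information about pool size or flux on its own.

```python
def m2_fraction(intensities):
    """Share of a metabolite pool carrying two 13C atoms (M+2).

    `intensities` maps mass shift (0 for M+0, 2 for M+2, ...) to signal.
    """
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("metabolite pool has no signal")
    return intensities.get(2, 0.0) / total


def labeled_signal(intensities):
    """Absolute M+2 signal, which scales with pool size."""
    return intensities.get(2, 0.0)


# Hypothetical citrate isotopologue signals (arbitrary units).
small_pool = {0: 30.0, 2: 10.0}
large_pool = {0: 300.0, 2: 100.0}

# Both pools have the same M+2 fraction even though the labeled
# signal differs tenfold between them.
print(m2_fraction(small_pool), m2_fraction(large_pool))
print(labeled_signal(small_pool), labeled_signal(large_pool))
```

Two samples with identical M+2 fractions can therefore differ widely in the amount of labeled carbon they contain, which is why an independent measurement such as oxygen consumption is needed before inferring a change in oxidation rate.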

      (c) I have some concern that the [U-13C] glucose experiment is more complicated to interpret than the description implies. I'm not sure what happens in this cell line, but in the liver, most labeling from pyruvate (i.e., originating from glucose in this case) enters the TCA cycle via pyruvate carboxylase, with smaller amounts entering via PDH (depending on the nutritional state). Since one could expect pyruvate carboxylase to contribute M+3 labeled TCA cycle intermediates initially, and M+2 on the first turn of the cycle, it's hard to conclude what the data indicates about glucose oxidation. The authors could generalize the conclusion by framing the TCA cycle enrichment data as the contribution of glucose carbons and noting in Figure 4A that pyruvate carbons can enter the TCA cycle via PDH or pyruvate carboxylase, without attempting to assign their relative contributions. There are better ways to do it, but it's a small nuance here since the authors aren't making a critical point about the pathways.

      This expert comment is much appreciated. We have revised the text to more broadly describe glucose carbon entry into TCA cycle through PDH and PC. We also revised the schematic to reflect this notion.

      Reviewer #2 (Public review):

      In this study, the authors show that alterations in the lipid composition of the inner mitochondrial membrane, particularly changes in cardiolipin (CL) content, lead to defects in electron transport, supercomplex formation, and oxidative stress. Using liver-specific CLS knockout mice, which are characterized by dysfunctional capacity for cardiolipin synthesis, the authors highlight an underappreciated role for CL in MASH pathology. Overall, this is an interesting study highlighting the importance of functional/physiological electron transport (and in this context, electron leakage) in MASH pathophysiology. Despite that, this manuscript has several weaknesses that require attention.

      We thank Reviewer 2 for the constructive criticisms and for identifying areas of weakness where additional data or explanations can improve the manuscript.

      (1) For all LKO studies, it is stated that the decrease in hepatic CL is causal for the observed phenotype. However, it is evident that many other lipids are impacted by CLS KO, including a marked increase in hepatic PG. In this respect, the authors show no evidence that the observed metabolic phenotype is indeed due to the reduction in CL and not to other accompanying changes.

      Thanks for this comment. We agree that because deletion of CLS promotes changes in mitochondrial lipids other than CL, we cannot conclusively attribute the phenotypes we observed to CL and not to other lipids such as PG. In our experience, rescuing mitochondrial phospholipids by exogenous supplementation is problematic: the lipids are almost certainly not delivered exclusively to the tissue or organelle of interest, and they are often metabolized to produce other lipids, making the data difficult to interpret. We now have mice that conditionally overexpress CLS, which could be used to address this question, but that study is in its early phase and is outside the scope of the current study.

      The one experiment we performed is the ex vivo CL supplementation by SUV fusion to mitochondria, which is able to rescue electron leak. While this experiment does not demonstrate the role of CL in all phenotypes found in the CLS-LKO mice, we think that the bioenergetic phenotype associated with CLS deletion is therefore likely due to the reduction in CL. We now provide these additional discussions in lines.

      (2) In the results, the authors highlight that 'MASLD has been shown to alter the total cellular lipidome in liver.' Given that this study focused on CL, it would be useful to include specific studies that pointed to changes in hepatic CL content in MASLD/MASH/fibrosis.

      We now provide citations for these studies (PMID: 30042157, PMID: 34257827).

      (3) The initial human mitochondrial lipidomics studies show a reduction in mitochondrial CL and PG content. What was the content/expression of CL synthase and PGP synthase in these samples? If this cannot be assessed, is there any association of CLS or PGPS expression and MASLD/fibrosis (etc) in publicly available databases (e.g, GEP liver) that may explain the reduction in mitochondrial PG and CL content?

      Thanks for this suggestion. Quantification of the mitochondrial lipidome requires a good amount of tissue, and we do not have sufficient biomaterial left to quantify gene expression. In our survey of publicly available databases (including GepLiver), we did not find that human MASLD was associated with an increase in CLS or other enzymes of CL biosynthesis compared to healthy controls.

      (4) The validation of MASH in patients (Figure 1B) is not convincing (ie., no quantification/scoring provided). NAS /fibrosis scoring (according to Kleiner) would help to define if all patients have indeed MASH, and what subset has fibrosis. Could the reduction in CL/PG content be (also) associated with fibrosis? In addition, Masson's Trichrome should be added to Figure 1B.

      The diagnosis was based on obvious bridging fibrosis and/or regenerative nodules on H&E staining (see additional zoomed-out images in Figure 1 – figure supplement 1). Due to the severity of these cases, formal NAS scoring was not applied. We do not have Trichrome staining available, but all MASH samples had fibrosis. Thus, it is possible that reduced CL/PG is related to fibrosis. We have now added more description on this point.

      (5) In human lipidomics, the authors suggest that reductions are observed in tetralinoleoyl CL (Figure 1C). However, Figure 1C only shows the combined FA acyl chain length + unsaturation, therefore not allowing for FA-specific ID (unless such data are available from the LC/MS analysis).

      Thanks for pointing this out. Per lipidomic nomenclature guidelines, we assign combined FA acyl chain length + unsaturation when MS2 is not performed. We have validated that our 72:8 peak corresponds to TLCL, but we do not perform MS2 on every lipid species for every sample. We now clarify this point in our manuscript.

      (6) Figures 1 J/K/I. It is obvious that the background in all murine immunoblotting analysis has been altered. The authors should provide unaltered images for these immunoblots.

      We apologize for the confusion. In Figure 1J/K/L/M, each panel actually represents two western blots (not one, similar to Figure 3H). The top represents a western blot with the OXPHOS antibody cocktail (CV, CIII, CIV, CII, and CI), while the bottom represents a second western blot with citrate synthase (CS). Thus, we had not manipulated parts of the western blot to look different. To clarify, we now place an outline around each western blot to clearly demarcate the individual blots (new Figure 1J-M).

      (7) For Figure 1, it is unclear what is meant by 'we performed all mitochondrial lipidomic analyses by quantifying lipids per mg of mitochondrial proteins'. Was the murine lipidomics carried out on fractionated mitochondria or whole liver? If whole liver, then how were the data corrected, particularly given that PG is not a mitochondria-specific lipid?

      The data are all from lipidomic analyses performed in isolated mitochondria.

      (8) While total CL content seems indeed decreased across the different mouse models, this is mostly due to 1-2 CL species showing a pronounced reduction, with the remainder being unaltered. This should at least be acknowledged in the results. This is similarly the case in the LKO livers.

      Thanks for pointing this out. We now provide additional clarification in the text.

      (9) Figure 2. A secondary biochemical analysis of changes in lipid content should be provided, e.g., total triglyceride content, particularly given that the histology analysis does not show any major changes in hepatic lipid droplets/steatosis. In addition, the Masson's Trichrome staining shows almost no collagen deposition.

      We now provide a quantification of triglycerides in Figure 2J.

      (10) Figure 3. 'CLS deletion modestly reduced glucose handling' should be reworded. The LKO mice show improved glucose tolerance (despite the MASH phenotype), which is not evident from the above wording.

      We modified our text accordingly.

      (11) Looking at the mechanism behind the increase in hepatic steatosis, the authors state that lipid accumulation can occur due to increased lipogenesis, or dysfunctional VLDL secretion or beta oxidation, and subsequently assessed the relevant proteins/pathways. What about fatty acid uptake, which is also one of the four major pathways impacted in MASLD? This should be included in this assessment in Figure 3.

      Thank you for this comment. We now provide data for genes involved in fatty acid uptake, which was not reduced with CLS deletion (Figure 3E).

      (12) For Figure 5A, it is simply stated 'CLS deletion promotes liver fibrosis in standard chow-fed condition', and it is unclear what is highlighted within the selected EM images and what the arrows refer to. The authors should clarify this within the text.

      We have modified the text accordingly.

      Reviewer #3 (Public review):

      Summary:

      Mitochondrial oxphos causes lipid accumulation, leading to MASH, although the mechanism has been poorly understood. In this study, Funai and colleagues identify that reductions in cardiolipin in the mitochondria cause disruptions in the electron transport chain. Knockout of cardiolipin synthase was sufficient to drive MASH phenotypes, increase respiratory capacity, and cause electron leak at complexes II and III. It is well established that loss of cardiolipin increases ROS. Studies to date have been performed on whole tissue lysates, but to rule out which changes in mitochondrial lipids are driven by changes in mitochondrial number versus lipid synthesis/turnover, the authors uniquely purified mitochondria from human and mouse livers in MASH and NASH models for this study. This study provides critical information to the field that will inevitably help us better understand the mechanisms underlying MASH and NASH onset. The evidence provided is both convincing and compelling. With further suggested revision experiments, this study has the potential to change our understanding of MASH and NASH pathogenesis.

      We would like to thank Reviewer 3 for the highly encouraging feedback.

      Strengths:

      The authors use a unique approach of lipidomics on purified mitochondria. They also analyze many distinct MASH models and provide a unique resource for the field of comprehensive lipidomics analysis of the different ways in which MASH can be induced. The use of human tissue elevates the impact/significance of the findings.

      Weaknesses:

      The data on the super complexes was the least compelling, and frankly, I do not think the authors needed those data to make a compelling argument! The authors should shift their focus more to the compelling electron leak data they have collected. If possible, it would also strengthen the work to include cardiolipin rescues on more of the experiments. Finally, expanding their explanations of the model systems would be very helpful for the readership.

      Thank you for this comment. We have now revised our argument to highlight the electron leak data and place less emphasis on the supercomplexes.

      Reviewer #4 (Public review):

      Summary:

      Here, the authors wish to shed light on factors that contribute to the development of liver disease in what used to be called 'the metabolic syndrome'. This is a human-health problem of considerable significance, and the insights they provide, namely the implication of a defect in mitochondrial cardiolipin (CL) content to the progression from metabolic dysfunction associated steatotic liver disease to steatohepatitis, are plausible.

      We would like to thank Reviewer 4 for the encouraging feedback.

      Strengths:

      The experimental evidence proffered is derived from the observation of lower levels of (CL) in mitochondria from the liver of patients undergoing liver transplant or resection due to endstage steatohepatitis compared with mitochondria derived from livers of patients with other conditions. This correlation is buttressed by observations made in mice with liver-selective compromise in CL synthesis and which suggest a pathological environment associated with mitochondrial dysfunction and enhanced oxidative stress, features deemed to play a role in the progression from steatotic liver disease to steatohepatitis.

      The paper is well written, and the findings are well explained and superficially convincing.

      Weaknesses:

      It is unclear how much can be learned from compromising a key enzyme that produces a key mitochondrial lipid in a busy metabolic organ like the liver - isn't the discovery of a mitochondrial defect in such a context rather trivial? And how reliably can these findings be related to the human observations? Most importantly, the chain of causality implied by the title is unproven: the key question of whether or not (somehow) preventing the drop in cardiolipin content affects the course of steatohepatitis remains unanswered.

      We agree with the reviewer that the current manuscript does not directly provide evidence that a reduction in CL causes MASLD in humans, which, as the reviewer describes, must be tested by rescuing CL content in the context of MASLD. We have now obtained conditional CLS-overexpressing mice and have begun the experiments, but findings from these mice are beyond the scope of the current study. We have modified our title to “Cardiolipin deficiency disrupts electron transport chain AND drives steatohepatitis” to reduce the implication of causality.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript states that loss of mitochondrial respiration is expected in MASLD. For example, line 187 "MASLD is known to be associated with reduced mitochondrial oxidative capacity". A more accurate statement is that "MASH" is known to be associated with reduced mitochondrial oxidative capacity and increased ROS production in humans. As you correctly cite later for an ex vivo human mitochondrial respiration study, early MASLD, especially with obesity, is associated with elevated mitochondrial respiration (40). Since those measurements are maximal respiration rates, which might not reflect actual in vivo flux, you might also make readers aware that your data is consistent with in vivo human studies that found increased hepatic oxidative flux (TCA cycle flux) in obese subjects with moderate steatosis (PMID: 22152305), which appears to wane with severe steatosis and/or inflammation (PMID: 31012869, PMID: 40272888).

      Thank you for these suggestions. We have made the suggested changes to the text.

      Reviewer #3 (Recommendations for the authors):

      (1) Throughout the manuscript, the authors refer to the inner mitochondrial membrane, although they never perform assays to distinguish the inner vs outer mitochondrial membrane. It would be better to just refer to the cardiolipin being measured as "mitochondrial."

      Thank you. We made these changes.

      (2) In figures showing changes in cardiolipins, not all of them change; only a handful of them are reduced in NASH. Could the authors add commentary in the manuscript about what is known about these different cardiolipin species, and speculate as to why certain CLs are changing while others are not?

      Thank you. Reviewer #2 had similar comments and we provided additional discussions.

      (3) In the human tissues, what do the other mitochondrial inner membrane lipids (PC, PE, PI, PS, LPC, LPE) look like in the healthy vs NASH patients (Figure 1A-D)?

      Thank you for this request. We did not include these data in the manuscript as we have a separate ongoing study (the second author is the lead author on this paper) where we are following up on hepatic mitochondrial PS and PE, which we found to be decreased in human MASH samples compared to healthy livers. This turned out to be a convoluted story so we decided not to include it in the paper.

      (4) The descriptions of the different MASLD/MASH models are a little sparse. Especially needing more detail is the model for carbon tetrachloride injection, causing NASH. The authors should explain how each of these models typically induces MASLD/MASH.

      We now provide these details.

      (5) In figures 2E and F, total body mass is unchanged in CLS-LKO mice, but liver mass is decreased; yet on the chow diet, there appears to be lipid accumulation in the liver as well; I am wondering what the authors' reasoning is for this decreased liver mass.

      It is difficult to say conclusively, but we suspect it is due to cell death evidenced by fibrosis. It’s important to note that while there is lipid accumulation in the liver, steatosis is relatively mild and the increase in liver triglyceride is quite marginal (Figure 2J).

      (6) The lipidomics analysis and comparison of livers in these different models is a wonderful dataset that needs far more depth in terms of unpacking and describing the findings. For example, all the models of MASH show similar changes in most of the lipid species analyzed. NASH appears to be quite different than MASH. This, among other trends, is certainly worth highlighting as it will be of interest to the field.

      Thanks for this comment. We agree that while the CL phenotype was common to mouse and human MASH samples, there were changes in other lipids that may be biologically significant. As described above, we have an ongoing study pursuing mitochondrial PS in the liver.

      (7) Figure 2B - It is interesting that the CLS KO only impacts certain CLs. The 72:8 CL, which is regulated by CLS, is also a CL that appears to change in the human patient samples. The information on the specific CL that is changing seems critical to the mechanism of the role of the CL in the disease. Throughout the manuscript, it is important to specify which specific CL is being referred to, instead of broadly characterizing the changes to cardiolipins, especially since most of the cardiolipins shown do not change; only a handful of them do.

      Thank you for this suggestion. We have included additional discussions on 72:8 CL in the manuscript.

      (8) One potential non-specific mechanism whereby CLS knockout can cause MASH would be if the mice change their overall food consumption. It is an important control to test if the total food intake is different in WT vs KO mice to formally rule out this possibility.

      Food intake was not different between the groups (Figure 2E).

      (9) To determine the extent to which de novo cardiolipin synthesis underlies the change in MASH/fatty liver observed in the HFD, GAN, and CCl4 models in Figure 1, the authors should also put the CLS KO mice on these diets and perform liver histology, analysis of inflammation markers, and analyze immune cell infiltration. Alternatively, the authors could try to rescue the CLS KO model by supplementing cardiolipin in the diet or by injection.

      Thank you. We have an ongoing experiment to examine the effect of hepatocyte-specific CLS overexpression on protection from GAN-induced MASLD.

      (10) Figure 3F shows a decrease in UQCRC2 by RNA but no change at the protein level in Figure 3H. The authors should comment a bit more on this disparity, and the data in Figure 3F don't mean much for the main point of the study if the levels of the proteins are unchanged.

      The reviewer is correct. We initially performed RNAseq to broadly capture how CLS knockout influences liver health, which suggested that the transcriptional program for mitochondrial proteins was downregulated. Nevertheless, gold-standard measurements of mitochondrial content (mitochondrial protein or mtDNA) did not show a change in abundance with CLS deletion.

      (11) The increase in respiration and spare respiratory capacity upon CLS KO shown in Figure 3J is extremely interesting! The explanation of the experiment and its meaning should be significantly expanded upon.

      Thank you. We included additional discussion on this point.

      (12) Figure 4 - It is interesting that the fraction of the TCA cycle metabolites labeled is increasing with the palmitate tracer and decreasing with the glucose tracer. This implies a "fuel switch," such that more of the TCA cycle carbons originate from fatty acids than glucose upon loss of CLS. The authors should make note of this point. Also, to understand if the total molar quantity of labeling in the TCA cycle from palmitate and glucose is changing, the authors should also report the relative abundance (instead of just the fraction labeled) of the labeled metabolites and unlabeled metabolites.

      Thanks for this suggestion. We have now added this discussion.

      (13) In Figure 5C-F, the authors show that CLS deletion can activate the caspase pathway, but do not see any change in cytochrome c localization. Can the authors clarify if CLS deletion is sufficient to induce apoptosis?

      CLS deletion certainly causes cell death that induces tissue fibrosis. Activation of the caspase pathway suggests that the cell death may be due to apoptosis, but we did not see changes in cytochrome c localization. Our lab is currently performing additional experiments to test the possibility that CLS deletion may induce ferroptosis.

      (14) Figure 6A-C- The authors discuss the I + III2 + IV supercomplex substantially and consistently decreasing in the CLS-KO mice, however, the quantifications do not look statistically significant. Can the authors confirm if these changes are or are not significant and adjust the text accordingly?

      The reviewer is correct. Abundances of I+III2+IV supercomplexes are decreased in CLS-LKO mice compared to control mice when quantified with the supercomplex antibody cocktail or with the UQCRSF1 (complex III subunit) antibody, but not with complex I antibodies. The reason for this discrepancy is not entirely clear, but it is likely a combination of antibody sensitivity and the difficulty of solubilizing high-molecular-weight protein complexes.

      (15) The most compelling data to indicate electron leakage increasing upon CLS knockout is in Figures 7A-E. I would suggest the authors decrease their emphasis on the rearrangement of the supercomplexes and focus their discussion on the very compelling results of Figure 7.

      Thanks for this suggestion. We have modified our text.

      (16) Figure 7D shows that a major site of electron leak is from site II, and these results also fit with the profound succinate-induced respiration observed in earlier experiments. It would be nice if the authors could test the ability of cardiolipin to rescue these phenotypes, similar to the assay in Figure 5I. Assessing this rescue on the CoQ redox state would also strengthen the claims.

      Thank you for this comment. We are encouraged by your suggestions. We thought about this quite extensively during the preparation of the manuscript, but we refrained from making conclusive statements regarding complex II because the increase in electron leak is equally large at complexes II and III. It’s true that CLS deletion increases succinate-induced respiration, but this might also be because succinate elicits the highest increase in respiration even in wildtype mice (see values in Figure 3K and L compared to other substrates). It would be intriguing to examine the influence of CLS deletion on complex II/III electron leak as well as succinate-induced respiration in tissues where succinate is not a preferred substrate. We attempted cardiolipin rescue in SUV but, unfortunately, could not get this assay to work for site-specific electron leak measurements.

      (17) In Figure 7G-H, it would be nice to see a ratio of oxidized to reduced CoQ, in the CLS deletion mice and in human NASH livers, if samples are available.

      Thanks for this suggestion. Data shown (Figure 7- figure supplement 1P-S).

      (18) CoQH2 can also deliver electrons to complex II (via its reversal). Complex II shows a remarkable contribution to the electron leak phenotype (Figure 7D). Also, the complex II monomer showed much larger changes in the native gels of Figure 6 than the complexes involving complex III. A more likely model is that oxidized CoQ accumulates in the CLS knockout model because of increased CoQH2 leak via complex II.

      Perhaps. We also thought about this, but we are not sure it fits with the observation that CLS deletion increases succinate-induced respiration, which suggests increased succinate-to-fumarate conversion, a notion that I am not sure can be congruent with increased CoQH2 reversal to complex II. Overall, I think we lack the tools or evidence to conclusively determine whether CLS deletion primarily acts on complex II or III. Nevertheless, we appreciate the reviewer’s enthusiasm on these topics as we perform additional experiments on the mechanism of interactions between CL and the ETC.

    1. eLife Assessment

      This important study assesses the portability of epigenetic clocks across ancestries, including in the context of accelerated aging in Alzheimer's Disease patients. It provides convincing evidence for population differences in age estimation accuracy across a variety of epigenetic clocks, driven in large part by continuous variation in ancestry. Given the accelerating use of epigenetic clocks across fields, this study is likely to be of interest to researchers working on human genetic and epigenetic variation or who apply epigenetic clocks to diverse human populations.

    2. Reviewer #3 (Public review):

      The authors find that DNA methylation-based clocks are generally less accurate at predicting age in cohorts with large proportions of non-European (especially African) ancestry, compared to cohorts with high European ancestry proportions (which more closely reflects the genetic composition of individuals included in training sets). They provide evidence for this ancestry bias via ancestry-stratified analyses, and in analyses of continuous ancestry proportion effects on clock error. They then test two hypothesized underlying causes of ancestry bias: that ancestry-differentiated SNPs disrupt CpG sites preventing methylation, and that ancestry-differentiated SNPs influence DNA methylation levels. They find clear evidence especially for the second cause, in the form of meQTL that influence clock CpG sites and vary in frequency across ancestry groups. Finally, the authors provide key discussions of potential paths forward to alleviate bias and improve portability for future clock algorithms.

      The topic is timely due to the increasing popularity of DNA methylation-based clocks and the acknowledgment that many algorithms (e.g., polygenic risk scores) lack portability when applied to cohorts that substantially differ in ancestry or other characteristics from the training set. This has been discussed to some degree for DNA methylation-based clocks, but could of course use more discussion and empirical attention, which the authors nicely provide using an impressive and diverse collection of data. The inclusion of data from multiple cohorts, the analysis of ancestry as a continuous variable, and the attempts to address the underlying causes of ancestry-based differences in accuracy provide comprehensive evidence that genetic background influences clock portability.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Cruz-González and colleagues draw on DNA methylation and paired genetic data from 621 participants (n=308 controls; n=313 participants with Alzheimer’s Disease). The authors generate a panel of epigenetic biomarkers of aging with a primary focus on the Horvath multi-tissue clock. The authors find weaker correlations between predicted epigenetic age and chronological age in subgroups with higher African ancestry than within a subgroup identified as White. The authors then examine genetic variation as a potential source for between-group differences in epigenetic clock performance. The authors draw on a large collection of publicly available methylation quantitative trait loci datasets and find evidence for substantial overlap between clock CpGs located within the Horvath clock and methQTLs. Going further, the authors show that methQTLs that overlap with Horvath clock CpGs show greater allelic variation in African ancestral groups pointing to a potential explanation for poorer clock performance within this group.

      Thank you for this summary.

      Strengths:

      This is an interesting dataset and an important research question. The authors cite issues of portability regarding polygenic risk scores as a motivation to examine between-group differences in the performance of a panel of epigenetic clocks. The authors benefit from a diverse cohort of individuals with paired genetic data and focus on a clinical phenotype, Alzheimer’s disease, of clear relevance for studies evaluating age-related biomarkers.

      Thank you.

      Weaknesses:

      While the authors tackle an important question using a diverse cohort the current manuscript is lacking some detail that may diminish the potential impact of this paper. For example:

      (1) Information on chronological ages across groups should be reported to ensure there are no systematic differences in ages or age ranges between groups (see point below).

      Thank you for pointing out this omission. The distributions are now presented in Supplementary Figure 1. While there is some variation in median age, the age ranges are similar across cohorts (median 73.1 to 79.3). The small differences do not explain the differences in accuracy between the cohorts, e.g., the median age of the African Americans (76.4) is lower than the median age for the White cohort (77.7).

      (2) The authors compare correlations between chronological age and epigenetic age in sub-groups within this study to correlations reported by Horvath (2013). Attempting to draw comparisons between these two datasets is problematic. The current study has a much smaller N (particularly for sub-group analyses) and has a more restricted age range (60-90 yrs versus 0-100 yrs). Thus, is an alternative explanation simply that any weaker correlations observed in this study are driven by sample size and a restricted age range? Reporting the chronological ages (and ranges) across subgroups in the current study would help in this regard. Similarly, given the lack of association between AD status and epigenetic age (and very small effect in the white group), it may be of interest to examine the correlation between chronological age and epigenetic age in each group including the AD participants: would the between-group differences in correlations between chronological age and epigenetic age be altered by increasing the sample size?

      Our conclusions about the reduced accuracy of the clock in admixed individuals are based on the comparison within the MAGENTA cohorts, not a comparison of MAGENTA to previously published studies. We find significantly reduced accuracy in the admixed cohorts compared to the White MAGENTA cohort. To support this conclusion beyond the MAGENTA cohort, we analyzed three independent whole blood methylation datasets. Two focused on African American individuals—the Grady Trauma Project (n = 422) and the GENOA study (n = 1,394)—and one focused on White Swedish individuals (n = 729). As observed in MAGENTA, the Horvath clock had significantly lower accuracy for the African American cohorts (Figure 3) than for the White Swedish cohort.

      When comparing results across studies, the reviewer is correct that lower correlations are generally seen for older cohorts. Indeed, other studies applying the Horvath clock have seen similar correlations in older cohorts to those observed in MAGENTA (Marioni et al., 2015, Horvath 2013, and Shireby et al., 2020). We now also include the chronological age distributions of the cohorts in this study, along with their mean and standard deviations (Supplementary Figure 1). This shows that the distribution of chronological ages for White individuals is similar to the cohorts where the clocks did not perform as well. Finally, as suggested, we correlated chronological and epigenetic age with the inclusion of AD cases in each cohort for the Horvath clock. The significantly lower performance of the clock on Puerto Ricans and African Americans, relative to White individuals, remains even after including all individuals in each cohort. Thus, combining cases and controls did not qualitatively change the performance relationships for the African Americans and Puerto Ricans relative to the Whites (Supplementary Figure 3).

      (3) The correlation between chronological age and epigenetic age, while helpful is not the most informative estimate of accuracy. Median absolute error (and an analysis of MAE across subgroups) would be a helpful addition.

      We used correlation because it is commonly used to evaluate the performance of epigenetic age clocks, but we agree that other error quantification metrics provide a complementary perspective. We now include MAE and MSE comparisons across sub-groups in the revision (Supplementary Table 1). We find that across all accuracy metrics, the African American and Puerto Rican cohorts perform worse than the White and Peruvian cohorts. Interestingly, the Cubans show relatively high error despite a high correlation between predicted and chronological age. However, there are only 21 non-demented Cuban controls. In addition, we evaluated the same metrics in three replicate datasets (two African American cohorts and one for White Swedish individuals) and found the same patterns of lower accuracy across metrics in African ancestry individuals, albeit with some variation in accuracy between cohorts (Supplementary Table 2). Notably, as discussed above, this is not driven by differences in chronological age distributions: when we subset to older individuals (≥ 55 years old) in order to facilitate comparisons to MAGENTA study individuals, the median age for the White Swedish individuals (70 years old) is higher than that of the GENOA (62.7 years old) and Grady (58 years old) individuals. Despite the difference in median ages, the clock performs better on White Swedish individuals across all accuracy metrics than the African ancestry cohorts with younger individuals.
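
      As an illustration only (this is not the authors' code), the three accuracy metrics discussed above can be computed as in the sketch below. The function name `clock_accuracy` and the toy ages are hypothetical. The toy data carry a constant 10-year offset, which shows how a cohort can combine a high correlation with a high error, the pattern described for the Cuban cohort.

```python
import math
import statistics


def clock_accuracy(chronological, predicted):
    """Return Pearson r, median absolute error, and mean squared error.

    Hypothetical helper for illustration; inputs are equal-length
    sequences of chronological and clock-predicted ages in years.
    """
    if len(chronological) != len(predicted) or len(chronological) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    # Signed prediction error per individual (predicted minus chronological).
    errors = [p - c for c, p in zip(chronological, predicted)]
    median_abs_error = statistics.median(abs(e) for e in errors)
    mse = statistics.fmean(e * e for e in errors)

    # Pearson correlation computed from centered sums.
    mean_c = statistics.fmean(chronological)
    mean_p = statistics.fmean(predicted)
    cov = sum((c - mean_c) * (p - mean_p) for c, p in zip(chronological, predicted))
    var_c = sum((c - mean_c) ** 2 for c in chronological)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_c * var_p)

    return {"pearson_r": r, "median_abs_error": median_abs_error, "mse": mse}


# Toy data (not from the study): every prediction is exactly 10 years too old.
toy_ages = [60, 70, 80, 90]
toy_predicted = [age + 10 for age in toy_ages]
toy_metrics = clock_accuracy(toy_ages, toy_predicted)
```

      In this toy case the correlation is perfect while the median absolute error is 10 years, which is why correlation and error metrics are best reported together.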

      (4) More information should be provided about how DNAm data were generated. Were samples from each ancestral group randomized across plates/slides to ensure ancestry and batch are not associated? How were batch effects considered? Given the relatively small sample sizes, it would be important to consider the impact of technical variation on measures of epigenetic age used in the current study. The use of principal component-based versions of these clocks (Higgins Chen et al., 2023; Nature Aging https://doi.org/10.1038/s43587-022-00248-2) may help address such concerns.

      Thank you for pointing out the need for additional context on data generation. We have added details to the Methods. All omics data from the MAGENTA study were generated using standard protocols that ensure minimal technical artifacts and batch effects. Samples were randomized across plates and chips to ensure that ancestry, age, and sex were not confounded with each batch. We also performed a principal components analysis of the normalized methylation data used as inputs for all MAGENTA analyses. We found that the samples did not stratify by sample plate, cohort, ethnicity, or ascertainment center along the principal components (Supplementary Figure 2).

      We also thank the reviewer for their suggestion to apply the principal component clock to account for potential technical variation. As outlined in the new section “Principal component versions of the methylation clocks also have lower age prediction accuracy for genetically admixed individuals,” using the principal component version of the Horvath clock did not result in consistent improvement in age prediction accuracy or generalization across MAGENTA cohorts (Supplementary Figures 4 and 5). The lower accuracy for age prediction in individuals with substantial African ancestry was present for the PC clock in the replication cohorts, just as in the MAGENTA cohorts (Supplementary Figure 6).

      (5) Marioni et al., (2015) found a very weak cross-sectional association between DNAm Age and cognitive function (r∼0.07) in a cohort of >900 participants. Given these effect sizes, I would not interpret the absence of an effect in the current study to reflect issues of portability of epigenetic biomarkers.

      We agree that previous links between DNAm Age and AD or cognitive function have been relatively small in magnitude. For example, the PhenoAge paper (Levine et al., 2018) and a study using the Horvath clock (Levine et al., 2015) found age acceleration of less than a year in AD patients relative to non-demented individuals. Similar results have also been observed in studies with smaller sample sizes (e.g., 700 for Levine et al. 2015 and 604 for Levine et al. 2018). Given these small effect sizes, we agree that accounting for statistical power is essential for interpretation of our results. We performed power calculations based on an effect of the size observed in previous studies (0.5 year acceleration). We have 86% power in the full MAGENTA data set to detect an effect of this size. Stratifying by cohorts, we have 75% power for the African Americans, 72% for the Puerto Ricans, 72% for the Whites, 65% for the Peruvians, and 47% for the Cubans. Thus, we believe we have high enough power that the consistent lack of association outside of the White cohort in MAGENTA is likely meaningful. Based on these calculations, there is only a 1% chance that we would not observe an effect in any of the other cohorts if the effect was present across cohorts. Nonetheless, we have added caveats about power and the small sample size to our suggestion that the reduced accuracy of the clocks contributes to the lack of AD association outside of Whites.
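
      The joint-probability argument above can be checked with a few lines of arithmetic. The sketch below is illustrative and assumes the four non-White cohort tests are independent, which the response does not state explicitly; the function name `prob_all_miss` is hypothetical, and the power values are the ones quoted above.

```python
import math

# Per-cohort power values quoted in the response (non-White cohorts only).
cohort_power = {
    "African American": 0.75,
    "Puerto Rican": 0.72,
    "Peruvian": 0.65,
    "Cuban": 0.47,
}


def prob_all_miss(powers):
    """Probability that every test misses a true effect.

    Assumes the tests are independent, so the per-cohort miss
    probabilities (1 - power) multiply.
    """
    return math.prod(1.0 - p for p in powers)


p_miss = prob_all_miss(cohort_power.values())
print(f"P(no cohort detects a true effect) = {p_miss:.4f}")
```

      Under the independence assumption this product is roughly 0.013, in line with the approximately 1% figure given above.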

      (6) The methQTL analyses presented are suggestive of potential genetic influence on DNAm at some Horvath CpGs. Do the authors see differences in DNAm across ancestral groups at these potentially affected CpGs? This seems to be a missing piece (e.g., estimating the likely impact of methQTL on clock CpG DNAm).

      We agree. Thank you for this suggestion. We have added Figure 6 in the main text to address this gap. In short, we analyzed additional whole blood methylation data from individuals with African ancestry and found that a substantial proportion of the CpGs in methylation clocks are differentially methylated in African ancestry individuals relative to European ancestry individuals. In the case of the Horvath clock, we find that 84/353 (23.8%) of the clock CpGs are differentially methylated between ancestries. In parallel, we found that 56 of these differentially methylated clock CpGs are also affected by meQTL, many of which are at different frequencies between populations. We also investigated whether the meQTL-affected clock CpGs are associated with increased clock error in the MAGENTA individuals. We found 56 clock CpGs whose methylation levels associated with increased clock error, and 42 of these have at least one meQTL. Thus, while meQTL are not the only factor to affect the portability of methylation clocks across global populations, we suggest that they are a significant contributor, especially in the case of the Horvath clock.

      Reviewer #2 (Public review):

      Summary:

      This paper seeks to characterize the portability of methylation clocks across groups. Methylation clocks are trained to predict biological aging from DNA methylation but have largely been developed in datasets of individuals with primarily European ancestries. Given that genetic variation can influence DNA methylation, the authors hypothesize that methylation clocks might have reduced accuracy in non-European ancestries.

      Strengths:

      The authors evaluate five methylation clocks in 621 individuals from the MAGENTA study. This includes approximately 280 individuals sampled in Puerto Rico, Cuba, and Peru, as well as approximately 200 self-identified African American individuals sampled in the US. To understand how methylation clock accuracy varies with proportion of non-European ancestry, the authors inferred local ancestry for the Puerto Rican, Cuban, Peruvian, and African American cohorts. Overall, this paper presents solid evidence that methylation clocks have reduced accuracy in individuals with non-European ancestries, relative to individuals with primarily European ancestries. This should be of great interest to those researchers who seek to use methylation clocks as predictors of age-related, late-onset diseases and other health outcomes.

      Thank you for this summary.

      Weaknesses:

      One clear strength of this paper is the ability to do more sophisticated analyses using the local ancestry calls for the MAGENTA study. It would be valuable to capitalize on this strength and assess portability across the genetic ancestry spectrum, as was recently advocated by Ding et al. in Nature (2023). For example, the authors could regress non-European local ancestry fraction on measures of prediction accuracy. This could paint a clearer picture of the relationship between genetic ancestry and clock accuracy, compared to looking at overall correlations within each cohort.

      Thank you for this suggestion. To model portability across genetic ancestry as a spectrum, we regressed the Horvath clock error on the proportions of African ancestry in the genomes of the MAGENTA individuals, adjusting for chronological age. The proportion of African ancestry is significantly associated with increased Horvath clock error (p = 0.039), with the clock making less accurate age predictions by 1.46 years for individuals with full African ancestry compared to no African ancestry. We have added this new analysis to the Results.
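
      A minimal sketch of this kind of age-adjusted regression is given below. It is not the authors' analysis code: the function names and the synthetic, noise-free data are hypothetical, and the adjustment is done by residualizing both clock error and ancestry fraction on age, which gives the same ancestry coefficient as a multiple regression with an intercept, ancestry, and age.

```python
import statistics


def _residuals(y, x):
    """Residuals from a simple least-squares fit of y on x with an intercept."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return [yi - (mean_y + slope * (xi - mean_x)) for xi, yi in zip(x, y)]


def ancestry_effect(clock_error, african_fraction, age):
    """Coefficient on african_fraction in clock_error ~ 1 + african_fraction + age.

    Both variables are first residualized on age; the slope between the two
    residual series equals the age-adjusted ancestry coefficient.
    """
    error_resid = _residuals(clock_error, age)
    ancestry_resid = _residuals(african_fraction, age)
    numerator = sum(a * e for a, e in zip(ancestry_resid, error_resid))
    denominator = sum(a * a for a in ancestry_resid)
    return numerator / denominator


# Synthetic, noise-free data (not from the study): the error is built with an
# ancestry coefficient of 2.0 years, which the function should recover.
toy_age = [62, 70, 75, 81, 88]
toy_fraction = [0.0, 0.8, 0.1, 0.5, 0.2]
toy_error = [3.0 + 2.0 * f + 0.05 * a for f, a in zip(toy_fraction, toy_age)]
toy_coefficient = ancestry_effect(toy_error, toy_fraction, toy_age)
```

      With ancestry coded as a fraction between 0 and 1, the coefficient reads directly as the difference in clock error between full and no African ancestry at a fixed age.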

      The authors present two possible reasons that methylation clocks might have reduced accuracy in individuals with non-European ancestries: genetic variants disrupting methylation sites (i.e., ”disruptive variants”) and genetic variants influencing methylation sites (i.e., meQTLs). The authors conclude disruptive variants do not contribute to poor methylation clock portability, but the evidence in support of this conclusion is incomplete. The site frequency spectrum of disruptive variants in Figure 4 is estimated from all gnomAD individuals, and gnomAD is comprised of primarily European individuals. Thus, the observation that disruptive variants are generally rare in gnomAD does not rule them out as a source of poor clock portability in admixed individuals with non-European ancestries.

      In the revision, we now additionally report ancestry-specific allele frequencies to demonstrate the rarity of variants that disrupt clock CpGs (Supplementary Figure 9). The global allele frequencies were so low that even if all of these alleles occurred in individuals of non-European ancestries, they would still be extremely rare.
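
      The bounding argument can be made concrete with a short calculation. The sketch below is illustrative only: the function name `max_subgroup_af` and the example numbers are hypothetical and are not values from gnomAD or from the manuscript. If every copy of a variant were carried by one ancestry group, the frequency within that group would be the global frequency divided by the group's share of sampled chromosomes.

```python
def max_subgroup_af(global_af, subgroup_share):
    """Upper bound on a variant's allele frequency within one subgroup.

    global_af: allele frequency across the whole reference panel.
    subgroup_share: fraction of the panel's chromosomes from the subgroup.
    The bound assumes every copy of the allele sits in that subgroup.
    """
    if not 0 < subgroup_share <= 1:
        raise ValueError("subgroup_share must be in (0, 1]")
    if not 0 <= global_af <= 1:
        raise ValueError("global_af must be in [0, 1]")
    return min(1.0, global_af / subgroup_share)


# Hypothetical numbers: a variant at 0.01% globally, with a subgroup that
# contributes a quarter of the panel, is at most 0.04% within that subgroup.
example_bound = max_subgroup_af(global_af=1e-4, subgroup_share=0.25)
```

      The bound grows as the subgroup's share of the panel shrinks, so reporting ancestry-specific frequencies directly, as the revision does, is the more decisive check.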

      It is also unclear to what extent meQTLs impact methylation clock portability. The authors find that the frequency of meQTLs is higher in African ancestry populations, but this could reflect the fact that some of the analyzed meQTLs were ascertained in African Americans. The number of meQTL-affected methylation sites also varies widely between clocks, ranging from 6 to 271; thus, meQTLs likely impact the portability of different clocks in different ways. Overall, the paper would benefit from a more quantitative assessment of the extent to which meQTLs influence clock portability.

      We agree that the meQTL likely influence the clocks in different ways and that the ascertainment of the meQTLs in different populations makes direct comparisons challenging. To more directly link meQTL to clock performance, we identified 56 Horvath clock CpG sites whose methylation levels significantly associate with increased clock error in the MAGENTA study individuals. Of these, 42 (75%) are affected by an meQTL, including nine that are affected by an African ancestry-differentiated meQTL. As such, meQTL, and specifically meQTL that were likely not present in the training data of the Horvath clock, are associated with both the methylation of CpG sites and clock error. However, as the reviewer suggests, determining causality among these factors is challenging. Given our incomplete knowledge of meQTL in different ancestries, we have added caveats to our conclusions about the effect of meQTL on clock portability.

      The paper implies that methylation clocks have an inferior ability to predict AD risk in admixed populations relative to white individuals, but the difference between white AD patients and controls is not significant when correcting for multiple testing. This nuance should be made more explicit.

      We agree that the signal is not strong in the white cohort; however, it is similar in magnitude to previous studies. As outlined in response to Reviewer 1’s Point 5, we have now added power calculations that indicate reasonable power (≥72%) to detect small effect sizes (0.5 year increase) in the white, Puerto Rican and African American cohorts. We now interpret the AD association tests in the context of these power calculations and multiple testing correction.

      Finally, this paper overlooks the possibility that environmental exposures co-vary with genetic ancestry and play a role in decreasing the accuracy of methylation clocks in genetically admixed individuals. Quantifying the impact of environmental factors is almost certainly outside of the scope of this paper. However, it is worth acknowledging the role of environmental factors to provide the field with a more comprehensive overview of factors influencing methylation clock portability. It is also essential to avoid the assumption that correlations with genetic ancestry necessarily arise from genetic causes.

      We entirely agree and have now clarified the scope of our analyses and the importance of environmental factors in the revision. We intersected clock CpGs with environmental-factor-associated CpGs from multiple epigenome-wide association studies (EWAS) and found overlaps that suggest an environmental contribution to differences in clock CpG methylation. However, given the lack of environmental data on the MAGENTA study individuals, as well as the lack of datasets for replication, we cannot directly compare the environmental and genetic contributions to clock accuracy. Nevertheless, the new analyses in the revision highlight the contribution of both genetic and environmental factors to the lack of portability for certain methylation clocks.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 64: An association between methylation patterns and genetic ancestry does not presuppose that meQTLs vary in frequency between genetic ancestries; environmental factors could also play a role. It would be nice to comment on this further in the Introduction.

      We agree that environmental factors likely play a role in the decrease in methylation clock performance in admixed populations. We have added text highlighting this in the revised Discussion. Regarding meQTL, we agree that associations between methylation patterns and genetic ancestry do not necessarily imply that meQTL will vary in frequency between genetic ancestries. However, our new analyses in the revision find African-ancestry differentiated meQTL that associate with Horvath clock CpG methylation levels and overall clock error (Figure 6E-F and Supplementary Figure 13).

      (2) Line 116 implies Puerto Ricans have “substantial amounts of African ancestry” but the median ancestry is 15% (which is not much more than the Peruvian and Cuban cohorts).

      Thank you for pointing this out. We have clarified this statement in the text. While the median proportion of African ancestry in Puerto Ricans is 15% (vs. 6% and 2% for the Peruvian and Cuban individuals in MAGENTA), there are many individuals with substantially higher African ancestry. The upper quartile is >25% and several Puerto Ricans have >50% African ancestry.

      (3) In Figure 2B, Puerto Ricans have worse accuracy than Peruvians but a higher proportion of inferred CEU ancestry, which is interesting and defies intuition - is there any hypothesis for why this might be the case?

      In light of our new meQTL analyses, we hypothesize that the African ancestry differentiated meQTL that affect Horvath clock CpGs drive the increase in clock error for these individuals, despite having more European ancestry across their genome. Given that the Peruvians (and Cubans, for that matter) hold very little African ancestry, and also very few of the African-differentiated meQTL, this could explain some of the large difference in clock errors for the cohorts.

      (4) Figure 2C would be improved with confidence intervals.

      We thank the reviewer for this suggestion and have added confidence intervals for Figure 2C.

      (5) It’s interesting that the correlation with Cubans is positive in Figure 3B (for one clock, significantly so). Is there any rationale for this?

      We noticed this as well, but have not been able to come to a definitive conclusion. It is possible that environmental factors contribute. However, the Cuban cohort is the smallest in MAGENTA (22 cases and 21 controls) and none of the differences are statistically significant, so more investigation in a larger cohort is required.

      (6) Line 231: Which population(s) is allele frequency estimated in?

      This is the global frequency reported in gnomAD, which is calculated across all populations in gnomAD v3.0. As noted above, we now also report allele frequencies by gnomAD population (Supplementary Figure 9).

      (7) Were the meQTLs pruned? How many independent variants are there per methylation site? It would be nice to see a distribution for the sites in the Horvath clock.

      We now report the distribution of meQTL across clock CpG sites. The mean number of variants is 108; the median is 36; and the maximum is 1,699. We have now included a plot of the distribution for all 271 (out of 353) Horvath clock CpG sites (Supplementary Figure 14). We did not perform any pruning in these initial results for several reasons. First, we sought to demonstrate the great potential for meQTL to influence these CpGs and to compare the distributions of these common meQTL across populations (based on gnomAD data). Second, identifying the causal variant or variants is challenging. Given that many of these meQTLs likely reflect redundant signals, for the new analyses of African-differentiated meQTL, we restrict to a single variant per clock CpG site. We focus on the variant with the greatest absolute beta, as reported by the original meQTL study from which the variant originates.
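
      The one-variant-per-CpG restriction described above can be sketched as follows. This is an illustration rather than the authors' pipeline: the record layout (`cpg`, `variant`, `beta`), the function name, and the identifiers in the toy records are all hypothetical.

```python
def strongest_meqtl_per_cpg(meqtls):
    """Keep, for each CpG site, the meQTL with the largest absolute beta.

    meqtls: iterable of dicts with keys "cpg", "variant", and "beta".
    Ties keep the first record seen for that CpG.
    """
    best = {}
    for record in meqtls:
        cpg = record["cpg"]
        if cpg not in best or abs(record["beta"]) > abs(best[cpg]["beta"]):
            best[cpg] = record
    return best


# Toy records (hypothetical identifiers and effect sizes).
toy_records = [
    {"cpg": "cg_A", "variant": "var1", "beta": 0.12},
    {"cpg": "cg_A", "variant": "var2", "beta": -0.30},
    {"cpg": "cg_B", "variant": "var3", "beta": 0.05},
]
selected = strongest_meqtl_per_cpg(toy_records)
```

      Selecting on the absolute value keeps strong negative effects, such as `var2` here, which a plain maximum over beta would discard.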

      (8) Figure 5C might benefit from a geom density rather than overlapping bar plots; the trends are hard to see.

      We appreciate the reviewer’s suggestion and have now reworked the figure to show only the density curves so that readers may better appreciate the differences in allele frequencies.

      (9) Several figures would be more legible with larger font sizes.

      We appreciate this recommendation and have made the font sizes for all plots larger and more legible.

      Reviewer #3 (Public review):

      This manuscript examines the accuracy of DNA methylation-based epigenetic clocks across multiple cohorts of varying genetic ancestry. The authors find that clocks were generally less accurate at predicting age in cohorts with large proportions of non-European (especially African) ancestry, compared to cohorts with high European ancestry proportions. They suggest that some of this effect might be explained by meQTLs that occur near CpG sites included in clocks, because these variants may be at higher frequencies (or at least different frequencies) in cohorts with high proportions of non-European ancestry relative to the training set. They also provide discussions of potential paths forward to alleviate bias and improve portability for future clock algorithms.

      The topic is timely due to the increasing popularity of DNA methylation-based clocks and the acknowledgment that many algorithms (e.g., polygenic risk scores) lack portability when applied to cohorts that substantially differ in ancestry or other characteristics from the training set. This has been discussed to some degree for DNA methylation-based clocks, but could of course use more discussion and empirical attention, which the authors nicely provide using an impressive and diverse collection of data.

      Thank you for this summary.

      The manuscript is clear and well-written, however, some key background was missing (e.g., what we know already about the ancestry composition of clock training sets) and most importantly several analyses would benefit from being taken one step further. For example, the main argument of the paper is that ancestry impacts clock predictions, but this is determined by subsetting the data by recruitment cohort rather than analyzing ancestry as a continuous variable. Extending some of the analyses could really help the authors nail down their hypothesized sources of lack of portability, which is critical for making recommendations to the community and understanding the best paths forward.

      Thank you for this suggestion. As noted in our response to Reviewer 2’s Point 1, we have analyzed ancestry as a continuous variable and found that the proportion of African ancestry in the genomes of the MAGENTA individuals significantly associates with increased difference in chronological and predicted age, even after controlling for chronological age (1.46 years more error for 100% vs. 0% African ancestry; p = 0.039). As outlined below, we have also added details on the training of previous clocks and the important additional previous work highlighted by the Reviewer.

      Reviewer #3 (Recommendations for the authors):

      Major comments

      There is previous literature addressing who is in the training set for methylation clocks. To my knowledge, this work has been primarily led by Nancy Krieger. It would be a valuable addition to discuss her work (and any similar work by other investigations) in the introduction. In other words, what do we currently know about the degree of bias in the training sets for methylation-based clocks? The assumption of the introduction is that the training sets are overwhelmingly European ancestry (which I assume is true) but I think some quantitative information about this would be helpful for understanding the source and magnitude of the problem.

      We thank the reviewer for bringing the work of Dr. Nancy Krieger to our attention. It directly supports the rationale for this study: the sociodemographic characteristics of the individuals used to train these clocks are poorly reported, limited to outdated population descriptors (for example, the use of “Caucasians” to describe some of the individuals used to train the Horvath and the Hannum clocks) or race and ethnicity labels. Moreover, where labels are available for training individuals, they tend to underrepresent the individuals of diverse backgrounds, as in the Horvath clock. We have incorporated Dr. Krieger’s work into the Introduction, including details of how this supports the rationale and purpose of our study.

      Related to the above comment, there has been pretty extensive previous work on the effects of race and ethnicity on epigenetic clock estimates (e.g., https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1030-0), and that seems like it could be more explicitly weaved into the introduction and discussion.

      We thank the reviewer for highlighting this relevant article. We have added discussion of it into the Introduction. Several factors make direct comparison with our results challenging. First, the grouping of individuals based on race and ethnicity without consideration of genetic ancestry complicates comparisons. Race and ethnicity commonly do not match genetic ancestry components (see Gouveia et al., 2025 https://www.cell.com/ajhg/fulltext/S00029297(25)00173-9). Second, the study reports differences in epigenetic age accelerations (intrinsic and extrinsic) in individuals from various race and ethnic groups. It does not directly evaluate the accuracy of the epigenetic age predictions in these groups. Thus, it is challenging to interpret whether the differences in acceleration are driven by biological factors or biases in the performance of the clocks themselves.

      The main analysis that felt like it was missing was asking whether the age deviations are larger for individuals with greater proportions of African ancestry. The authors have the ability to analyze ancestry as a continuous variable, but instead performed analyses in various a priori subsets of the data; the subsets do have average differences in ancestry, but also there is heterogeneity within groups. Given that the authors calculated admixture proportions already, it seems like a missed opportunity not to use these estimates. This would also sidestep the issue of the problematic labels applied to the subsets, which mix ancestry, nationality, and race terms (note that I thought the legacy reasons why these labels are used were well-explained, but they are nevertheless problematic for biological explanations that center on ancestry/genetic information as the driver of bias).

      We appreciate the reviewer’s suggestion to investigate clock accuracy in the context of African ancestry proportions. As noted in the response to Reviewer 2’s Point 1, we modeled the clock error as a function of the fraction of African ancestry of each individual, adjusting for an individual’s chronological age. The proportion of African ancestry is significantly associated with increased Horvath clock error (p = 0.039), with the clock estimated to give less accurate age predictions by 1.46 years for individuals with 100% African ancestry compared to no African ancestry. We now report this in the Results.
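      The continuous-ancestry analysis described above can be sketched as a linear model of clock error on African ancestry fraction with chronological age as a covariate. This is a minimal sketch that assumes ordinary least squares; the exact model specification used by the authors is not given in the response. The function name `fit_clock_error` and the synthetic, noise-free inputs are hypothetical and serve only to show the structure of the fit; they are not MAGENTA data.

```python
import numpy as np

def fit_clock_error(clock_error, age, afr_fraction):
    """Least-squares fit of clock_error ~ intercept + age + afr_fraction.

    Returns (intercept, age_slope, ancestry_slope). The ancestry slope is the
    estimated change in error between 0% and 100% African ancestry,
    holding chronological age fixed.
    """
    age = np.asarray(age, dtype=float)
    afr_fraction = np.asarray(afr_fraction, dtype=float)
    design = np.column_stack([np.ones_like(age), age, afr_fraction])
    coef, *_ = np.linalg.lstsq(design, np.asarray(clock_error, dtype=float), rcond=None)
    return tuple(coef)

# Synthetic illustration only: arbitrary coefficients, no noise.
rng = np.random.default_rng(0)
age = rng.uniform(60, 90, size=200)
afr = rng.uniform(0, 1, size=200)
error = 2.0 + 0.05 * age + 1.0 * afr

intercept, age_slope, ancestry_slope = fit_clock_error(error, age, afr)
```

      On real data, the ancestry slope would be reported together with its standard error and p-value, which this sketch does not compute.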

      Another missed analysis opportunity occurs in lines 259-261, where the authors state “Thus, the clock with the largest decrease in performance in admixed cohorts (in terms of predicting chronological age and identifying age acceleration in AD) has the most and largest fraction of meQTLs influencing its CpGs.” This is another place where the authors make generalizations about a given cohort based on average ancestry rather than testing the claim empirically on an individual basis (e.g., by examining the number of meQTL variants a given individual is heterozygous for or has the non-European allele for).

      We thank the reviewer for this comment. This feedback motivated us to evaluate the relationship between differences in meQTL frequencies and methylation clock error. We found differences in meQTL frequency in the MAGENTA individuals; specifically, many of the clock CpG-affecting meQTL are most common in the African American cohort, consistent with our theory (Figure 6E,F). Nonetheless, there are 84 Horvath clock CpGs (24%) that are differentially methylated in AFR individuals, and 56 of these are affected by an meQTL, including 11 that are affected by an African ancestry-differentiated meQTL (Figure 6G). Finally, we find 42 Horvath clock CpG sites in MAGENTA individuals with methylation levels that are significantly associated with increased clock error and that are also affected by an meQTL (Figure 6B). However, at the individual level we do not find a clear relationship between the number of meQTL or ancestry-differentiated meQTL and methylation clock error. In light of these data, we have reframed our conclusions to state that meQTL likely contribute to clock error, while also being clear that they are not the sole cause.

      Can the authors explain or offer an investigation into why predicted age is often better in Cubans than Whites? They gave much attention to the opposite effect (of similar magnitude) in African Americans and Puerto Ricans but didn’t really discuss the surprisingly accurate prediction in Cubans.

      We did not focus on the results in the Cuban cohorts for several reasons. First, as discussed in response to Reviewer 2’s comment, the Cuban cohort had the smallest sample size (22 cases and 21 controls). Thus, while the correlation between methylation age and chronological age is similar to Whites, and in a few cases higher, the differences were not statistically significant. Second, looking at other error metrics, like mean absolute error, the clocks are comparatively less accurate in the Cuban cohort than in the White cohort (Supplementary Table 2). Finally, the clocks consistently find that Cubans with AD have lower predicted age than controls, though this is only significant for the ZhangEN clock. However, given these inconsistencies and the very small sample size, we caution against over-interpretation of these results. We clarify this in the manuscript and suggest that more work is needed on larger Cuban cohorts before any clear conclusions can be made.

      I was not a conceptual fan of the ensemble clock. The clocks are trained on very different things (e.g., chronological age versus clinical biomarkers) and are designed to capture different aspects of biology. Without more validation and motivation, I don’t think it makes sense to average values that are not designed to measure the same thing.

      We agree that combining the first and second-generation clocks for the task of age prediction is not sensible. However, for AD risk stratification, combining values from multiple clocks that capture different aspects of biology and aging could be beneficial. As mentioned in the main text, we took inspiration from approaches in polygenic risk scores, as well as the broader machine learning field, where ensembling often makes for better predictors. Nonetheless, consistent with the Reviewer’s intuition, we do not see improvement here.

      Minor comments

      (1) Typo in line 91.

      Thank you for bringing this to our attention. Fixed.

      (2) Lines 111-115, sample sizes would be helpful.

      We have added the sample sizes of the non-demented controls that were used to calculate these correlations in each cohort.

      (3) Line 137-138, the correlation stats would be helpful here. This is a common issue throughout the paper, more in-text statistics would help readers to evaluate the authors’ claims. For example, lines 249-251 as well. The authors refer the reader to Figure 5C, which itself has no statistics, this has two plots so it’s unclear which the authors are putting forward as the primary evidence.

      We have added more statistical details in the text and figures to address this comment. In this instance, we have removed the referenced figure.

      (4) Lines 258 and 261, I believe the authors report the same result in both these lines.

      Thank you for pointing out this lack of clarity. These lines report different, but related, results about the frequency of clock-affecting meQTL in different ancestral contexts. The first reports the frequency of clock CpG-affecting meQTL in individuals of African ancestry across all of gnomAD. The second result gives the frequency of those meQTL in different local ancestry backgrounds in admixed individuals. This distinction is relevant since admixed individuals’ genomes are mosaics of multiple genetic ancestries. As such, a genetic variant might be present in a haplotype whose ancestry is not in line with expectations based on global ancestry (e.g., an African American individual inherits a genetic variant within a European ancestry block). This local ancestry difference could modify the effect of the variant or obscure causal variants. Given the potential for confusion and similar results considering global and local ancestry context in this case, we have focused on the first result in the Main Text.

      (5) Somewhere, it would be helpful to provide the distribution/range of ages broken by cohort. Similarly, I didn’t see the breakdown of AD versus control cases within each cohort. Both of these features will impact power within a given cohort for certain analyses.

      We have added the distribution of ages by cohort in Supplementary Figure 1. Table 1 provides a breakdown of cases versus controls for each of the cohorts in the MAGENTA study.

      (6) Figure 3 is pretty hard to read. It would also be helpful if the authors put the white cohort in Figure 3A as a ’baseline’ comparison, as they use this as the baseline comparison in the text.

      We have made these changes to the figure and used larger text overall.

      (7) The various acronyms in the labels in Figure 5 are not explained. For Figure 5C - this is over-plotted and therefore hard to see.

      We have added the full population descriptors from gnomAD to the boxplots showing allele frequencies (Figure 6E). In addition, what used to be Figure 5C has been simplified and moved to Supplementary Figure 12.

      (8) The authors correct for cell type heterogeneity, which is known to vary across populations and can impact clock estimates. However, as far as I can tell, the cell type proportion estimates are coming from the DNA methylation data. The deconvolution algorithms for cell type proportions also have the same problem as the clocks of being trained on a very specific subset of human genetic and environmental diversity. Do the authors have any empirically derived estimates of cell type heterogeneity to sanity-check these deconvolution estimates? At the very least, it would be helpful to acknowledge this limitation.

      We thank the reviewer for commenting on this. There are no empirically derived estimates of cell type counts for the samples in the MAGENTA study. This is an inherent limitation of our study, and we have included text to make note of this.

      (9) There are very different sample sizes for each group, did the authors consider that their null results for the AD analyses in different cohorts are just a lack of power? This could be evaluated with power analyses or by comparing against sample sizes from similar studies in the literature.

      We agree that this is an important analysis and have added it to the manuscript. Given these small effect sizes, accounting for statistical power is essential for interpretation of our results. We performed power calculations based on an effect of the size observed in previous studies (0.5 year acceleration). Considering the full study, we have 86% power to detect an effect of this size. Stratifying by cohorts, we have 75% power for the African Americans, 72% for the Puerto Ricans, 72% for the Whites, 65% for the Peruvians, and 47% for the Cubans. Thus, we have high enough power that the consistent lack of association observed outside of the White cohort in MAGENTA is likely meaningful. Based on these calculations, there is only a 1% chance that we would not observe an effect in any of the other cohorts if the effect was present across cohorts. Nonetheless, we have added caveats about power and the small sample size to our suggestion that the reduced accuracy of the clocks contributes to the lack of association outside of Whites.
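      The roughly 1% figure quoted above can be checked from the per-cohort power values. This is a minimal sketch that assumes the cohort-level tests are independent and that the same true effect is present in every cohort; the helper name `prob_all_miss` is hypothetical. Under those assumptions, the chance that all four non-White cohorts miss the effect is the product of their individual miss probabilities (one minus power).

```python
from math import prod

# Power values quoted in the response for the four non-White cohorts.
power_by_cohort = {
    "African American": 0.75,
    "Puerto Rican": 0.72,
    "Peruvian": 0.65,
    "Cuban": 0.47,
}

def prob_all_miss(powers):
    """Probability that every test fails to detect a true effect,
    assuming the tests are independent."""
    return prod(1.0 - p for p in powers)

p_miss = prob_all_miss(power_by_cohort.values())
print(f"Chance of no detection in any cohort: {p_miss:.3f}")
```

      The product 0.25 × 0.28 × 0.35 × 0.53 is about 0.013, in line with the "only a 1% chance" statement.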

      (10) There has been a fair amount of discussion recently that single CpG-based clocks are much more variable than clocks that combine information across CpG sites, either using PC-based or window-based approaches. For example, the PC clock R package from the Levine Lab (https://github.com/MorganLevineLab/PC-Clocks) is very easily implemented and generally gives much less variable age estimations than site-level clocks. It would be nice to consider integrating or discussing these later-generation clocks as ways to improve clock performance in diverse human groups.

      We thank the reviewer for their suggestion to apply the principal component clock to account for potential technical variation. As outlined in the new section “Principal component versions of the methylation clocks also have lower age prediction accuracy for genetically admixed individuals,” using the principal component version of the Horvath clock did not result in consistent improvement in age prediction accuracy or generalization across MAGENTA cohorts (Supplementary Figures 4 and 5). The lower accuracy for age prediction in individuals with substantial African ancestry was present for the PC clock in the replication cohorts, just as in the MAGENTA cohorts (Supplementary Figure 6).

    1. eLife Assessment

      This study presents a comparison of the efficiency and precision of two prime editing methods to introduce single-nucleotide variants and longer exogenous DNA sequences into the zebrafish genome. Convincing data support the conclusion that the PE2 prime editor Nickase is more effective at introducing single-nucleotide variants, while the PEn prime editor nuclease is more effective at integrating sequences from 3 up to 46 base pairs, for both somatic and germline editing. The results will be valuable for the zebrafish community, in particular to model human disease variants in this model organism.

    2. Reviewer #1 (Public review):

      Ono et al. compared the activity of prime editor nickase PE2 and prime editor nuclease PEn in introducing SNPs and short exogenous DNA sequences into the zebrafish genome to model human disease variants. They find the nickase PE2 prime editor had a higher rate of precise integration for introducing single nucleotide substitutions, whereas the nuclease PEn prime editor showed improved precision of integration of short DNA sequences. In somatic tissue the percentage of SNP variant precision edits improved when using PE2 RNP injection instead of mRNA injection, but increased precision editing correlated with elevated indel formation. While PEn overall had higher rates of precision edits, the indel rate was also elevated. Similar rates were observed when introducing a 3 bp stop codon into the ror gene using a standard pegRNA with a 13-nucleotide homology arm, or a springRNA driving integration by NHEJ. Inclusion of an abasic sequence in the springRNA prevented imprecise edits caused by scaffold incorporation, but did not improve the overall percentage of precise edits in somatic tissue. Both PE2 and PEn showed higher frequency of 3 bp precision integration, compared to CRISPR HDR mediated knock-in using a single strand donor DNA template with short homology. Recovery of a germline ror-TGA integration allele using PEn with RNP was robust, resulting in 5 out of 10 founders transmitting a precise allele. The authors demonstrate PEn was effective at integration of a 30 bp nuclear localization signal into the 5' end of GFP in an existing muscle-specific reporter line. PEn-mediated integration of long sequences was further demonstrated by integration into the wls gene of a 46 bp attP sequence for phiC31 integrase recombination. Additional analyses are needed to determine if the approach can be used to isolate stable germline alleles of variants that are potentially dominant negative or gain of function in nature.

      The conclusions of the paper are well supported, demonstrating PE2 increases precision, while PEn increases efficiency, for integrating short DNA sequences. Introducing longer sequences up to 46 bp with PEn highlights the potential broad utility of this approach for insertion of functional motifs for protein modification and gene expression.

      (1) In Figure 3 the data indicate a significant increase in precise edits of the 3 bp TGA using PE2 RNP (11.5%) vs. PE2 mRNA (1.3%). At the adgrf3b locus, PE2 RNP, PE2 mRNA, PEn RNP and PEn mRNA were all tested for introducing the 3 bp TGA and a longer 12 bp insertion. PEn RNP showed the highest rate of precision for integration of the longer 12 bp sequence. A comparison of somatic precision editing at additional loci, and analysis of germline transmission rates using PE2 vs. PEn, would support the conclusion that PEn is preferred for precise integration of longer templates, and recovery of germline integration alleles.

      (2) Figure 4 shows the results of introducing a TGA stop codon that is predicted to result in nonsense mediated decay. Testing the ability to also isolate different substitution mutations in the germline would be useful information for identifying the most effective approach for generating human disease variant models.

    3. Reviewer #2 (Public review):

      The manuscript by Ono et al compares two prime editing strategies in zebrafish, one based on a nickase and the other on a nuclease, and evaluates their performance for introducing substitutions, short insertions, and transmission to the next generation. The study aims to clarify the relative strengths of these approaches and to extend their use for inserting short DNA sequences in vivo.

      The study provides a useful and well-executed comparison of two editing strategies in a vertebrate model. In particular, the finding that the nuclease-based approach shows higher efficiency for short insertions is of practical interest for functional studies. The authors also present convincing evidence supporting their conclusions, including sequencing and phenotypic validation at selected loci. These results support the reliability of the approach in this system.

      The overall conceptual advance remains somewhat limited, as the general strategy of delivering prime editing components in zebrafish has been described previously. The present study extends this work by comparing two editing modes and exploring insertion efficiency, which represents a useful but incremental advance.

      Regarding the comparison between the two systems, the authors have made efforts to address concerns about generalizability by adding data from additional loci and by refining the scope of their conclusions. These additions strengthen the manuscript. However, the comparison is still based on a relatively small number of loci, and the conclusions may therefore remain somewhat context-dependent.

      Overall, the authors largely achieve their stated aims of comparing two editing strategies and demonstrating their applicability in zebrafish. The data generally support the conclusions, particularly within the tested loci. The work provides practical value to the community, especially for researchers seeking efficient strategies for short sequence insertion in this model system, although its broader impact is somewhat limited by its incremental nature.

    4. Reviewer #3 (Public review):

      The manuscript by Ono et al describes application of prime editors to introduce precise genetic changes in the zebrafish model system. Probably the most important observation is that compared to the "standard" PE2, prime editor with full nuclease activity appears to be more efficient at introducing insertions into the genome. Although many laboratories around the world have successfully used oligonucleotide-mediated HDR to insert short exogenous sequences such as epitope tags or loxP sites into the zebrafish genome, the method suffers from high frequency of indels at the edit site. Thus, additional tools are badly needed, making this manuscript very important.

      Comments on revised version.

      Thank you for thoroughly addressing my minor concerns.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Thank you very much for handling our revised manuscript and for the careful and constructive comments from the reviewers. We are grateful for the detailed feedback, which has helped us improve both the experimental presentation and the framing of the study. In response to the comments, we have substantially revised the manuscript, updated the figures and supplementary figures, and clarified several points in the text. We have also added new experimental analyses, which were essential to strengthen the manuscript.

      We would like to highlight the major changes in the revised version:

      Added the late phenotype analysis of the ror2 mutant, including loss of nasal and maxillary barbels and altered adult jaw morphology by microCT, strengthening the disease-model relevance.

      Added new data on a further target locus (wls) showing 46 bp attP insertion by PEn and comparison with HDR-mediated knock-in at the same site.

      Expanded the analysis of insertion performance at adgrf3b and clarified comparison with previously reported PE2 data.

      Added the analysis of HDR-mediated knock-in and prime editing substitution to generate the ror2 W722X allele.

      Added comparative off-target analysis for PE2, PEn and HDR at three predicted off-target sites for the ror2 target.

      Resolved the cloning/NGS inconsistency for ror2 by increasing clone analysis.

      We have also moderated several statements in the manuscript, for example, that editing efficiency is locus- and edit-dependent, and that broader comparison of germline transmission efficiencies between prime editing systems will require future work.

      A few reviewer suggestions would have required substantial additional experimental work that is technically demanding and beyond the immediate scope of the present methods-focused resubmission, for example, a direct side-by-side germline comparison of PE2 and PEn across several loci, or systematic cost benchmarking against HDR across multiple edit classes. Rather than overstate these points, we have acknowledged these limitations directly in the revised manuscript and narrowed our claims accordingly.

      Public Reviews:

      Reviewer #1 (Public review):

      From the work presented, it is unclear how prime editing could be used to transiently model human pathogenic variants, given the low frequency of precision edits in somatic tissue, or to isolate stable germline alleles of variants that are potentially dominant negative or gain-of-function in nature. Without a direct comparison with CRISPR/Cas9 nuclease HDR-based methods that use oligonucleotide templates to introduce edits, the advantage of prime editing is unclear. A cost comparison between prime editing and HDR methods would also be of interest, particularly for integration of longer DNA sequences.

      We thank the reviewer for this important comment. In response, we added a direct comparison between PEn-mediated editing and HDR-mediated knock-in at the ror2 locus and the wls locus using insertion of a 46 bp attP sequence. This new dataset shows that PEn can achieve programmed insertion at a higher efficiency in ror2 and comparable efficiency in wls to HDR at the same target site, thereby providing a more direct benchmark within zebrafish embryos. We also revised the Discussion to better position prime editing as a practical donor DNA-free approach rather than as a universally superior method. We agree that a formal cost comparison would be informative; however, such an analysis would depend strongly on locus, edit size, optimisation burden, and local reagent production pipelines, and we believe this is beyond the scope of the present manuscript. Instead, we now discuss these practical considerations more cautiously in the revised Discussion.

      (1) In Figure 3, the data indicate a significant increase in precise edits of the 3 bp TGA using PE2 RNP (11.5%) vs. PE2 mRNA (1.3%). At the adgrf3b locus, only PEn mRNA was tested for introducing the 3 bp and 12 bp insertions. The previous study testing PE2 for 3 and 12 bp insertions was mentioned, but the frequency was not listed, and the study wasn't cited (lines 204 - 207). A comparison of germline transmission rates using PE2 vs. PEn would support the conclusion that PEn allows precise integration of longer templates and recovery of germline integration alleles.

      We appreciate this point. We revised the adgrf3b section to include the relevant reference and explicitly state the previously reported PE2 frequencies, allowing clearer comparison with our PEn data. We added our own experimental data comparing PE2 and PEn, delivered as mRNA or RNP, at the adgrf3b locus (Figure 3i and j). We also refined the wording of our conclusions so that we do not imply a direct germline comparison between PE2 and PEn where such data are not available. In the revised manuscript, we now state that our germline transmission results apply to PEn-mediated insertions in the loci tested here. A full side-by-side germline comparison between PE2 and PEn across multiple loci would indeed be valuable, but this would require substantial additional animal work and time and is beyond the scope of the present resubmission.

      (2) Figure 4 shows the results of introducing a TGA stop codon that is predicted to result in nonsense-mediated decay. Testing the ability to also isolate different substitution mutations in the germline would be useful information for identifying the most effective approach for generating human disease variant models.

      We agree that this would be useful. In the present study, we focused experimentally on establishing stable lines for the insertion-based edits, while the substitution experiments were used to compare PE2 and PEn performance in somatic editing at the crbn locus. We also tested the generation of the ror2 W722X allele by prime editing substitution (Supplementary Figure 3). We have therefore revised the manuscript to clarify the scope of the disease-modelling claim and now state more explicitly that our data support the generation of disease-relevant alleles in cases where short, programmed substitutions or insertions are sufficient.

      A comparison with the prime editing variant knock-in frequencies reported in the recent publication by Vanhooydonck et al., 2025, Lab Animal should be included in the Discussion.

      We have added this study to the revised manuscript and now discuss our findings in relation to the frequencies reported by Vanhooydonck et al. (2025).

      Reviewer #2 (Public review):

      The comparative analysis between PE2 and PEn systems suffers from limited evidentiary support. The comparison relies on single loci for substitutions (crbn) and insertions (ror2), raising concerns about generalizability. Additional validation across multiple loci is necessary to support broad conclusions about PE2/PEn performance.

      We appreciate this concern. To strengthen the manuscript, we added new experimental data at an additional target locus, wls, where we tested insertion of a 46 bp attP sequence and compared PEn with HDR-mediated knock-in. We also included the adgrf3b insertion data more prominently. At the same time, we revised the wording throughout the manuscript so that our conclusions are more carefully limited to the loci tested here.

      Reviewer #3 (Public review):

      (1) The logic for introducing two nucleotide changes (at +3 and +10) to change a single amino acid (I378) should be explicitly explained in the main body of the manuscript. It is indeed self-explanatory when looking at Supplementary Figure 1. One way of doing it could be to include Supplementary Figure 1a in Figure 1.

      We thank the reviewer for pointing this out. We have now explained this directly in the main text. Specifically, we state that one nucleotide change introduces the desired missense mutation, whereas the second was included to reduce potential pegRNA misfolding caused by complementarity between the spacer and the PBS/RT template region.

      (2) It is not clear why a 3-nucleotide insertion was used to generate W722X. The human W720X is a single-nucleotide polymorphism, and it should be possible to make a corresponding zebrafish mutant by introducing two nucleotide changes.…

      We agree with this point and have now explained in the main text that the 3 bp stop-codon insertion was chosen as a proof-of-principle strategy for generating a precisely truncated protein through programmed insertion, a type of edit that can be broadly applied to target loci. We also tested the generation of the ror2 W722X allele by prime editing substitution (Supplementary Figure 3), and we clarify in the text that this substitution was tested separately.

      (3) Lines 137-138: T7 Endonuclease assay used in Figure 2d detects all polymorphisms, both precise changes and indels. Thus, if this assay were performed on embryos shown in Figure 1c-d, the overall percentage of modified alleles would be similarly higher for PEn over PE2 (add up precise prime edits and indels). The conclusion in the last sentence of the paragraph is, therefore, incorrect, I believe.

      We agree with this point and have revised the sentence accordingly. The text now states that no obvious cleavage was observed with the PE2/pegRNA condition, suggesting fewer editing events compared with PEn, rather than implying greater precision from the T7E1 result alone.

      (4) Use of terminology. "Germline transmission" is typically used to refer to the fraction of F0s transmitting desired changes (or transgenes) to their progeny, while "germline mosaicism" refers to the fraction of F1s with the desired change in the progeny of a given F0. "Germline transmission" in line 217 should be replaced with "germline mosaicism".

      We have replaced the terminology accordingly in the revised manuscript.

      (5) Lines 253-255: The fraction of injected embryos that had mosaic nuclear expression of GFP, indicative of NLS insertion, should be clarified. It should also be clarified whether embryos positive for nuclear GFP were preselected for amplicon sequencing and germline transmission analyses. This is extremely important for extrapolation to scenarios like epitope tagging, where preselection is not possible.

      We agree and have clarified this in the revised manuscript. We now state the fraction of injected embryos showing mosaic nuclear GFP expression, and we explicitly note that embryos were not preselected prior to sequencing or founder analysis. We further explain that preselection was not practical because the transgene is multicopy and individual fibres showed variable ratios of nuclear to cytoplasmic GFP, which made reliable scoring difficult.

      (6) Statistical analyses. It would be helpful to clarify why different statistical tests are sometimes used to assess seemingly very similar datasets (Figures 1c, 1d, 2b, 2c, 2f).

      We have clarified this in the Materials and Methods section and now state that the choice of statistical test depended on the normality and variance structure of the experimental data.

      (7) Discussion. Since authors suggest that PEn might be especially beneficial for insertion of additional sequences, it is important to stress locus-to-locus variability of success. While the precise +3 insertion was indeed tremendously efficient at both tested loci (ror2 and adgrf3b), +12 addition into adgrf3b was over 10 times less efficient. In contrast, +30 into smyhc:GFP using the shorter pegRNA was highly efficient again. Longer pegRNA did not work nearly as well. As dangerous as it is to extrapolate from small datasets, perhaps these observations indicate that optimization of RT template and PBS may be needed for each new locus in order to significantly outperform oligonucleotide-mediated HDR? If so, would the cost of ordering several pegRNAs and the effort needed to compare them factor in when deciding which method to use?

      We fully agree and have substantially revised the discussion to reflect this point. We now emphasise more clearly that editing efficiency is locus- and edit-dependent and likely influenced not only by insertion length but also by spacer sequence and pegRNA complexity. We cite the relevant literature on prime editing determinants and discuss that locus-specific optimisation may be required. We also softened our concluding claims so that the manuscript presents PEn as a practical donor DNA-free approach rather than as a universally high-efficiency solution.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Because this is a genome editing methods paper, including frequency or percentages of somatic and germline editing in the abstract, in comparison to previously published studies, it would be useful information for the intended audience

      We agree and have revised the abstract to include concrete editing frequencies: it now reports the strongest insertion efficiencies observed. We also retained the statement that edited alleles were transmitted to the next generation.

      Reviewer #2 (Recommendations for the authors):

      (2) Please include additional loci for substitutions and insertions to strengthen conclusions about PE2/PEn efficiencies.

      In response, we added further substitution data at the ror2 locus (Suppl. Data 3) and insertion data at the wls locus (Suppl. Data 6). We also strengthened the presentation of the adgrf3b insertion data and narrowed the wording of our conclusions so that they are explicitly limited to the loci tested here.

      (3) Please provide direct comparisons between zebrafish ror2 W722X phenotypes and human Robinow syndrome symptoms to support disease modeling claims.

      We addressed this by adding analysis of the late ror2 phenotype. In the revised manuscript, zygotic and maternal-zygotic mutants are reported to lack nasal and maxillary barbels, and one-year-old mutants show altered jaw morphology with a less protrusive lower jaw (Figure 4).

      (4) The substitution of two nucleotides (+3 G→C and +10 A→G) to target residue I378 of crbn is not justified. It is unclear why two substitutions were required to model thalidomide sensitivity or validate editing efficiency. Please explain why dual nucleotide substitutions were necessary in the crbn experiments and whether single substitutions would suffice.

      We now explain in the main text that the second substitution was introduced to reduce potential inhibitory intramolecular interactions within the pegRNA, while the primary substitution generated the intended amino-acid change. This clarification is now stated explicitly in the Results.

      (5) The reported 10.3% precise editing efficiency for PEn/pegRNA at ror2 conflicts with Supplementary Figure 2, where none of the 20 clones from PEn/pegRNA showed precise edits, while one clone from PEn/springRNA did. Please address the inconsistency between NGS and cloning results at ror2, possibly by increasing sample size or reanalyzing sequencing data.

      We addressed this directly by repeating and expanding the clone analysis. The revised Supplementary Figure 2 now includes the updated clone dataset, and the result is in much better agreement with the NGS-based frequency estimates.

      (6) Figure 3d highlights edits from PEn/springRNA but omits PEn/pegRNA results, despite the latter being described as superior. This creates ambiguity about the relative performance of pegRNA vs. springRNA. Please include PEn/pegRNA results in Figure 3d to fairly represent pegRNA performance.

      We agree. We therefore revised Figure 3e so that it now includes alignment data for PE2/pegRNA, PEn/pegRNA and PEn/springRNA, allowing more direct visual comparison of the editing outcomes.

      (7) The study does not specify the version of PEn used, or introduce some background of PE2 and springRNA. Comparisons to prior PE work in zebrafish, base editing, or HDR efficiencies are absent, obscuring the novelty of this approach. Please specify the PEn variant used, describe springRNA/PE2 structures, and compare results to prior zebrafish PE studies, BE, and HDR efficiencies for similar edits, contextualizing where PE2/PEn offers unique advantages.

      We thank the editors for this helpful suggestion. We have clarified the PEn and PE2 systems in the manuscript, specified the nuclease-based PEn used, and improved the background text introducing these editing strategies. We added data that directly compare prime editing and HDR at the ror2 locus (Figure 3). We also expanded the Discussion to place the current findings in the context of prior zebrafish prime editing, HDR-based knock-in and base-editing work. We did not test all alternative systems experimentally in the current study, but we now discuss their relevance and clearly define the specific contribution of the present work.

      (8) The manuscript does not explore advanced PE variants (e.g., PE3, PEmax), codon optimization, or scaffold modifications to improve efficiency. Please discuss whether codon optimization, PE3/PEmax systems, or pegRNA modifications were tested or could improve outcomes.

      We agree that this should be discussed. We have added recent work on zebrafish prime editing optimisation, codon optimisation, pegRNA engineering and related advances to the Discussion, and we explain that these are promising avenues for improving efficiency in future studies.

      (9) No data compares the off-target effects of PE2 and PEn, a critical consideration for evaluating specificity and safety. Please perform comparative off-target analyses for PE2 and PEn to assess specificity.

      In response, we performed comparative off-target analysis for the ror2 target and analysed three predicted off-target sites. These data are now included in Supplementary Figure 3 and show no significant increase in non-specific editing for the prime editing conditions tested.

    1. eLife Assessment

      This important study used five metrics to compare the cost-effectiveness of intramural and extramural research funded by the National Institutes of Health in the United States between 2009 and 2019. They found that each type of research had its own set of strengths: extramural research was more cost-effective in terms of publications, whereas intramural research was more cost-effective in terms of influencing clinical work. The evidence supporting these findings is solid.

    2. Reviewer #1 (Public review):

      Summary:

      This paper carefully compares intramural vs. extramural National Institutes of Health funded research during 2009-2019, according to a variety of bibliometric indices. They find that extramural awards more cost-effectively fund outputs commonly used for academic review such as number of publications and citations per dollar, while intramural awards are more cost-effective at generating work that influences future clinical work, more closely in line with agency health goals.

      Strengths:

      Great care was taken in selecting and cleaning the data, and in making sure that intramural vs. extramural projects were compared appropriately. The data has statistical validation. The trends are clear and convincing.

    3. Reviewer #2 (Public review):

      This article reports a cost-effectiveness comparison of intramural and extramural research that NIH funded between 2009 and 2019. Using data obtained from NIH RePORTER, they linked total project costs to publication output, using robust validated metrics including Relative Citation Ratio (RCR), Approximate Potential to Translate (APT), and clinical citations. They find that after adjusting for confounders in regression and propensity-score analyses, extramural projects were generally more cost-effective, though intramural projects were more cost-effective for generating clinical citations. They also describe differences in the topics of intramural- and extramural-funded publications, with intramural projects more likely to generate papers on viral infections and immunity or cancer metastases and survival, but less likely to generate papers on pregnancy and maternal health, brain connectivity and tasks, and adolescent experiences and depression. The authors aptly describe the different natures of the intramural and extramural funding models, including that extramural researchers spend much time writing grant applications and that the work described in extramural publications often receives funding from sources other than NIH grants.

      Strengths:

      The authors leveraged publicly available data (including RePORTER and the iCite repository) and used robust validated metrics (RCR, APT, clinical citations). They carefully considered a large number of confounders, including those related to the PI, and performed several well-described regression analyses.

    4. Reviewer #3 (Public review):

      This article demonstrates a comparative study on two funding mechanisms adopted by the National Institutes of Health (NIH). The authors adopted a quantitative approach and introduced five metrics to compare the output of intramural and extramural grants. These findings reveal the impacts of intramural and extramural grants on the scientific community, providing funders with insights into the future decisions of funding mechanisms they should take.

      Strengths:

      The authors clearly presented their methods for processing the NIH project data and classifying projects into either intramural or extramural categories. The limitations of the study are also well-addressed.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Strengths:

      Great care was taken in selecting and cleaning the data, and in making sure that intramural vs. extramural projects were compared appropriately. The data has statistical validation. The trends are clear and convincing.

      We thank the reviewer for highlighting the strengths of the manuscript.

      Weaknesses:

      The Discussion is too short and descriptive, and needs more perspective - why are the findings important and what do they mean? Without recommending policy, at least these should discuss possible implications for policy.

      The Discussion has been substantially expanded. We added several new paragraphs discussing: the 2024 Senate HELP Committee proposal for NIH reform; implications for portfolio management (positioning extramural for basic research, intramural for clinical translation); generalizability to other agencies (DoD, NSF FFRDCs, DoE national labs); and the extramural program's role in workforce training as a societal benefit distinct from research outputs.

      The biggest problem I have with this submission is Figure 3, which shows a big decrease in clinical-related parameters between 2014 and 2019 in both intramural and extramural research (panels C, D and E). There is no obvious explanation for this and I did not see any discussion of this trend, but it cries out for investigation. This might, for example, reflect global changes in funding policies which might also influence the observed closing gaps between intramural and extramural research.

      We added an explicit explanation in the Results: because the dataset is truncated at 2020, clinical citations naturally approach zero near the window's end, consistent with the ~7-year lag for clinical citations to accrue documented in prior work (Hutchins et al., 2019). The APT metric declines less steeply because it uses the forward citation network for predictions.

      Reviewer #2 (Public review):

      Strengths:

      The authors leveraged publicly available data (including RePORTER and the iCite repository) and used robust validated metrics (RCR, APT, clinical citations). They carefully considered a large number of confounders, including those related to the PI, and performed several well-described regression analyses.

      We thank the reviewer for highlighting these strengths of the manuscript.

      Figure 3A shows intramural projects producing about 2.75 papers per year in 2009, whereas extramural projects are producing just over 1 paper per year. Extramural projects appear to catch up over the next five years. While the authors attempt to explain the difference in their figure legend, another explanation is that the intramural projects started well before 2009 but, as the authors state, intramural data only became available in 2009.

      We added a methodological note acknowledging that some intramural projects may have had start dates prior to 2009 that are not captured in the data, and that the ramp-up of new intramural projects is slower because they are more closely tied to new PI hiring. We also note the exclusion of projects matched in 2008 as possible continuations. The slow ramp-up of intramural costs in Supplemental Figure 3 is consistent with hiring-associated lagged investment, which suggests that our filtering of continuing projects was very successful. Nevertheless, because we cannot completely rule out that some continuing projects passed the filter despite our efforts, we have added the caveats mentioned above to the “Comparison of research topics” section of the Results and the Data section of the Methods.

      As the authors note, funding information is often complex and difficult to characterize for an analysis like this. How did the authors handle: i) publications linked to multiple extramural grants; ii) publications linked to intramural and extramural grants; iii) publications linked to NIH grants and non-NIH grants?

      I would think it necessary to somehow apportion credit, as otherwise it would appear that extramural projects are more productive than they truly are.

      We have now explicitly stated that papers with both intramural and extramural funding links were excluded, while papers with multiple links within the same funding type were retained. A new Supplemental Figure 6 was added showing the distribution of papers by number of funding sources for both extramural and intramural grants, demonstrating that the vast majority acknowledged only one project. These changes are in the Data section of the Methods and in Supplemental Figure 6.

      Apportioning credit across a many-to-many graph like the one used here is indeed a high-value problem to solve, but one with many researcher degrees of freedom in the analytical design decisions that affect the results. We are working on a rigorous methodology for this, but doing it well is a research project in its own right and is out of scope for manuscript revisions.

      Also, it is not clear if the authors took account of the indirect costs paid by the NIH to universities that have received extramural grants.

      We added explicit language clarifying that all cost comparisons use inflation-adjusted total costs (direct + indirect) for extramural grants. We also added a new sensitivity analysis (Supplemental Figure 4) inflating extramural indirect costs by 30% to approximate unrecovered university expenditures, with the finding that the fundamental pattern holds even under this adjustment. These are found in the “Comparison of funding” and “Comparison of cost effectiveness” sections of the Results, as well as Supplemental Figure 4.

      Reviewer #3 (Public review):

      Strengths:

      The authors clearly presented their methods for processing the NIH project data and classifying projects into either intramural or extramural categories. The limitations of the study are also well-addressed.

      We thank the reviewer for highlighting these strengths of the manuscript.

      Weaknesses:

      The article would benefit from a more thorough discussion of the literature, a clearer presentation of the results (especially in the figure captions), and the inclusion of evidence to support some of the claims.

      The Introduction was updated with more specific framing of prior literature (e.g., explicit mention of risk management, funding disparities, and diminishing marginal returns as the focus of prior work). New references were added throughout the Introduction and Discussion, including Sampat (2012) on mission-oriented NIH research, Ioannidis et al. (2019) on grant competition inefficiencies, Drummond et al. (2005) on health economic evaluation methods, and the Cassidy (2024) Senate report.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      The article would benefit from a more detailed analysis/discussion about the recovery of indirect costs for extramural research.

      I note that the authors are from the University of Wisconsin, which is part of the IRIS network (https://iris.isr.umich.edu/iris-members-map/). They could work with IRIS (also called UMETRICS) to get a better sense as to the true costs of extramural research for each project (e.g., all labor costs, all equipment costs). The IRIS data are extraordinarily robust. Here's an example of an IRIS / UMETRICS paper: https://www.science.org/doi/10.1126/sciadv.abb7348.

      They could, for example, re-do the analyses assuming that the recorded indirect cost covers only 70% of the true indirect costs. Thus, if they get $700,000 indirect costs from RePORTER, they should assume that the true indirect costs were $1,000,000. Similarly, they can add the costs of the time the PI spent writing the grant proposal, using the Bergstrom paper as a guide.

      Another option would be to conduct sensitivity analyses taking into account ~30% incomplete indirect cost recovery (see https://docs.house.gov/meetings/AP/AP07/20171024/106525/HHRG-115-AP07-Wstate-DroegemeierK-20171024.pdf) and lost efficiency due to excess time writing grant proposals (see https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000065).

      We conducted the requested sensitivity analysis, inflating extramural indirect costs by 30% and citing the Droegemeier (2017) Congressional testimony as the basis for this estimate. The cost of grant-writing time is now acknowledged in the Discussion as an unreimbursed hidden cost of the extramural system, citing Ioannidis et al. (2019). The indirect-cost adjustment narrowed the gap between extramural and intramural research but did not close it completely. In addition, our updated regression (Supplemental Figure 4) showed trends similar to those in our main Figure 4, but with the intramural advantage heightened and the extramural advantage diminished. Both remained significant. We have also added to the Discussion that there are additional costs and benefits that may not be fully captured in an analysis such as ours.
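
      The arithmetic behind such an adjustment can be sketched as follows. This is an illustration only, not the authors' code: the function names are hypothetical, and the $700,000 figure is the reviewer's example. It contrasts two readings of a "30%" adjustment: multiplying recorded indirect costs by 1.30, and grossing them up on the assumption that they cover only 70% of the true costs, which is the reviewer's formulation.

```python
def inflate_by_fraction(recorded_indirect, fraction=0.30):
    """Reading 1: raise recorded indirect costs by a fixed fraction (x 1.30)."""
    return recorded_indirect * (1 + fraction)


def gross_up_for_recovery(recorded_indirect, recovery_rate=0.70):
    """Reading 2: recorded costs are only `recovery_rate` of the true costs."""
    return recorded_indirect / recovery_rate


# Reviewer's example: $700,000 of recorded indirect costs.
recorded = 700_000
inflated = inflate_by_fraction(recorded)      # about $910,000
grossed_up = gross_up_for_recovery(recorded)  # about $1,000,000

print(f"x1.30 inflation: ${inflated:,.0f}")
print(f"70% recovery gross-up: ${grossed_up:,.0f}")
```

      The two readings differ by roughly $90,000 in this example, so a sensitivity analysis of this kind is easier to reproduce when it states which formula was applied.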

      The authors appear to have used an agency-perspective for their cost-effectiveness analyses. Generally, it is preferable to use a wider societal perspective. While that may be difficult, the article would benefit from some discussion from the perspective of the government and universities.

      We added a new paragraph explicitly acknowledging the agency-centered perspective and its limitations, noting that it does not capture the full economic cost borne by universities (startup costs, philanthropy, endowments, state contributions, graduate student training, faculty retention, infrastructure). The extramural program's contribution to the US workforce pipeline is specifically highlighted as a societal benefit not captured by the cost-effectiveness metrics.

      Reviewer #3 (Recommendations for the authors):

      Line 84-87: "The overrepresentation of viral research is likely because of the outsize investment toward the intramural Vaccine Research Center, and the cancer/genetics overrepresentation due in part because National Cancer Institute intramural investigators conduct research at that institute as well as at the NIH Clinical Center for their human genetics work." What evidence is there to support this claim?

      A citation to the NCI Center for Cancer Research website was added to support the claim about NCI intramural investigators working at the Clinical Center and Center for Cancer Research, where vaccine research is extensively discussed.

      Lines 107-109. "Given that NIH funding for intramural research has remained relatively constant as a percent of total funding over the years, this indicates larger single awards for intramural research while extramural investigators may increasingly require multiple concurrent grants to sustain their labs." Authors may consider adding a panel to Figure 2 showing the percentage of total funding of intramural vs. extramural funding.

      Rather than adding a panel to Figure 2, we added a new Supplemental Figure 3 showing the cost breakdown and intramural percentage of total funding by year.

      Discussion section: Are any of the findings of this study relevant to other funding agencies in the US (such as the National Science Foundation, the Department of Energy, and the Department of Defense)?

      A new paragraph was added to the Discussion covering implications for the Department of Defense (including the Congressionally Directed Medical Research Programs), NSF FFRDCs, and the Department of Energy's national labs and FFRDCs, arguing that the incentive-alignment logic likely generalizes across agencies.

      Methods section: Please add an explanation of the technique used for propensity score matching.

      A detailed step-by-step description of the PSM procedure was added, covering propensity score estimation, within-year matching, matched cohort construction, outcome regression on matched data, and visualization of results.
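
      The within-year matching step can be sketched as follows. This is a generic greedy 1:1 nearest-neighbour matcher, not the authors' implementation, which this response does not spell out. It assumes propensity scores have already been estimated, and the field names, the caliper of 0.05 and the toy records are all hypothetical.

```python
from collections import defaultdict


def match_within_year(projects, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score, by year.

    Each project is a dict with keys 'id', 'year', 'intramural' (bool)
    and 'pscore' (an already-estimated propensity score in [0, 1]).
    Returns a list of (intramural_id, extramural_id) pairs.
    """
    by_year = defaultdict(lambda: {"treated": [], "control": []})
    for p in projects:
        group = "treated" if p["intramural"] else "control"
        by_year[p["year"]][group].append(p)

    pairs = []
    for year in sorted(by_year):
        controls = list(by_year[year]["control"])
        for t in sorted(by_year[year]["treated"], key=lambda p: p["pscore"]):
            if not controls:
                break
            # Closest still-unmatched extramural project in the same year.
            best = min(controls, key=lambda c: abs(c["pscore"] - t["pscore"]))
            if abs(best["pscore"] - t["pscore"]) <= caliper:
                pairs.append((t["id"], best["id"]))
                controls.remove(best)  # match without replacement
    return pairs


# Toy records, for illustration only.
projects = [
    {"id": "I1", "year": 2010, "intramural": True, "pscore": 0.30},
    {"id": "E1", "year": 2010, "intramural": False, "pscore": 0.32},
    {"id": "E2", "year": 2010, "intramural": False, "pscore": 0.80},
    {"id": "E4", "year": 2010, "intramural": False, "pscore": 0.60},
    {"id": "I2", "year": 2011, "intramural": True, "pscore": 0.60},
    {"id": "E3", "year": 2011, "intramural": False, "pscore": 0.10},
]
pairs = match_within_year(projects)
print(pairs)
```

      In this toy data, I2 stays unmatched even though E4 has the same score, because E4 belongs to a different year. Outcome regression would then be run on the matched pairs only.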

      Figure 1: Please clarify if the relative ratio of intramural projects is calculated from the numbers of grants (as suggested in lines 95-96 and 98-100) or the numbers of publications (as suggested in lines 82-83 and 97-98).

      Also, this figure would be more intuitive if, for each topic, it showed the relevant intramural number (as it currently does) and also the relevant extramural number.

      The caption and Methods were updated to clarify that clustering and ratio calculation are based on projects/grants, not publications. A formula was added to the Methods to make the ratio calculation explicit. The figure itself was not modified to add extramural bars, though the ratio calculation already implicitly encodes both.

      Figure 2: Please change "(red)" to "(blue)" in the caption, and remove the A as there is only one panel in this figure

      Figure 4: Please change "(red)" to "(blue)" in the caption.

      These changes have been made.

      Lines 19-21: I suggest rewriting this sentence as follows:

      "We find that extramural awards are more cost-effective for producing outputs commonly used for academic evaluation, such as publications and citations per dollar, while intramural awards are more cost-effective for generating research that influences future clinical work, more closely in line with agency's health goals."

      The sentence was rewritten substantially in line with the reviewer's suggestion, now reading more clearly with "per dollar" removed as a parenthetical and the structure of the comparison clarified.

      Lines 31-34: Please rewrite this sentence along the following lines to provide more context on previous research into the grant funding system:

      Certain aspects of the grant funding system have been the focus of research, such as AAAA (Azoulay et al., 2009), BBBB (Goldstein and Kearney, 2020), CCC (Hoppe et al., 2019), DDDD (Lauer et al., 2017), EEEE (Wahls, 2018a) and FFFF (Wahls, 2018b), but the relative merits of intramural and extramural funding have received little attention to date.

      The sentence was rewritten to name specific contributions of each cited paper (e.g., risk management, funding disparities, diminishing marginal returns), replacing the generic list of citations.

      Lines 41-44: Please explain "merit score" and please add a reference to an article or website that explains the review process at the NIH.

      "Merit score" was revised to "percentile ranking of overall impact merit score" and a citation to the NIH CSR website ("What happens to your application during and after review?," 2025) was added.

      Lines 53-54: Please change Intramural to intramural (two instances, and also in line 284), and Extramural to extramural.

      "Intramural" and "Extramural" were corrected to lowercase throughout.

      Line 65-67: This sentence ("Potential advantages of the intramural approach are that researchers in the NIH's own laboratories allow the NIH to hire researchers whose research agendas more closely align with its mission.") reads awkwardly. Please clarify.

      The sentence was rewritten to read more clearly: "An advantage of the intramural approach are that NIH has the direct ability to hire scientists whose research closely aligns with agency goals, and researchers do not need to devote time and effort on preparing and submitting grant applications."

      Line 95-97: Authors should consider including an equation to help explain the following sentence: "The relative ratio of intramural projects for each topic was calculated by taking a ratio of the proportions of total grants a topic represented in the intramural vs. extramural portfolios. A relative ratio >1 signifies a higher share of intramural project publications on that topic relative to their share across all topics."

      A formula was added to the Methods defining the topic-level ratio calculation explicitly.
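
      As an illustration of the ratio described in the reviewer's quoted sentence, the calculation can be sketched as follows. This is a minimal sketch, not the formula as written in the manuscript: the function name and all counts are hypothetical.

```python
def relative_ratio(intra_topic, intra_total, extra_topic, extra_total):
    """Intramural share of a topic divided by the extramural share.

    Each share is the number of projects on the topic divided by the
    total number of projects in that portfolio.
    """
    if intra_total <= 0 or extra_total <= 0 or extra_topic <= 0:
        raise ValueError("totals and the extramural topic count must be positive")
    intra_share = intra_topic / intra_total
    extra_share = extra_topic / extra_total
    return intra_share / extra_share


# Hypothetical counts: the topic makes up 5% of the intramural portfolio
# and 2% of the extramural portfolio.
ratio = relative_ratio(intra_topic=50, intra_total=1_000,
                       extra_topic=200, extra_total=10_000)
print(f"relative ratio: {ratio:.2f}")
```

      A value above 1 means the topic takes up a larger share of the intramural portfolio than of the extramural one; a value below 1 means the opposite.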

      Line 143: The phrase "may reflect the extra attention intramural investigators are afforded" reads awkwardly - please reword.

      Reworded to "may reflect the extra time intramural investigators save because they do not have teaching and grant writing responsibilities."

      Lines 303-304: This sentence ("First, as the renewal of project contracts may alter the topic and arrangement of the projects, we dropped 70,297 projects with renewal records in our data.") reads awkwardly. Please clarify.

      Reworded to "Since the scientific focus of a study may drift over time, we dropped 70,297 projects with renewal records in our data."

      Line 378-379: Please specify the model of ChatGPT used.

      Done.

    1. eLife Assessment

      Du et al. present a valuable study examining neural activation in medial prefrontal cortex (mPFC) subpopulations projecting to the basolateral amygdala (BLA) and nucleus accumbens (NAc) during behavioral tasks assessing anxiety, social preference, and social dominance. The strength of the evidence linking in vivo neural physiology to behavioral outcomes was considered solid. Overall, the reviewers felt that the revised work provides insight into how distinct mPFC→BLA and mPFC→NAc pathways influence anxiety, exploration, and social behaviors.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      It is well known that neurons in the medial prefrontal cortex (mPFC) are involved in higher cognitive functions such as executive planning, motivational processing and internal state mediated decision-making. These internal states often correlate with the emotional states of the brain. While several studies point to the role of mPFC in regulating behavior based on such emotional states, the diversity of information processing in its sub-populations remains a less explored territory. In this study, the authors try to address this gap by identifying and characterizing some of these sub-populations in mice using a combination of projection-specific imaging, function-based tagging of neurons, multiple behavioral assays and ex-vivo patch clamp recordings.

      Strengths:

      The authors targeted mPFC projections to the nucleus accumbens (NAc) and basolateral amygdala (BLA). Using the open field task (OFT), the authors identified four relevant behavioral states as well as neurons active while the animal was in the center region ("center-ON neurons"). By characterizing single unit activity and using dimensionality reduction, the authors show differentiated coding of behavioral events at both the projection and functional levels. They further substantiate this effect by showing higher sensitivity of mPFC-BLA center-ON neurons during time spent in the open arms of the elevated plus maze (EPM). The authors then pivoted to the three-chamber social interaction (SI) assay to show that different subsets of neurons encode preference for a social stimulus over a non-social one. This reveals an interesting diversity in the function of these sub-populations on multiple levels. Lastly, the authors used the tube test as a manipulation of the anxiety state of mice and compared behavioral differences before/after in the OFT and social interaction tasks. This experiment revealed that "losers" of the tube test spend less time in the center of the open field while "winners" show a stronger preference for the familiar mouse over the object. Using patch-clamp experiments, the authors also found that "winners" exhibit stronger synaptic transmission in the mPFC-NAc projection while "losers" exhibit stronger synaptic transmission in the mPFC-BLA projection. Given the popularity of the tube test assay in rank determination, this provides useful insights into possible effects on anxiety levels and synaptic plasticity. Overall, the many experiments performed by the authors reveal interesting differences in mPFC neurons relative to their involvement in high or low anxiety behaviors, social preference and social rank.

      Weaknesses:

      The authors have addressed all comments.

    3. Reviewer #2 (Public review):

      Summary:

      The goal of this proposal was to understand how two separate projection neurons from the medial prefrontal cortex, those innervating the basolateral amygdala (BLA) and nucleus accumbens (NAc), contribute to the encoding of emotional behaviors. The authors record the activity of these different neuron classes across three different behavioral environments. They propose that, although both populations are involved in emotional behavior, the two populations have diverging activity patterns in certain contexts. A subset of projections to the NAc appear particularly important for social behavior. They then attempt to link these changes to the emotional state of the animal and changes in synaptic connectivity.

      Strengths:

      The behavioral data build on previous studies of these projection neurons, supporting distinct roles in behavior, and extend previous work by examining the heterogeneity within different projection neurons across contexts. This is important for understanding the "neural code" within the PFC that contributes to such behaviours and how it is relayed to other brain structures.

      Weaknesses:

      The diversity of neurons mediating these projections and their targeting within the BLA and NAc is not explored. These are not homogeneous structures and so one possibility is that some of the diversity within their findings may relate to targeting of different sub-structures within BLA or NAc or the diversity of projection neuron subtypes that mediate these pathways. This is an important future direction for this work but does not detract from the main finding as reported.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      Weakness:

      The diversity of neurons mediating these projections and their targeting within the BLA and NAc is not explored. These are not homogeneous structures and so one possibility is that some of the diversity within their findings may relate to targeting of different sub-structures within BLA or NAc or the diversity of projection neuron subtypes that mediate these pathways. This is an important future direction for this work but does not detract from the main finding as reported. The electrophysiological data in Figure 7 have some experimental confounds that make their interpretation challenging.

      We thank the reviewer for these thoughtful comments. We fully agree that targeting different substructures within the BLA or NAc, as well as the diversity of projection neuron subtypes mediating these pathways, represents an important direction for future investigation. We will certainly explore these possibilities in future studies.

      We have also removed the optogenetics and electrophysiology data, as they may introduce confounds. The removal of these data and figures does not affect our main conclusions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (a) The authors have improved the manuscript somewhat by refining their description of the results. However, the normalized EPSC experiments still do not make much sense. If you have a higher light intensity or LED duration the curve of the EPSC response will saturate earlier. Similarly, if you are in a highly, or poorly labeled slice or subregion of a slice then you will see responses emerge at different intensities based on the number of synapses labelled. There is no standardization in the way these experiments were performed, so performing some arbitrary post hoc normalisation does not correct for this. Similarly, they also place the fibreoptic manually above the slice each time. This makes it much harder to determine the actual light intensity delivered to the slice on a cell by cell and group by group basis.

      I have reduced my public statement from significant experimental confounds, to some experimental confounds. But the way the experiments were performed does not allow the normalized data to really be interpretable. They still argue that normalized EPSCs are relatively larger. I don't even really understand what this means biologically.

      The subsequent rise/decay and other measures are now better described. However, they note that the decay constant is larger. This means that the kinetics are slower, not enhanced, as they describe.

      Again, we thank the reviewer for the careful advice. We recognize the limitations of the optogenetics and electrophysiology data and have therefore removed them to avoid potential confounds.

    1. eLife Assessment

      This multimodal neuroimaging study leverages fMRI, PET, and deep learning to predict memory performance. The authors introduce the brain-cognition gap to link these different imaging modalities to cognition and evaluate their results in two independent cohorts. The results are solid, provide an important contribution to the literature, and will be of interest to neuroscientists working at the interface of cognition and neuroimaging.

    2. Reviewer #1 (Public review):

      Summary:

      The authors attempted to identify whether a new deep learning model could be applied to both resting and task state fMRI data to predict cognition and dopaminergic signaling. They found that resting state and movie watching conditions best predict episodic memory, but only movie watching predicts both episodic and working memory. A negative 'brain gap' (where the model trained on brain connectivity predicts worse performance than what is actually observed) was associated with less physical activity, poorer cardiovascular function, and lower D1R availability.

      Strengths:

      The paper should be of broad interest to the journal's readership, with implications for cognitive neuroscience, psychiatry, and psychology fields. The paper is very well-written and clear. The authors use two independent datasets to validate their findings, including two of the largest databases of dopamine receptor availability to link brain functional connectivity/activity with neurochemical signaling.

      Weaknesses:

      The deep learning findings represent a relatively small extension/enhancement of knowledge in a very crowded field.

      It's unclear from these results how much utility the brain gaps provide above and beyond observed performance. It would be helpful to take a median split of the dataset on observed performance and plot the outcome alongside the current Fig 3 results to see how the cardiovascular and physical activity measures differ based on actual performance. Could the authors perform additional analyses describing how much additional variance is explained in these measures by including brain gaps?

      Some of the imaging findings require deeper analysis. For figure 1f - Which default mode regions have high salience? DMN is a huge network with subregions having differing functions.

      Along the same lines, were the striatal D1R findings regionally specific at all? It would be informative to test whether the three nuclei (Accumbens, Caudate, Putamen) and/or voxelwise models would show something above and beyond what is achieved from averaging D1R across the striatum. What about cortical D1R, which are highly abundant, strongly associated with cognitive (especially WM) performance, and have much unique variance beyond striatal D1R? https://www.science.org/doi/full/10.1126/sciadv.1501672. The PET findings are one of the unique strengths of this paper and are underexplored. It's also unclear if the measure of brain entropy should simply be averaged across all regions.

      It is not clear from the text that the authors met the preconditions for mediation analysis (that is, demonstrating significant correlations between D1R and entropy, in addition to the correlation with brain gap). Could they please report this as well?

      Was age controlled for in the mediation analysis? I would not consider this result valid unless that is the case.

      The discussion is long, but the authors would do better to replace some less helpful sections (e.g., the paragraph on methodological tweaks to parcellations and model alignment) with a couple of other important points, including:

      (1) Discuss the 'sweet-spot' of movie watching for behavior prediction in the context of studies showing that task states 'quench' neural variability: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007983. This may not be mutually exclusive of the discussion on dopamine and signal-to-noise ratio, but it would be helpful for the authors to discuss their potential overlap vs. unique contributions to the observed findings.

      (2) The argument that dopamine signaling increases signal-to-noise ratio is based on some preclinical data as well as correlational data using fMRI with pharmacological challenges. It is less clear how PET-derived estimates of D1R and D2R availability equate to 'dopamine signaling' as it is thought of in this context. Presumably, based on these data, higher D1R or D2R availability would be related to greater levels of tonic dopaminergic signaling. However, in the case of the COBRA dataset with D2R estimates, those are based on raclopride -- which competes with endogenous dopamine for the D2 receptor. Therefore, someone with higher levels of endogenous dopamine signaling should theoretically have lower raclopride binding and lower D2R estimates. I'm not arguing that the authors' logic is flawed or that D1R and D2R are not good measures of dopamine signaling, but I'd ask the authors to dig into the literature and describe more direct potential links for how greater receptor availability might be associated with greater dopamine signaling (and hence lower entropy). Adding this to the discussion would be very valuable for PET research.

      Comments on revised version:

      I thank the authors for their extensive efforts to revise the manuscript. I have no further concerns.

    3. Reviewer #2 (Public review):

      The authors have made several corrections to the original manuscript. For example, they revised the bootstrapping analysis to avoid arbitrarily inflating the degrees of freedom. However, most substantive concerns remain inadequately addressed.

      (1) The primary issue is still the lack of baseline models against which to benchmark the predictive performance of the proposed DenseNet model. This concern was raised independently by two reviewers. Without such benchmarks, it is difficult to interpret the reported results in the context of prior work on MRI-based cognition prediction.

      Notably, the authors state: "While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework, we did not conduct a comprehensive benchmark across all available machine learning methods, nor was this the aim of the present study."

      However, I could NOT find any discussion or results related to the CPM model in the manuscript. It is therefore unclear whether the DenseNet model was actually statistically compared with CPM, and, if so, how the comparison was conducted.

      Note that the statement, "While Vieira et al. show that the majority (76%) of prior studies used linear modeling approaches, including CPM and penalized regressions, these models are often vulnerable to overfitting, especially when applied to high-dimensional fMRI data," is not entirely accurate. Linear models typically have far fewer parameters than deep-learning models and are therefore often less prone to overfitting. In fact, it is well established that deep-learning models are particularly susceptible to overfitting and usually require substantially larger sample sizes to achieve stable and reliable performance. Although deep-learning models may outperform shallower models once sufficient data are available and training is well controlled, this does not justify the authors' claim as stated. I therefore disagree with the argument put forward by the authors.

      The authors further justify the absence of benchmarking by stating: "In this context, deep learning was employed as a flexible framework capable of modelling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach." However, most shallow models can likewise be applied across different brain states and cognitive targets. This rationale does not establish deep learning as a uniquely appropriate or necessary choice. If deep learning is indeed a better approach in this context, the authors should demonstrate this empirically through appropriate benchmarking against established baseline models.

      (2) Additional analysis shows that "BCG is not significantly associated with cognition itself". This is the most perplexing result. It is like saying that the Brain Age Gap is not related to chronological age. That would be counterintuitive, since the Brain Age Gap is calculated as predicted age minus chronological age, and most research has shown a strong relationship between the Brain Age Gap and age.

      If the brain cognition gap is not related to cognition, is it possible that the results found are mainly due to the predictive model not fitting well with another dataset? Regardless, the lack of association between BCG and cognition deserves a discussion.

      (3) I still do not fully understand the rationale of the mediation analysis. The analysis and findings are still not related to aims 1 and 2, since DA and entropy are not part of the prediction models. But I appreciate the explanation that this part is related to the authors' previous work, and that the authors attempted to link to them somehow.
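      The concern in point (2) above can be illustrated with a small simulation. The sketch below uses purely synthetic data and hypothetical variable names; it is not the authors' pipeline. It assumes the gap is defined as predicted minus observed score and that the model's predictions shrink toward the mean. Under those assumptions the raw gap correlates strongly with the observed score, whereas a gap that has been residualized on the observed score is uncorrelated with it by construction.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic, hypothetical data: an imperfect model whose predictions
# regress toward the mean (slope 0.3 instead of 1.0).
rng = random.Random(1)
n = 400
observed = [rng.gauss(0, 1) for _ in range(n)]
predicted = [0.3 * o + rng.gauss(0, 0.5) for o in observed]

# Raw gap: predicted minus observed score.
raw_gap = [p - o for p, o in zip(predicted, observed)]
# "Corrected" gap: the part of the raw gap not linearly explained by the observed score.
corrected_gap = residualize(raw_gap, observed)

r_raw = pearson(raw_gap, observed)
r_corrected = pearson(corrected_gap, observed)
print(f"raw gap vs observed:       r = {r_raw:.2f}")
print(f"corrected gap vs observed: r = {r_corrected:.2f}")
```

      A near-zero correlation between a gap measure and the observed score is therefore what one would expect after this kind of correction, and a strongly negative one is expected without it.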

    4. Author response:

      The following is the authors’ response to the original reviews.

      In the revised version, our primary focus has been to more clearly demonstrate the unique contribution of the brain-cognitive gap (BCG) beyond what is captured by cognitive performance alone, and to show that the BCG is not trivially driven by the observed cognitive scores. Additional analyses now demonstrate that the BCG provides complementary and nuanced information regarding factors associated with cognitive resilience, above and beyond the cognitive measures themselves.

      In response to the comment regarding the inclusion of a baseline predictive model, we would like to clarify that the central aim of our study is to compare predictive utility across different cognitive states (resting state, movie watching, and n-back), rather than to establish a single universally optimal prediction model. Several previous studies have already systematically compared deep learning approaches with more traditional machine learning methods for functional connectome-based prediction. In contrast, the goal of the present study is to examine how brain state modulates the ability of AI-based functional connectome models to capture individual differences in working memory and episodic memory.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors attempted to identify whether a new deep-learning model could be applied to both resting and task state fMRI data to predict cognition and dopaminergic signaling. They found that resting state and movie watching conditions best predict episodic memory, but only movie watching predicts both episodic and working memory. A negative 'brain gap' (where the model trained on brain connectivity predicts worse performance than what is actually observed) was associated with less physical activity, poorer cardiovascular function, and lower D1R availability.

      Strengths:

      The paper should be of broad interest to the journal's readership, with implications for cognitive neuroscience, psychiatry, and psychology fields. The paper is very well-written and clear. The authors use two independent datasets to validate their findings, including two of the largest databases of dopamine receptor availability to link brain functional connectivity/activity with neurochemical signaling.

      Weaknesses:

      The deep learning findings represent a relatively small extension/enhancement of knowledge in a very crowded field.

      It's unclear from these results how much utility the brain gaps provide above and beyond observed performance. It would be helpful to take a median split of the dataset on observed performance and plot aside the current Figure 3 results to see how the cardiovascular and physical activity measures differ based on actual performance. Could the authors perform additional analyses describing how much additional variance is explained in these measures by including brain gaps?

      We thank the reviewer for raising this important point. In response to their request, we first examined the relationship between the BCG and the cognitive measure itself. We did not find any significant relationship in either the DyNAMiC sample (r = 0.01, p = 0.939) or the COBRA sample (r = 0.01, p = 0.894) (see Author response image 1).

      Author response image 1.

      We then conducted additional analyses, splitting the sample into high and low EM performers, and compared their levels of physical activity and Framingham cardiovascular disease (CVD) risk scores. We found no significant difference in physical activity (DyNAMiC: p = 0.56, 95% CI: –14.99 to 8.13; COBRA: p = 0.29, 95% CI: –3.54 to 1.05) or Framingham CVD risk score (DyNAMiC: p = 0.11, 95% CI: –1.08 to 10.72; COBRA: p = 0.41, 95% CI: –1.86 to 4.58) between high and low EM performers. Given the significant difference in physical activity and Framingham CVD risk score between positive and negative BCG groups, our results support the view that BCG provides unique information, beyond the observed cognitive measure (episodic memory score), regarding factors that contribute to cognitive resilience. These results have been added to Section 2.4, and Figure 3 has been updated.
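      The two analyses requested by the reviewer (a median split on observed performance, and the additional variance explained by the gap) can be sketched as follows. This is a minimal illustration on synthetic data with hypothetical variable names, not the analysis reported above. It uses the fact that the increase in R^2 from adding a second predictor equals the squared semipartial correlation of that predictor with the outcome.

```python
import random
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def median_split_difference(score, outcome):
    """Mean outcome of the above-median group minus the at-or-below-median group."""
    cut = median(score)
    high = [o for s, o in zip(score, outcome) if s > cut]
    low = [o for s, o in zip(score, outcome) if s <= cut]
    return mean(high) - mean(low)


def incremental_r2(outcome, observed, gap):
    """R^2 of outcome ~ observed, and the extra R^2 gained by adding gap."""
    r2_base = pearson(outcome, observed) ** 2
    # Squared semipartial correlation = increase in R^2 from adding `gap`.
    semipartial = pearson(outcome, residualize(gap, observed))
    return r2_base, semipartial ** 2


# Synthetic, hypothetical data: the outcome depends on the gap,
# not on the observed memory score.
rng = random.Random(0)
n = 500
observed_em = [rng.gauss(0, 1) for _ in range(n)]
gap = [rng.gauss(0, 1) for _ in range(n)]
activity = [0.6 * g + rng.gauss(0, 0.8) for g in gap]

split_diff = median_split_difference(observed_em, activity)
r2_base, delta_r2 = incremental_r2(activity, observed_em, gap)
print(f"median-split difference in outcome: {split_diff:.3f}")
print(f"R^2 from observed score alone: {r2_base:.3f}; added by gap: {delta_r2:.3f}")
```

      Reporting the added R^2 alongside the group comparison would quantify how much the gap contributes beyond observed performance, which a median split alone cannot show.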

      Some of the imaging findings require deeper analysis. For Figure 1f - Which default mode regions have high salience? DMN is a huge network with subregions having differing functions.

      Grad-CAM provides a coarse, gradient-based attribution that reflects how the learned feature maps contribute to the model output. It is not designed to produce specific input-level interpretations, such as symmetric edge-wise importance values. Therefore, the primary interpretation remains at the network level rather than at the level of individual FC edges.

      Along the same lines, were the striatal D1R findings regionally specific at all? It would be informative to test whether the three nuclei (Accumbens, Caudate, Putamen) and/or voxelwise models would show something above and beyond what is achieved from averaging D1R across the striatum. What about cortical D1R, which is highly abundant, strongly associated with cognitive (especially WM) performance, and has much unique variance beyond striatal D1R? https://www.science.org/doi/full/10.1126/sciadv.1501672. The PET findings are one of the unique strengths of this paper and are underexplored. It's also unclear if the measure of brain entropy should simply be averaged across all regions.

      In this study, we focused on D1DR/D2DR averaged across the caudate and putamen, which our previous work has found to be more strongly associated with cognitive functions (Johansson et al., 2023; Nyberg et al., 2016) than the nucleus accumbens, which tends to show lower D1DR/D2DR levels and limited association with these cognitive domains. Following the Reviewer’s suggestion, we examined regional variations and found that while both caudate and putamen D1DR showed significant associations with BCG, there were no significant associations for D1DR in the nucleus accumbens or DLPFC with BCG. For D2DR, we observed a significant association between caudate/putamen D2DR and BCG.

      D1DR:

      Partial correlation between:

      Caudate_Bilateral vs. NegGap, r =0.37, p =0.02

      Putamen_Bilateral vs. NegGap, r =0.34, p =0.03

      Accumbens_Bilateral vs. NegGap, r =0.07, p =0.69

      Mean (LRCaud, LRput, LRacc) vs NegGap, r =0.35, p =0.03

      DLPFC_Bilateral vs NegGap, r =0.21, p =0.21

      Striatum_Bilateral (Mean (LRCaud, LRput)) vs. NegGap, r =0.40, p =0.01

      Caudate_Bilateral vs. PosGap, r=–0.37, p=0.02

      Putamen_Bilateral vs. PosGap, r=–0.53, p=0.02

      Accumbens_Bilateral vs. PosGap, r=–0.25, p=0.31

      Mean (LRCaud, LRput, LRacc) vs PosGap, r=–0.41, p=0.08

      DLPFC_Bilateral vs. PosGap, r=–0.30, p=0.21

      Striatum_Bilateral (Mean (LRCaud, LRput)) vs. PosGap, r=–0.49, p=0.03

      Author response image 2.

      D2DR:

      Correlation between:

      Caudate_Bilateral vs. NegGap, r=0.36, p=0.0003

      Putamen_Bilateral vs. NegGap, r=0.22, p=0.03

      Accumbens_Bilateral vs. NegGap, r= –0.01, p=0.91

      Mean (LRCaud, LRput, LRacc) vs PosGap, r= –0.24, p=0.01

      Striatum_Bilateral vs. NegGap, r=0.39, p=0.0001

      Caudate_Bilateral vs. PosGap, r= –0.34, p=0.004

      Putamen_Bilateral vs. PosGap, r= –0.37, p=0.002

      Accumbens_Bilateral vs. PosGap, r= –0.21, p=0.09

      Mean (LRCaud, LRput, LRacc) vs PosGap, r= –0.38, p=0.001

      Striatum_Bilateral vs. PosGap, r= –0.49, p=0.0001

      We have added the following sentence to the Results section to highlight these regional differences in D1DR/D2DR in relation to BCG.

      “Both D1DR and D2DR availability in the striatum were associated with BCG, such that lower dopamine receptor availability was linked to a greater behavioral-cognitive gap. However, these associations varied by region. For D1DR, significant correlations with BCG were observed in the caudate (positive gap: r = –0.37, p =0.02; negative gap: r= 0.37, p =0.02) and putamen (positive gap: r = –0.53, p=0.02; negative gap:r=0.34, p=0.03), but not in the nucleus accumbens (positive gap: r= –0.25, p= 0.31; negative gap: r =0.07, p=0.69) or the DLPFC (positive gap: r = –0.30, p=0.21; negative gap: r =0.21, p=0.21). For D2DR, both caudate (positive gap: r = –0.34, p=0.004; negative gap: r =0.36, p=0.0003) and putamen (positive gap: r = –0.37, p=0.002; negative gap: r =0.22, p=0.03) showed significant associations with BCG.”

      Author response image 3.

      It is not clear from the text that the authors met the preconditions for mediation analysis (that is, demonstrating significant correlations between D1R and entropy, in addition to the correlation with brain gap). The authors should report this as well.

      This is a fair question. We recalculated entropy in the striatum, given that D1DR is more strongly expressed in this region and, therefore, reduced striatal D1DR may have a more pronounced impact on local entropy (as the reviewer suggested, it may not be appropriate to compute entropy across all brain regions). Our analyses showed that lower D1DR/D2DR levels were associated with higher entropy, which in turn was related to higher BCG.

      DyNAMiC; negative gap:

      Partial correlation between:

      Entropy and D1DR, r = –0.33, p=0.04.

      Entropy and NegGap, r = –0.36, p=0.03.

      DyNAMiC; positive gap:

      Partial correlation between:

      Entropy and D1DR, r = –0.56, p=0.01.

      Entropy and PosGap, r = 0.47, p=0.04.

      COBRA; negative gap:

      Correlation between:

      Entropy and D2DR, r = –0.22, p=0.03.

      Entropy and NegGap, r = –0.27, p=0.007.

      COBRA; positive gap:

      Correlation between:

      Entropy and D2DR, r = –0.26, p=0.03.

      Entropy and PosGap, r = 0.25, p=0.03.

      We have added these results to Results Section 2.6. We have also updated Figure 4 in the revised manuscript to report these correlations.
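      For readers unfamiliar with the partial correlations listed above, the following minimal sketch shows one common way to compute them. It assumes a single covariate (age is used here as a hypothetical example) and synthetic data; the covariates actually used in the response are not restated in this section. The residual-based estimate is checked against the standard closed-form expression.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def partial_corr(x, y, covariate):
    """Correlation of x and y after removing the linear effect of one covariate."""
    return pearson(residualize(x, covariate), residualize(y, covariate))


def partial_corr_formula(x, y, z):
    """Closed-form first-order partial correlation, used here as a cross-check."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5


# Synthetic, hypothetical data: receptor availability declines with age,
# and entropy rises with age and with lower receptor availability.
rng = random.Random(2)
n = 200
age = [rng.uniform(20, 80) for _ in range(n)]
receptor = [-0.02 * a + rng.gauss(0, 0.3) for a in age]
entropy = [0.01 * a - 0.5 * r + rng.gauss(0, 0.3) for a, r in zip(age, receptor)]

r_zero_order = pearson(receptor, entropy)
r_partial = partial_corr(receptor, entropy, age)
r_formula = partial_corr_formula(receptor, entropy, age)
print(f"zero-order r = {r_zero_order:.2f}, partial r (age removed) = {r_partial:.2f}")
```

      In this simulated example the zero-order correlation is inflated by the shared dependence on age, which is why the partial estimate is the more relevant quantity when age covaries with both measures.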

      Was age controlled for in the mediation analysis? I would not consider this result valid unless that is the case.

      We used the mediation package in R and controlled for age by adding it as a covariate in both the mediator model and the outcome model. The following information has been added to the Methods section of the revised manuscript.

      “To assess the statistical significance of this mediation effect, we employed the bootstrapping method as outlined by Preacher and Hayes (145) and age has been controlled for in all statistical analysis.”
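      The logic of a bootstrapped mediation test with age as a covariate in both models can be sketched as below. This is an illustrative product-of-coefficients estimate with a percentile bootstrap, written in plain Python on synthetic data; it is not a reproduction of the R mediation package or of the authors' analysis, and all variable names and effect sizes are hypothetical. Removing age from every variable first is equivalent to including age as a covariate in both the mediator and the outcome model (the Frisch-Waugh-Lovell result).

```python
import random
from statistics import mean


def slope(y, x):
    """Slope of the simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)


def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    b = slope(y, x)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]


def indirect_effect(x, m, y, age):
    """Indirect effect a*b of x on y through m, with age partialled out."""
    xr, mr, yr = residualize(x, age), residualize(m, age), residualize(y, age)
    a = slope(mr, xr)  # mediator model: m ~ x + age
    # Outcome model: y ~ m + x + age; b is the coefficient on m.
    b = slope(residualize(yr, xr), residualize(mr, xr))
    return a * b


def bootstrap_ci(x, m, y, age, n_boot=500, seed=0):
    """95% percentile bootstrap interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample participants with replacement
        estimates.append(indirect_effect([x[i] for i in idx], [m[i] for i in idx],
                                         [y[i] for i in idx], [age[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


# Synthetic, hypothetical data with a true indirect effect of (-0.6) * 0.5 = -0.3.
rng = random.Random(3)
n = 200
age = [rng.gauss(0, 1) for _ in range(n)]
receptor = [-0.5 * a + rng.gauss(0, 1) for a in age]
entropy = [-0.6 * r + 0.3 * a + rng.gauss(0, 0.5) for r, a in zip(receptor, age)]
gap = [0.5 * e + 0.2 * a + rng.gauss(0, 0.5) for e, a in zip(entropy, age)]

point = indirect_effect(receptor, entropy, gap, age)
low, high = bootstrap_ci(receptor, entropy, gap, age)
print(f"indirect effect = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

      An interval that excludes zero is the usual criterion for a significant indirect effect under this approach; in this simulated case the interval should lie below zero.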

      The discussion section is long, but the authors would do better to replace some less helpful sections (e.g., the paragraph on methodological tweaks to parcellations and model alignment) with a couple of other important points, including:

      (1) Discuss the 'sweet-spot' of movie watching for behavior prediction in the context of studies showing that task states 'quench' neural variability: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007983. This may not be mutually exclusive of the discussion on dopamine and signal-to-noise ratio, but it would be helpful for the authors to discuss their potential overlap vs. unique contributions to the observed findings.

      Thank you for the comment. We have now removed the section about methodological tweaks and extended the discussion of the task sweet spot for behavioral prediction, referencing the paper that the reviewer suggested. The paragraph discussing this topic reads:

      “Additionally, previous research showed that movie-watching alters the propagation of activity across cortical pathways (105), particularly within and between regions involved in audiovisual processing and attention. These alterations lead to a less segregated and more integrated network organization (106). Similarly, the n-back task has been associated with increased integration of task-positive cortico-cortical connectivity (104, 107) and striato-cortical connectivity (102). Our findings also suggest that certain task contexts strike an optimal balance between reducing neural variability and maintaining sufficient richness to capture individual differences. Prior work shows that task states quench neural variability, leading to a more reliable and predictable neural signal (108). In this context, movie watching may represent such a sweet spot constraining neural dynamics through shared audiovisual stimulation, while simultaneously engaging a broad range of cognitive processes that preserve individual differences.”

      (2) The argument that dopamine signaling increases signal-to-noise ratio is based on some preclinical data as well as correlational data using fMRI with pharmacological challenges. It is less clear how PET-derived estimates of D1R and D2R availability equate to 'dopamine signaling' as it is thought of in this context. Presumably, based on these data, higher D1R or D2R availability would be related to greater levels of tonic dopaminergic signaling. However, in the case of the COBRA dataset with D2R estimates, those are based on raclopride -- which competes with endogenous dopamine for the D2 receptor. Therefore, someone with higher levels of endogenous dopamine signaling should theoretically have lower raclopride binding and lower D2R estimates. I'm not arguing that the authors' logic is flawed or that D1R and D2R are not good measures of dopamine signaling, but I'd ask the authors to dig into the literature and describe more direct potential links for how greater receptor availability might be associated with greater dopamine signaling (and hence lower entropy). Adding this to the discussion would be very valuable for PET research.

      Thank you for raising this important point. We agree that D1R and D2R availability should not be taken as direct proxies of dopamine signaling. However, prior work has suggested meaningful associations between pre- and post-synaptic markers. For instance, a well-powered study demonstrated a significant correlation between D2R availability and dopamine synthesis capacity measured by FMT (Berry et al., 2018). This finding supports the idea that postsynaptic receptor markers may, under certain conditions, serve as an indirect proxy for dopaminergic signaling. Moreover, the number of dopamine-producing neurons innervating the striatum during development has been proposed to shape the structural maturation and arborization of dendrites (McAllister, 2000; Whitford et al., 2002), potentially providing a structural and functional basis for observed associations between pre- and post-synaptic measures.

      At the same time, smaller-scale studies have yielded mixed findings, reporting either non-significant associations (Heinz et al., 2005; Kienast et al., 2008) or negative correlations (Ito et al., 2011). Importantly, the latter studies employed [18F]FDOPA to index dopamine synthesis, which has been argued to provide a less reliable estimate of synthesis capacity compared to FMT, as used in Berry et al. (2018). These inconsistencies underscore that the relationship between pre- and post-synaptic markers is not straightforward and requires further examination in larger, well-powered samples. The following paragraph has been added to the discussion.

      “An important caveat is that D1DR and D2DR availability do not provide a direct measure of dopamine signaling. Instead, they reflect receptor availability, which interacts with endogenous dopamine in a complex manner. PET measures of D1R and D2R availability reflect the density of unoccupied dopamine receptors and the degree to which endogenous dopamine competes with radioligand binding. D2R binding potential is sensitive to competition from synaptic dopamine, such that higher ambient dopamine generally reduces tracer binding; D1R binding, however, is less affected by endogenous dopamine under physiological conditions, reflecting more directly receptor expression levels. Previous studies demonstrated a significant association between D2R availability and dopamine synthesis capacity measured by FMT (117, 118), suggesting that postsynaptic receptor markers may, under certain conditions, serve as a proxy for dopaminergic signaling. Developmental factors, such as the number of dopamine-producing neurons innervating the striatum, may further influence the structural and functional relationship between pre- and post-synaptic markers. By contrast, smaller studies have reported non-significant (119, 120) or negative (121) associations, although these studies relied on [18F]FDOPA, which is considered a less precise index of dopamine synthesis than FMT. Taken together, these reports indicate that the relationship between pre- and post-synaptic markers is complex and not necessarily linear. Accordingly, our observation that lower receptor availability is associated with greater neural variability should not be interpreted as direct evidence of weaker dopaminergic signaling, but rather as reflecting the interplay between receptor density and endogenous dopamine occupancy, particularly in the case of D2DR.”

      Reviewer #2 (Public review):

      Summary:

      The authors developed a deep learning model based on a DenseNet CNN architecture to predict two cognitive functions: working memory and episodic memory, from functional connectivity matrices. These matrices were recorded under three conditions: during rest, a working memory task, and a movie, and were treated as images for the CNN algorithm. They tested their model's performance across different conditions and a separate dataset with a different age distribution (using the same MRI scanner, scanning configurations, and cognitive tests). They also calculated the "brain cognition gap" based on the model trained on resting functional connectivity to predict working memory. Extending from the commonly used index "brain age," the brain cognition gap was defined as the difference between the working memory score predicted by their model (predicted working memory) and the working memory score based on the working memory test itself (observed working memory). This brain cognition gap was found to be associated with physical activity, education, and cardiovascular risk. The authors also conducted additional mediation tests to examine whether regional functional variability mediated the relationship between PET-derived measures of dopamine and the brain cognition gap.

      Strengths:

      The major strength of this manuscript is the extensive effort the authors have put into creating a new 'biomarker' that links deep learning with fMRI, PET, physical activity, education, and cardiovascular risk across two studies. This effort is impressive.

      Weaknesses:

      There are several weaknesses in the current methods and results, making many of the claims unconvincing. These weaknesses include:

      (1) The lack of baseline models to benchmark the predictive performance of their DenseNet models.

      (2) The inappropriate calculation of the brain cognition gap due to the lack of control for regression-toward-the-mean and the influence of the working memory itself (a common practice in brain age studies).

      (3) The lack of benchmarking of the brain cognition gap against the 'corrected' brain age gap and the direct prediction of physical activity, education, and cardiovascular risk.

      (4) Minimal justification for their PET mediation analysis.

      We appreciate the reviewer’s constructive comments on the strengths and weaknesses of our study. In this revised version, we’ve addressed the concerns regarding the calculation of the brain-cognitive gap, clarified the unique variance that the brain-cognitive gap contributes beyond cognition itself, and provided additional justification for the PET mediation analysis. For the lack of a baseline model, it is important to highlight that our aim has never been to compare the predictive power of different deep learning or machine learning approaches. Therefore, the text in the introduction and discussion has been amended to avoid miscommunication on this topic.

      Regarding the impact of the work on the field and the utility of the methods and data to the community, I see its potential. However, addressing all the weaknesses listed above is crucial and likely to change the conclusions of the results.

      It is important to note that many statements in the manuscript are overstated, making the contribution of the manuscript seem exaggerated.

      We have run additional analyses based on the reviewer’s suggestions. Although the effect sizes and statistical values changed with the corrections, the overall conclusions remain largely consistent. The relationships between the brain-cognition gap and key factors such as physical activity and cardiovascular risk persisted. We have updated the manuscript accordingly and revised the relevant sections to reflect these refinements and the resulting interpretations.

      For instance, the abstract claims "there is a lack of objective biomarkers to accurately predict cognitive function," and the discussion states, "across various studies, the correlation between predicted and actual fluid intelligence typically hovers around 0.25 (98-100)." However, a meta-analysis by Vieira and colleagues (2022 https://doi.org/10.1016/j.intell.2022.101654) found over 37 studies up to 2020 predicting cognitive abilities from fMRI with machine learning, with 24 studies published in 2019-20 alone. Since 2020, with the rise of machine learning and AI, even more studies have likely been published on this topic, all claiming to show objective biomarkers to accurately predict cognitive function. Vieira and colleagues also found an average performance of these objective biomarkers in predicting general cognition at r = .42, similar to what was found in this manuscript. Based on this alone, it is unclear how novel or superior their method is without a proper systematic benchmark.

      We appreciate the opportunity to clarify our study’s contribution relative to prior work. We have revised the introduction and discussion to highlight the contribution of other methods when it comes to biomarkers. As for the comment related to the work by Vieira and colleagues, Vieira et al. (2022) indeed present a comprehensive meta-analysis of studies predicting general and fluid intelligence using neuroimaging and machine learning. However, there are two critical differences between our work and previous work:

      Target Cognitive Domains:

      Our study does not focus on general or fluid intelligence, but rather on comprehensive EM (3 tests) and WM (3 tests), two distinct cognitive domains that are critically important for aging research. These distinct abilities, each measured here by three independent tests to boost reliability, are less frequently studied as predictive targets in the existing fMRI-ML literature, particularly using deep learning methods.

      Critically, our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back), with the aim of identifying the states that best capture individual differences across domains. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.

      Our primary objective is to test how brain state influences the ability of functional connectivity to predict domain-specific cognitive performance, using a deep learning framework. As now stated explicitly in the revised manuscript, this objective is operationalized through three clearly defined aims:

      (1) To compare the predictive utility of functional connectomes derived from different brain states (resting state, movie watching, and n-back task) for EM and WM;

      (2) To introduce and evaluate a brain-cognition gap as a marker of individual differences beyond chronological age; and

      (3) To examine the contribution of dopaminergic integrity to variability in connectome uniqueness and brain-cognition gaps.

      We have revised the manuscript text to make this focus clearer and to avoid any misinterpretation of our aims. Specifically, we removed statements in the Discussion that could be read as suggesting that our deep learning approach outperforms prior machine learning methods. While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework for some of the prediction models, we did not conduct a comprehensive benchmark across all available machine learning methods nor was this the aim of the present study. Accordingly, we have adjusted the text to avoid implying methodological/biomarker superiority beyond the scope of our analyses.

      Modeling Approach:

      While Vieira et al. show that the majority (76%) of prior studies used linear modeling approaches, including CPM and penalized regressions, these models are often vulnerable to overfitting, especially when applied to high-dimensional fMRI data. Our use of a DenseNet-based CNN architecture is motivated by the need to leverage inductive biases suited to functional connectivity data, and we evaluate this approach across multiple cognitive tasks and independent datasets.

      Vieira and colleagues report that studies predicting general intelligence from fMRI (particularly from the HCP dataset) average around r =0.42, while those predicting fluid intelligence average around r =0.15. Our original claim about the correlation hovering around 0.25 is therefore not incorrect – and aligns with the Vieira meta-analysis. We have, however, nuanced this statement in the manuscript, now stating that correlations are higher for general intelligence than fluid intelligence.

      Altogether, we considered the reviewer’s comments and therefore conducted a careful revision of the manuscript text to moderate and clarify statements that may have come across as overstated. We have refined the language throughout the Introduction and Discussion sections to better align with the strength of the evidence and the scope of our contributions. A few examples are:

      “Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back), with the aim of identifying the states that best capture individual differences across domains. The relative performance of deep learning and other non-linear approaches depends on multiple factors, including sample size, model architecture, feature representation, and domain-specific characteristics of the prediction target. In this context, deep learning was employed as a flexible framework capable of modeling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.”

      Also on page 14:

      “Our study introduces a deep neural network architecture that features dense connections and incorporates an attentional mechanism. While our findings demonstrate that a deep learning framework can provide reasonable predictive accuracy, it is important to note that other machine learning approaches (e.g., tree-based models) may offer comparable predictive power, as suggested by prior benchmarking work (29, 30).”

      Similarly, the authors claim superior performance of deep learning and mischaracterize machine learning algorithms: "In particular, deep neural networks (DNN) methods have been successfully applied to behavioral and disease prediction (24-26), and have been found to outperform other machine learning approaches (27-29)," and "Deep learning approaches overcome the limitation of predictive techniques that solely rely on linear associations between connectivity and behavioral phenotypes (17)." However, the superiority of deep learning is debatable. Studies show comparable performance between machine learning (such as kernel regression) and deep learning (such as fully-connected neural networks, BrainNetCNN, Graph CNN (GCNN), and temporal CNN), e.g., He and colleagues (2019; https://doi.org/10.1016/j.neuroimage.2019.116276) and Vieira and colleagues (2024; https://doi.org/10.1101/2024.03.07.583858).

      We agree that the performance gap between traditional machine learning models and deep learning (which is a subcategory of machine learning) in neuroimaging is debatable and task-dependent. Indeed, both He et al. (2019) and Vieira et al. (2024) offer evidence that kernel regression can achieve performance on par with deep learning models when applied to appropriate datasets.

      We have therefore nuanced the statements in the revised version of the manuscript as follows:

      Introduction:

      “In particular, deep neural networks (DNN) methods have been successfully applied to behavioral and disease prediction (24-26), and were initially expected to outperform other machine learning approaches (27-29). However, this superiority remains debatable, as recent studies have reported comparable performance between DNNs and traditional methods (He et al.,2019; Vieira et al.,2024). Accordingly, the present study does not aim to benchmark deep learning against traditional machine learning approaches, but instead uses a consistent predictive framework to examine how brain state influences the utility of FC for cognitive prediction.”

      “Deep learning approaches offer a flexible modeling framework capable of capturing complex non-linear associations in high-dimensional data with potentially less sensitivity to training on a smaller subsample (Vieira et al., 2024)”.

      Discussion:

      We agree that traditional methods, such as kernel-based models, tree ensembles, and non-linear SVRs, can also effectively capture such relationships. The relative performance of our model and other non-linear approaches depends on several factors, including data size, model architecture, and domain-specific considerations. We have included additional explanations in the discussion to address this.

      Moreover, many non-deep learning predictive techniques are non-linear, e.g., XGBoost, CatBoost, random forest, kernel ridge, and support vector regression with non-linear kernels (such as RBF and polynomial). Thus, stating that machine learning can only model linear relationships is incorrect. Moreover, for the small amount of data the authors had, some might argue that a linear algorithm might be more appropriate to balance the bias-variance trade-off in prediction. Again, without a proper systematic benchmark, it is unclear how well their DenseNet algorithm performs compared to other algorithms.

      Thank you for bringing this up. We have now removed statements implying that machine learning can only model linear relationships.

      Regarding the Brain Age literature, the authors also misinterpreted recent findings: "However, a recent study suggests that brain age predictions contribute minimally compared to chronological age for explaining cognitive decline (65), implying that cognitive predictions are more reliable." In this study, Tetereva and colleagues (2024) (https://doi.org/10.7554/eLife.87297.4) showed that non-deep-learning machine learning can make good predictions from MRI on both chronological age (with r up to .88) and fluid cognition (with r up to .627). Using the combination of functional connectivity matrices across rest and tasks to predict fluid cognition, they found performance at r = .565, comparable to what was found in the current manuscript with deep learning. Nonetheless, while brain age predicted chronological age well (and brain cognition predicted fluid cognition well), it was problematic to predict fluid cognition from brain age. They showed that, because brain age, by design, shared so much common variance with chronological age, brain age and chronological age captured the same variance of fluid cognition. When chronological age was controlled for in the prediction of fluid cognition, brain age no longer had high predictive ability. In the case of the current manuscript, the brain cognition gap is not appropriately controlled for cognition (to be more precise, a working memory score). I expect the performance in predicting physical activity, education, and cardiovascular risk will drop dramatically once cognition is controlled for. There are at least two ways to control cognition according to Tetereva and colleagues' study (see more in the recommendations).

      We thank the reviewer for breaking down the findings in the study by Tetereva and colleagues (2024). It was not our intention to suggest that Tetereva et al. showed brain age has little predictive value in general. Our understanding of the findings reported in that study is in line with the reviewer’s clarifications. We have now revised the introduction to avoid any misunderstanding:

      “A recent study demonstrated that while brain age can predict chronological age with high accuracy from MRI, its utility for predicting cognition is limited. Specifically, Tetereva and colleagues (2024) showed that brain age strongly tracks chronological age and that brain cognition (using functional connectivity) can predict fluid cognition. Yet, when used to predict cognition, brain age largely overlapped with chronological age, such that controlling for chronological age eliminated the predictive contribution of brain age. This finding suggests that brain-age models may provide little unique explanatory power for cognitive decline beyond what is already captured by chronological age. Building on this observation and extending the concept of a brain-age gap to a brain-cognition gap (BCG, defined as the discrepancy between predicted and observed cognitive performance), we propose that a BCG may serve as an informative marker of individual differences.”

      In addition, in response to the first comment from Reviewer 1, we have extended our results in the manuscript. We first showed that BCG is not significantly associated with cognition itself (see Author response image 1). Moreover, we conducted additional analyses, splitting the sample into high and low EM performers, and compared their levels of physical activity and Framingham cardiovascular risk scores. We found no significant difference in physical activity (DyNAMiC: p =0.56, 95% CI: -14.99 – 8.13; COBRA: p =0.29, 95% CI: -3.54 – 1.05) or Framingham CVD risk score (DyNAMiC: p =0.11, 95% CI: -1.08 – 10.72; COBRA: p =0.41, 95% CI: -1.86 – 4.58) between high and low EM performers. Given the significant difference in physical activity and Framingham CVD risk score between positive and negative BCG groups, our results support that BCG provides unique information, beyond cognitive measures, regarding factors that contribute to cognitive resilience. This text has been added to the Results section, and Figure 3 has been updated in the manuscript.
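
      To make the logic of this check concrete, the sketch below computes a brain-cognition gap as the predicted minus the observed score, correlates it with the observed score, and then removes any linear dependence on the observed score by residualization, analogous to the age correction commonly applied to brain-age gaps. It is an illustration under stated assumptions, not the analysis code used in the study: the function names and the toy scores are hypothetical, and residualization is only one of several possible corrections.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def brain_cognition_gap(predicted, observed):
    """Gap = model-predicted score minus observed score, per participant."""
    return [p - o for p, o in zip(predicted, observed)]


def residualize(target, covariate):
    """Residuals of the simple regression target ~ covariate (with intercept)."""
    n = len(target)
    mc, mt = sum(covariate) / n, sum(target) / n
    slope = (sum((c - mc) * (t - mt) for c, t in zip(covariate, target))
             / sum((c - mc) ** 2 for c in covariate))
    return [t - (mt + slope * (c - mc)) for c, t in zip(covariate, target)]


# Toy (made-up) scores: predictions are pulled toward the mean, so the raw
# gap is negatively related to the observed score.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [2.5, 2.8, 3.1, 3.6, 4.2, 4.1]

gap = brain_cognition_gap(predicted, observed)
corrected_gap = residualize(gap, observed)

print("r(raw gap, observed):", round(pearson_r(gap, observed), 3))
print("r(corrected gap, observed):", round(pearson_r(corrected_gap, observed), 3))
```

      In this toy case the raw gap tracks the observed score because of regression toward the mean, whereas the residualized gap is uncorrelated with it by construction. Group comparisons can then be repeated on the corrected gap.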

      The authors mentioned, "The third aim of the current study is to uncover the contribution of dopamine (DA) integrity to brain-cognition gaps." However, I fail to see how mediation analysis would test this. The authors also mentioned, "Insufficient DA modulation can affect neurocognitive functions detrimentally (69, 74, 76-78)." They should test if DA levels are related to working memory scores in their study, and if so, whether the relationship is mediated by the "corrected" brain-cognition gaps. See the recommendations for more on the calculation of the "corrected" brain-cognition gaps.

      Our mediation was not designed to test whether DA predicts episodic memory performance directly, nor whether BCG mediates such a relationship. Instead, we specifically investigated whether the effect of DA on BCG operates through functional variability, in line with the theoretical framework emphasizing the role of DA in neuronal gain and signal-to-noise ratio (see our recent work in Korkki et al., 2025). We agree that future work could extend our approach by directly examining whether BCG mediates the link between DA and cognitive outcomes. However, in the present study, our primary focus was on testing the mechanistic pathway of DA → entropy → BCG.

      In line with this aim, we found that lower DA receptor availability was associated with larger BCGs (Figure 4). We then asked whether this relationship is mediated by functional signal variability, such that lower DA is linked to reduced signal-to-noise ratio (i.e., greater entropy), which in turn contributes to less reliable prediction of cognition and, consequently, larger BCGs. Our mediation analysis supports this pathway (please see also our reply to Reviewer 1, Comment 6).
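
      For readers less familiar with this kind of analysis, the sketch below shows one common way to estimate an indirect effect along a pathway such as DA → entropy → BCG: the product of the a-path (mediator regressed on predictor) and the b-path (outcome regressed on mediator, controlling for the predictor), with a percentile bootstrap confidence interval. It is a simplified illustration, not the procedure reported in the manuscript: covariates such as age are omitted, all function and variable names are hypothetical, and the data are synthetic.

```python
import random


def slope(x, y):
    """OLS slope of the simple regression y ~ x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


def residuals(y, x):
    """Residuals of y after regressing out x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]


def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b for the path X -> M -> Y."""
    a = slope(x, m)  # path a: M ~ X
    # Path b: coefficient of M in Y ~ M + X, obtained by regressing the
    # X-residualized Y on the X-residualized M (Frisch-Waugh-Lovell theorem).
    b = slope(residuals(m, x), residuals(y, x))
    return a * b


def bootstrap_indirect_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the indirect effect (participants resampled)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[round(alpha / 2 * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data only: lower "DA" -> higher "entropy" -> larger "BCG".
rng = random.Random(1)
da = [rng.gauss(0, 1) for _ in range(150)]
entropy = [-0.8 * d + rng.gauss(0, 0.5) for d in da]
bcg = [0.6 * e + rng.gauss(0, 0.5) for e in entropy]

print("indirect effect:", indirect_effect(da, entropy, bcg))
print("95% CI:", bootstrap_indirect_ci(da, entropy, bcg))
```

      A confidence interval that excludes zero is conventionally read as support for mediation, although with cross-sectional data it cannot establish the direction of the pathway.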

      Reviewer #3 (Public review):

      Summary:

      This paper by Esmaeili and co-authors presents a connectome prediction study to predict episodic memory and relate prediction errors to other phenotypic variables.

      Strengths:

      (1) A primary and external validation dataset.

      (2) Novel use of prediction errors (i.e., brain-cognitive gap).

      (3) A wide range of data was investigated.

      Weaknesses:

      (1) Lack of comparisons to other methods for prediction.

      (2) Several different points are being investigated that don't allow any particular one to shine through.

      (3) Some choices of analysis are not well-motivated.

      (4) How do the n-back connectomes perform for prediction if the authors do not regress task activations from the n-back task?

      We thank the reviewer for raising these important points. For the lack of comparisons to other methods, it is important to highlight that our aim has never been to compare the predictive power of different deep learning or machine learning approaches. Rather, our primary objective was to test how brain state influences the ability of functional connectivity to predict domain-specific cognitive performance, using a deep learning framework. Therefore, the text in the introduction and discussion has been amended to avoid miscommunication on this topic.

      We chose to regress out task-evoked activations based on prior work demonstrating that failing to do so can produce spurious but systematic inflation of task functional connectivity estimates (Cole et al., 2019). In that study, as well as subsequent reports (e.g., Gao et al., 2020; Gonzalez-Castillo & Bandettini, 2018), connectomes derived without activation regression tended to capture task-evoked coactivations rather than background task functional interactions, which can artificially boost predictive performance but limit interpretability (whether it is co-activation or intrinsic connectivity during an entire goal-oriented task) and generalizability. For this reason, our analyses focused on the more conservative approach of regressing out task activations. Accordingly, we compared predictive performance only under this preprocessing strategy.

      We have added the following sentence to clarify this in the method: “To avoid spurious inflation of task functional connectivity by task-evoked activations, we regressed out task activation patterns from the n-back data prior to estimating functional connectivity, following recommendations by Cole et al. (2019) and related work.”
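
      The effect of this preprocessing step can be illustrated with a deliberately simplified sketch: two synthetic regions share nothing but a common task-evoked response, so their raw correlation is high, and it largely disappears once the task regressor is regressed out of each time series. The single boxcar regressor, the function names, and the data are assumptions made for illustration; real pipelines typically use a fuller design matrix (for example, finite impulse response regressors) rather than one boxcar.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def regress_out(signal, regressor):
    """Residuals of the regression signal ~ regressor (with intercept)."""
    n = len(signal)
    mr, ms = sum(regressor) / n, sum(signal) / n
    beta = (sum((r - mr) * (s - ms) for r, s in zip(regressor, signal))
            / sum((r - mr) ** 2 for r in regressor))
    return [s - (ms + beta * (r - mr)) for r, s in zip(regressor, signal)]


# Synthetic time series: both regions respond to the task but are otherwise
# independent, so any raw correlation reflects co-activation only.
rng = random.Random(0)
n_vols = 200
# Hypothetical boxcar regressor: alternating 10-volume off/on blocks.
task = [float((t // 10) % 2) for t in range(n_vols)]
region_a = [3.0 * s + rng.gauss(0, 1) for s in task]
region_b = [3.0 * s + rng.gauss(0, 1) for s in task]

fc_raw = pearson_r(region_a, region_b)
fc_clean = pearson_r(regress_out(region_a, task), regress_out(region_b, task))

print("FC without task regression:", round(fc_raw, 3))
print("FC after task regression:", round(fc_clean, 3))
```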

      (5) I am a little concerned about overfitting with the convolutional neural net. For example, the drop-off in prediction performance in the external sample is stark. How does the deep learning approach used here compare to something simpler, like a connectome-based predictive model or ridge regression?

      (6) It may be nice to try the other models in the validation dataset. This would also provide a sense of the overfitting that may be going on.

      We thank the reviewer for raising this point. The prediction performance indeed dropped for episodic memory when models trained on the DyNAMiC sample were applied to the COBRA sample, whereas performance for working memory remained nearly identical across datasets. Moreover, our prediction power is on par with previous studies reporting reliable prediction of intelligence using a deep learning approach (Vieira et al., 2021; Fan et al., 2020). While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework, we did not conduct a comprehensive benchmark across all available machine learning methods, nor was this the aim of the present study.

      We have revised the manuscript text to make this focus clearer and to avoid any misinterpretation of our aims. Specifically, we removed statements in the Discussion that could be read as suggesting that our deep learning approach outperforms prior machine learning methods. Finally, we have added the following paragraph to the discussion:

      “Our study used a deep neural network architecture that features dense connections and incorporates an attentional mechanism. While our findings demonstrate that a deep learning framework can provide reasonable predictive accuracy, it is important to note that other machine learning approaches (e.g., tree-based models) may offer comparable predictive power, as suggested by prior benchmarking work (29, 30). Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back) to identify the states that best capture individual differences across domains. The relative performance of deep learning and other non-linear approaches depends on multiple factors, including sample size, model architecture, feature representation, and domain-specific characteristics of the prediction target. In this context, deep learning was employed as a flexible framework capable of modeling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.”

      (7) While predictive models increase the power over association studies, they still require large samples to prevent overfitting. Do the authors have a sense of the power their main and external validation sample sizes provide?

      We thank the reviewer for this important point. Our main sample size, together with the external validation in COBRA, is moderate for deep learning applications. To reduce the risk of overfitting, we employed several strategies, including external validation, early stopping, dropout, and regularization. As noted, performance for episodic memory decreased in the external sample, which we acknowledge, but key associations such as the link between BCG and resilience factors remained significant. Importantly, prediction of working memory was maintained across datasets, reducing the likelihood that the observed findings are driven by overfitting. We have added a statement in the Discussion to reflect on the limitations of sample size and the implications for generalizability.

      We added the following sentence to the discussion:

      “We acknowledge that our main and validation samples are moderate in size for deep learning, which constrains statistical power and generalizability. Although external validation, early stopping, dropout, and regularization help mitigate overfitting, larger samples will be needed in future work to fully establish the robustness of these predictive models.”

      (8) I am not sure that the Mann-Whitney is the correct test for comparing the distributions of prediction performances. The distributions are dependent on each other as they are each predicting the same outcomes. Using the typical degrees of freedom formula would overestimate the degrees of freedom.

      We appreciate the reviewer’s comment and agree that applying statistical tests directly to bootstrapped samples can lead to inflated or misleading p-values, as the degrees of freedom are determined by the number of bootstrap iterations rather than the actual number of independent observations.

      In our analysis, the Mann-Whitney U test was applied to 1000 bootstrapped correlation coefficients (r) for each model. While this number is relatively low and was chosen to limit overestimation of significance, we recognize that these bootstrapped samples are not independent, and thus the use of a Mann-Whitney U test can still be problematic. To address this concern, we have revised our statistical analysis. Rather than applying the Mann-Whitney U test to the bootstrapped r distributions, we now compute the difference in correlation coefficients (Δr = r_actual − r_rest) for each bootstrap iteration. We then calculate a 95% confidence interval for Δr. If this interval does not include zero, we consider the difference statistically significant. This approach avoids artificially inflating the sample size and adheres more closely to proper statistical inference.

      We have updated the Methods (the following text) and Results sections accordingly and clearly stated the limitations regarding the degrees of freedom for all tests.

      “For the bootstrap-based comparison of model performance (bootstrap resampling with 1000 iterations), no test statistic with an associated degree of freedom is reported. Instead, statistical inference is based on the bootstrap distribution of the difference in correlation coefficients (Δr) and its 95% confidence interval. As bootstrap confidence-interval–based inference does not rely on an analytic sampling distribution, degrees of freedom are not defined for this procedure.” This has now been explicitly stated in the Methods section to avoid ambiguity.

      In the Results section, we have reported these differences with their corresponding CIs.
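
      The revised comparison can be sketched as follows. The code assumes a paired resampling scheme (participants are drawn with replacement and both models are scored on the same resample) and a percentile confidence interval; neither detail is spelled out in the response, so both are assumptions, as are the function names and the synthetic data.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_delta_r_ci(observed, pred_a, pred_b,
                         n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for delta_r = r(observed, pred_a) - r(observed, pred_b).

    Both correlations are computed on the same resampled participants, so the
    dependence between the two models' predictions is preserved.
    """
    rng = random.Random(seed)
    n = len(observed)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        obs = [observed[i] for i in idx]
        r_a = pearson_r(obs, [pred_a[i] for i in idx])
        r_b = pearson_r(obs, [pred_b[i] for i in idx])
        deltas.append(r_a - r_b)
    deltas.sort()
    lower = deltas[round(alpha / 2 * n_boot)]
    upper = deltas[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic example: model A predicts the observed scores more closely than B.
rng = random.Random(42)
scores = [rng.gauss(0, 1) for _ in range(180)]
model_a = [s + rng.gauss(0, 0.8) for s in scores]
model_b = [s + rng.gauss(0, 1.5) for s in scores]

low, high = bootstrap_delta_r_ci(scores, model_a, model_b)
print("95% CI for delta r:", (round(low, 3), round(high, 3)))
print("difference significant:", not (low <= 0.0 <= high))
```

      Because inference rests on whether the interval for Δr excludes zero, the number of bootstrap iterations affects only the precision of the interval, not an apparent sample size.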

      (9) The brain cognition gap is interesting. It is very similar conceptually to the brain age gap. When associating the brain age gap with other phenotypes, typically age is regressed from the brain age gap and the other phenotype. In other words, age is typically associated with a brain age gap as individuals at the tail ages often show the largest gaps. Is the brain cognition gap correlated with episodic memory and do the group differences hold if episodic memory is controlled for?

      We thank the reviewer for the comment regarding the relationship between the brain cognition gap and episodic memory.

      Since this question was raised by all reviewers, we have conducted additional analyses. We found that BCG is independent of the cognitive measure and provides additional information, beyond cognition alone, about factors contributing to resilience. Please see our response to the first comment of Reviewer 1.

      (10) I have the same question for the dopamine results. Particularly, in the correlations that are divided by brain cognition gap sign. I could see these types of patterns arise due to a correlation with a third variable.

      For the dopamine results, we explored whether age or cognition alone might confound the dopamine–brain cognition gap relationships. However, neither was significantly correlated with the brain cognition gap groups. The associations remained significant after controlling for age, suggesting that the observed patterns are unlikely to be due to these potential third-variable confounders. This is also in line with our observation of significant associations between DA and BCG in the age-homogeneous COBRA sample. That said, we found that entropy does mediate the link between DA and BCG, suggesting that individuals with lower DA exhibit greater regional variability and, in turn, larger BCGs.

      These results have now been embedded into the manuscript. We also highlighted that age has been controlled for in reported correlation and mediation analyses.

      Recommendations for the authors:

      Reviewing Editor Comment:

      We particularly recommend that the authors: (a) compare the performance of their deep learning model with other baseline models, and (b) adjust for cognitive performance within the brain-cognition gap. These steps would strengthen the evidence base.

      We thank the editor for their comments. As for the first comment, our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back), with the aim of identifying the states that best capture individual differences across domains. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach. We have revised the manuscript text to make this focus clearer and to avoid any misinterpretation of our aims. Specifically, we removed statements in the Discussion that could be read as suggesting that our deep learning approach outperforms prior machine learning methods. While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework, we did not conduct a comprehensive benchmark across all available machine learning methods, nor was this the aim of the present study. Accordingly, we have adjusted the text to avoid implying methodological superiority beyond the scope of our analyses. Finally, we have added the following paragraph to the discussion:

      “Our study used a deep neural network architecture that features dense connections and incorporates an attentional mechanism. While our findings demonstrate that a deep learning framework can provide reasonable predictive accuracy, it is important to note that other machine learning approaches (e.g., tree-based models) may offer comparable predictive power, as suggested by prior benchmarking work (29, 30).

      Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back) to identify the states that best capture individual differences across domains. The relative performance of deep learning and other non-linear approaches depends on multiple factors, including sample size, model architecture, feature representation, and domain-specific characteristics of the prediction target. In this context, deep learning was employed as a flexible framework capable of modeling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.”

      As for the second comment, we followed the instructions by Reviewer 1. In response to their request, we first examined the relationship between the Brain-Cognitive Gap (BCG) and the cognitive measure itself. Surprisingly, we did not find any significant relationship in either the DyNAMiC sample (r =0.01, p =0.939) or the COBRA sample (r =0.01, p =0.89) (see Author response image 1).

      We then conducted additional analyses, splitting the sample into high and low EM performers, and compared their levels of physical activity and Framingham cardiovascular disease (CVD) risk scores. We found no significant difference in physical activity (DyNAMiC: p =0.56, 95% CI: –14.99 - 8.13; COBRA: p =0.29, 95% CI: –3.54 - 1.05) or Framingham CVD risk score (DyNAMiC: p =0.11, 95% CI: –1.08 - 10.72; COBRA: p =0.41, 95% CI: –1.86 - 4.58) between high and low EM performers. Given the significant difference in physical activity and Framingham CVD risk score between positive and negative BCG groups, our results support that BCG provides unique information, beyond the observed cognitive measure (episodic memory score), regarding factors that contribute to cognitive resilience. These results have been added to Section 2.4, and Figure 3 has been updated.

      Reviewer #1 (Recommendations for the authors):

      (1) The top and bottom triangles of the saliency maps, particularly in Figure 2, do not look symmetrical (this is most notable in the hotspot representing the between-network correlation of DMN and FPN). What is going on here? Was the image compressed or altered in some way, or is this a visual artifact of the interpolation method?

      We appreciate the reviewer’s insightful comment. Minor differences in the saliency maps between the upper and lower triangles of the FC matrix can arise due to several factors. For instance, Grad-CAM generates saliency maps at the resolution of the convolutional feature maps, which are then upsampled to match the input matrix dimensions. We initially used the default bilinear interpolation, which may have introduced slight asymmetries or blurring, resulting in interpolation artifacts. In response, we have reprocessed the saliency maps using spline interpolation in MATLAB. The updated saliency figures have been included in the revised version of the manuscript.

      (2) Pages 11-12. Please make it explicit in the text that the brain gap-education association was not significant in the COBRA dataset.

      Thanks for pointing this out. We added the following sentence to the discussion.

      “Note that the association with education was significant only in the DyNAMiC sample and did not reach significance in the COBRA dataset.”

      (3) Please overlay individual data points onto the boxplots in Figure 3 so that we can appropriately evaluate the data distributions.

      Figure 3 has now been updated.

      (4) Section 2.6: Was entropy calculated on movie-watching data, resting data, or all fMRI data? Please specify.

      We thank the reviewer for pointing this out. We have updated the text (Section 2.6) to clarify that entropy was calculated from the resting-state data. We intended to examine the mediating role of regional variability in the relationship between dopamine and the BCG of the winning model for episodic memory. Because resting state and movie-watching were the winning conditions for EM prediction, but movie-watching was not available in COBRA, we focused on entropy during rest, which exists in both datasets.

      (5) Was entropy during the resting state correlated with entropy during the task state, across individuals?

      We agree this is an interesting question. However, investigating the correlation of entropy between rest and task states goes beyond the scope of the present study. Our aim here was to test whether regional variability mediates the effect of dopamine on the BCG. Specifically, we examined whether individuals with lower striatal D1DR show higher local variability, which in turn relates to less accurate prediction and a larger gap. We assessed both the relationship between D1DR and entropy and the association between entropy and the gap, and these results have now been added to the manuscript (see also our response to Reviewer 1’s public comment).

      Reviewer #2 (Recommendation for authors):

      (1) The lack of baseline models to benchmark the predictive performance of their DenseNet models makes their results hard to interpret. This problem is quite common across ML literature. For instance, many DL-based algorithms were developed for tabular data without proper benchmarking against other ML algorithms. When they were properly tested, most weren't better than many tree-based ML algorithms (e.g., https://proceedings.neurips.cc/paper_files/paper/2022/file/0378c7692da36807bdec87ab043cdadc-Paper-Datasets_and_Benchmarks.pdf). I can see that a similar problem might happen here.

      For this particular manuscript, the authors made strong statements without doing a proper benchmark, e.g., from the discussion, "Indeed, the predictive power in the current study is stronger than for CPM-based predictions reported before." And "Unlike the BrainNet convolutional neural network, which focuses on staged transformations, our densely connected model promotes extensive feature reuse, possibly leading to more robust feature extraction." I hope to see the performance of the proposed algorithm against 1) other DL algorithms (e.g., fully-connected neural networks, BrainNetCNN, Graph CNN (GCNN), temporal CNN, GRU, and LSTM, see https://doi.org/10.1016/j.neuroimage.2019.116276 and https://doi.org/10.1002/hbm.26415), 2) ML algorithms (e.g., SVR with linear, RBF and polynomial kernels, Elastic Net, XGBoost, random forest, CPM), 3) data reduction algorithms (e.g., PCA regression, Partial Least Square). The results of this benchmark will substantiate the claims made by the authors.

      Our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach. We have revised the manuscript text to make this focus clearer and to avoid any misinterpretation of our aims. Specifically, we removed statements in the Discussion that could be read as suggesting that our deep learning approach outperforms prior machine learning methods. While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework, we did not conduct a comprehensive benchmark across all available machine learning methods, nor was this the aim of the present study. Accordingly, we have adjusted the text to avoid implying methodological superiority beyond the scope of our analyses. Finally, we have added the following paragraph to the discussion:

      “Our study used a deep neural network architecture that features dense connections and incorporates an attentional mechanism. While our findings demonstrate that a deep learning framework can provide reasonable predictive accuracy, it is important to note that other machine learning approaches (e.g., tree-based models) may offer comparable predictive power, as suggested by prior benchmarking work (29, 30). Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back) to identify the states that best capture individual differences across domains. The relative performance of deep learning and other non-linear approaches depends on multiple factors, including sample size, model architecture, feature representation, and domain-specific characteristics of the prediction target. In this context, deep learning was employed as a flexible framework capable of modeling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.”

      (2) From Figure 6b, it looks like the functional connectivity matrices were converted to different images, and each of the four images (in grey, blue, yellow, and red) was treated as a separate channel. What are these grey, blue, yellow, and red images?

      In our study, the inputs to the deep learning models were subject-specific FC matrices of size 273×273. To augment the data, we created different versions of each FC matrix by reordering specific brain networks within the matrix. To visualize that the inputs were augmented, we used different color codings (grey, blue, yellow, and red) in Figure 6b. These colors were intended solely to represent different augmented versions of the same subject’s FC matrix. They were not treated as separate channels in the model. To avoid any confusion or misinterpretation, we have revised this part of the figure and now use only grey coloring to represent the augmented FC matrices.

      (3) The differences in performance between within vs. outside studies might simply be due to the fact that the models trained from DyNAMiC captured the brain variation due to age, which is also related to cognitive abilities. I was wondering if age is controlled for, would performance be more similar across the studies? The authors should provide the performance of models that are controlled for age.

      We initially computed partial correlations between FC features and cognitive measures while controlling for age. This is further supported by the fact that the model trained on the age-heterogeneous DyNAMiC sample provided a fairly reasonable prediction in the age-homogeneous COBRA dataset, particularly for working memory (see Figure 2d). Moreover, in our post hoc analyses, we additionally controlled for age when examining associations, for example, between the BCG and dopamine measures.

      (4) Related to point (3), from the discussion, "Validation outcomes thus affirm that the models, particularly those constructed from rest data, are robust to the particulars of the dataset." The performance dropped around half, so I am not sure if this conclusion is warranted.

      We thank the reviewer for raising this point. The prediction performance indeed dropped for episodic memory when models trained on the DyNAMiC sample were applied to the COBRA sample, whereas performance for working memory remained nearly identical across datasets. Although both EM and WM are sensitive to age, the divergence in cross-dataset performance suggests that factors beyond age alone may contribute to these differences. To address this, we have revised the discussion as follows:

      “Differences between the DyNAMiC and COBRA datasets make cross-dataset prediction a harder problem, as the age ranges of samples significantly vary, and prior studies highlight the importance of individual characteristics like age in predicting behavior from FC (33). In line with this, model performance decreased when predicting EM in the COBRA sample whereas prediction of WM remained largely unchanged. Thus, validation outcomes suggest that the models, particularly those predicting WM, show robustness across datasets, whereas the reduced EM performance highlights potential data-specific influences that limit generalizability.”

      (5) Please report the degree of freedom in all of the statistical analyses. Was the Mann-Whitney U test done on the bootstrapped r? If so, the degree of freedom was arbitrarily set by the number of bootstrapping, and hence the p-value can be higher or lower depending on the number of bootstrapping. This could lead to misleading conclusions.

      We appreciate the reviewer’s comment and agree that applying statistical tests directly to bootstrapped samples can lead to inflated or misleading p-values, as the degrees of freedom are determined by the number of bootstrap iterations rather than the actual number of independent observations.

      In our analysis, the Mann-Whitney U test was applied to 1000 bootstrapped correlation coefficients (r) for each model. While this number is relatively low and was chosen to limit overestimation of significance, we recognize that these bootstrapped samples are not independent, and thus the use of a Mann-Whitney U test can still be problematic. To address this concern, we have revised our statistical analysis. Rather than applying the Mann-Whitney U test to the bootstrapped r distributions, we now compute the difference in correlation coefficients (Δr = r<sub>actual</sub> − r<sub>rest</sub>) for each bootstrap iteration. We then calculate a 95% confidence interval for Δr. If this interval does not include zero, we consider the difference statistically significant. This approach avoids artificially inflating the sample size and adheres more closely to proper statistical inference.

      We have updated the Methods (the following text) and Results sections accordingly and clearly stated the limitations regarding the degrees of freedom for all tests.

      “For the bootstrap-based comparison of model performance (bootstrap resampling with 1000 iterations), no test statistic with an associated degree of freedom is reported. Instead, statistical inference is based on the bootstrap distribution of the difference in correlation coefficients (Δr) and its 95% confidence interval. As bootstrap confidence-interval–based inference does not rely on an analytic sampling distribution, degrees of freedom are not defined for this procedure.” This has now been explicitly stated in the Methods section to avoid ambiguity.
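
      The bootstrap comparison described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names, the paired resampling of participants, and the percentile method for the interval are all assumptions made for the example.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined for a constant sequence
    return sxy / (sxx * syy) ** 0.5


def bootstrap_delta_r(observed, pred_a, pred_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for r(observed, pred_a) - r(observed, pred_b).

    Participants are resampled with replacement, and the same resampled
    indices are used for both models so that the comparison stays paired
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(observed)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        obs = [observed[i] for i in idx]
        r_a = pearson_r(obs, [pred_a[i] for i in idx])
        r_b = pearson_r(obs, [pred_b[i] for i in idx])
        deltas.append(r_a - r_b)
    deltas = sorted(d for d in deltas if d == d)  # drop undefined (NaN) draws
    lower = deltas[int((alpha / 2) * len(deltas))]
    upper = deltas[int((1 - alpha / 2) * len(deltas)) - 1]
    return lower, upper
```

      Under this scheme, the difference between two models is treated as significant when the returned interval excludes zero, and no degrees of freedom enter the inference.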

      In the Results section, we now report Δr together with the corresponding CIs.

      (6) For predictive performance, the correlation was reported in the table, while R<sup>2</sup> is reported in the text. This is confusing. Also, could you clarify if the R<sup>2</sup> is calculated using the sum square definition, not Pearson r squared? If Pearson r squared was used, then R<sup>2</sup> of a negative Pearson r would be positive, which is misleading (see 10.1001/jamapsychiatry.2019.3671). Also, other performance indices apart from Pearson r and R² should be reported (e.g., MSE and MAE, again see 10.1001/jamapsychiatry.2019.3671). This will allow a better understanding of the models' performance.

      We thank the reviewer for this helpful comment. We acknowledge the inconsistency in reporting predictive performance metrics and have revised the manuscript for clarity. In the text, we have reported the r value, whereas in the table, we have reported r<sup>2</sup> using the sum-of-squares definition. Specifically, we now consistently report Pearson correlation (r), mean squared error (MSE), and mean absolute error (MAE) across both the text and Tables 1 and 2.

      Regarding r<sup>2</sup>, we confirm that it was calculated using the sum-of-squares definition, i.e.,

      $$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

      where $y_i$ is the observed score, $\hat{y}_i$ the predicted score, and $\bar{y}$ the mean observed score, rather than as the square of the Pearson correlation coefficient. This ensures that negative correlations do not result in misleading positive R<sup>2</sup> values, as pointed out by the reviewer and discussed in Poldrack et al. (2020). All performance metrics (r, r<sup>2</sup>, MSE, and MAE) are now reported in Tables 1 and 2 to allow a more comprehensive and interpretable comparison of model performance.
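
      To illustrate why the two definitions differ, the sketch below computes the four reported metrics, along with the squared Pearson correlation for comparison. The function name and the dictionary keys are hypothetical and chosen for this example only.

```python
def regression_metrics(y_true, y_pred):
    """Pearson r, squared Pearson r, sum-of-squares R^2, MSE and MAE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r = cov / (ss_tot * var_pred) ** 0.5
    return {
        "r": r,
        "r_squared_pearson": r ** 2,                 # always non-negative
        "r2_sum_of_squares": 1.0 - ss_res / ss_tot,  # can be negative
        "mse": ss_res / n,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

      For perfectly anti-correlated predictions, such as observed scores [1, 2, 3, 4] and predicted scores [4, 3, 2, 1], the squared Pearson correlation equals 1, whereas the sum-of-squares R<sup>2</sup> equals −3, which correctly flags the poor fit.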

      We have included a description of the method under Section 4.9, “Statistical significance analysis.”

      (7) Could you clarify how data are standardized across training, validation, and tests (including Z-standardization for the cognitive tests)? This is to prevent data leakage.

      Thank you for the comment. We standardized the cognitive test scores separately in the training and test samples.

      We have added the following paragraph to the method section:

      “A composite score of performances across the three tests was calculated and used as the measure of the cognitive domain in question (i.e., episodic memory, working memory). For each of the three tests, scores were summarized across the total number of trials. The three resulting sum scores were z-standardized and averaged to form one composite score for each domain. The standardization has been carried out independently for the training (DyNAMiC) and test (COBRA) samples.”
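
      A minimal sketch of this composite-score computation is given below. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def zscore(values):
    """Z-standardize one test's sum scores within a single sample."""
    m, s = mean(values), stdev(values)  # sample SD; an assumption of this sketch
    return [(v - m) / s for v in values]


def composite_score(test_scores):
    """Average the z-scored tests into one composite per participant.

    test_scores: list of per-test lists, each holding one sum score per
    participant, all in the same participant order.
    """
    z_per_test = [zscore(scores) for scores in test_scores]
    return [mean(zs) for zs in zip(*z_per_test)]
```

      Calling `composite_score` once on the training sample and once on the test sample keeps the means and standard deviations of the two samples separate, so no test-set statistics enter the training targets.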

      (8) There is really no ground truth to confirm that Grad-CAM provides actual feature importance used by the models. Perhaps the authors should compare that with Haufe transformation, which is commonly used in the predictive model for cognition (e.g., https://doi.org/10.1016/j.neuroimage.2021.118648 and https://doi.org/10.1016/j.neuroimage.2023.120115).

      We appreciate the reviewer’s comment and the suggested references. The Haufe transformation is primarily applied in traditional machine learning models, particularly in cognitive neuroscience, to interpret linear predictive models by mapping classifier weights back to the input space. However, its direct applicability to deep learning models, especially convolutional neural networks, remains an open research area with no widely established methodologies. Furthermore, the Haufe transformation does not provide feature importance in the same manner as Grad-CAM. Grad-CAM highlights spatial regions within an image that contribute to a model’s decision, making it particularly useful for interpreting convolutional networks in vision tasks. In contrast, the Haufe method offers a weight transformation that is more suited for understanding linear models and may not be as intuitive for feature attribution in complex hierarchical representations such as those learned by deep neural networks.

      While we acknowledge that Grad-CAM, like other interpretability methods, does not provide absolute ground truth validation for feature importance, it remains one of the most widely used and validated techniques for deep learning interpretability, particularly in medical imaging applications. Given its integration with frameworks such as Keras and TensorFlow and its ability to provide spatial attributions aligned with domain knowledge, we believe it is a suitable choice for our study. Future work may explore additional interpretability techniques, including adaptations of the Haufe transformation if applicable to deep learning architectures.

      We have added more details on Grad-CAM implementations in the Method.

      (9) Related to Grad-CAM, "These edges, indicated by a salience intensity of {greater than or equal to}.5, exert a significant influence on the model (Figure 1f)." What does 'significant' in this context mean? And how did the authors come up with the .5 threshold? Is it based on permutation or bootstrapping tests?

      We appreciate the reviewer’s comment and the opportunity to clarify our approach. In this context, the term "significant" refers to the regions' relative contribution to the model’s decision, as shown by the Grad-CAM saliency map. However, to avoid implying statistical testing, we will revise the term to "highly contributing."

      Regarding the 0.5 threshold, this value was selected empirically based on the normalized Grad-CAM activation values, where saliency scores range between 0 and 1. A threshold of 0.5 was used as a heuristic to highlight regions with relatively strong activation. However, this was not determined through statistical methods such as permutation or bootstrapping tests. We recognize the importance of rigorous threshold selection and will clarify this in the text. Future work could incorporate statistical methods to define thresholds more objectively.

      We have included the following text in the Method section:

      “Grad-CAM saliency maps were interpreted qualitatively, with a heuristic threshold (≥ 0.5) applied to highlight regions with relatively higher contribution to the model’s predictions. These values do not reflect statistical significance and should therefore be interpreted descriptively.”
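
      The thresholding step can be sketched as follows. Min-max scaling is assumed here as the way saliency values are brought into the 0 to 1 range; the text states only that the values are normalized, and the function name is hypothetical.

```python
def normalize_and_threshold(saliency, threshold=0.5):
    """Min-max scale a 2D saliency map to [0, 1] and flag salient entries.

    saliency: list of rows (a nested list), e.g. an upsampled Grad-CAM map.
    Returns the normalized map and a boolean mask of entries >= threshold.
    """
    flat = [v for row in saliency for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    # A constant map carries no relative information, so map it to zeros.
    normalized = [[(v - lo) / span if span else 0.0 for v in row] for row in saliency]
    mask = [[v >= threshold for v in row] for row in normalized]
    return normalized, mask
```

      Because the scaling is relative to each map's own minimum and maximum, the resulting mask marks the edges that contribute most within that map and carries no inferential meaning.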

      (10) Still related to the saliency map, I believe the upper and lower triangles of the functional connectivity matrix are the same. If so, why are there some differences in saliency? While the difference is not prominent, this might affect the accuracy of Grad-CAM.

      Minor differences in the saliency maps between the upper and lower triangles of the FC matrix can arise due to several factors. For instance, Grad-CAM generates saliency maps at the resolution of the convolutional feature maps, which are then upsampled to match the input matrix dimensions. We initially used the default bilinear interpolation, which may have introduced slight asymmetries or blurring, resulting in interpolation artifacts. In response, we have reprocessed the saliency maps using spline interpolation in MATLAB. The updated saliency figures have been included in the revised version of the manuscript.

      (11) Why did the authors only report the cross-study for EM on rest, and for WM on n-back? This is a bit unexpected since COBRA has both rest and n-back. If there is no good justification, please report both.

      We focused on reporting cross-study results for EM using rest because rest was the winning condition for predicting EM in the DyNAMiC sample. Importantly, n-back did not significantly predict EM in DyNAMiC, and rest did not significantly predict WM. For this reason, we highlighted only the conditions that showed meaningful predictive power in the original analyses.

      (12) Are codes, trained models, and data available? To ensure transparency and reproducibility, I hope to see the code from preprocessing to modeling and statistical analyses.

      The analysis code is openly available on our GitHub page https://github.com/MorEsm/AI-based-Prediction-of-Cognitive-Function. Due to ethical considerations and GDPR restrictions in the European Union, we are not permitted to publicly share the raw data. However, we can provide detailed information about preprocessing steps and analysis pipelines to facilitate reproducibility.

      (13 & 14) The authors did not appropriately control for regression-toward-the-mean and the influence of the working memory itself when calculating the brain cognition gap. This is commonly done to brain age (see https://doi.org/10.7554/eLife.87297.4, https://doi.org/10.1002/hbm.25533, https://doi.org/10.1016/j.nicl.2020.102229, https://doi.org/10.3389/fnagi.2018.00317). Otherwise, the brain cognition gap still depends on the cognition/working memory score itself. Based on Tetereva et al., "If, for instance, Brain Age was based on prediction models with poor performance and made a prediction that everyone was 50 years old, individual differences in Brain Age Gap would then depend solely on chronological age (i.e., 50 minus chronological age)." Because of this, Tetereva and colleagues found that the 'uncorrected' brain age gap that predicted chronological age the worst became the best index to predict fluid cognitive abilities. This shows the pitfall of the 'uncorrected' brain age gap. You can apply the same logic to the brain cognition gap.

      (14) Additionally, another way to show the unique contribution of brain cognition, over and above cognition per se, is to add both brain cognition and cognition together to predict physical activity, education, and cardiovascular risk.

      We thank the Reviewer for raising this important point. In response to their request and also the request from Rev. 1, we first examined the relationship between the Brain-Cognitive Gap (BCG) and the cognitive measure itself. Surprisingly, we did not find any significant relationship in either the DyNAMiC sample (r =0.01, p =0.939) or the COBRA sample (r =0.01, p =0.894) (see Author response image 1).

      We then conducted additional analyses, splitting the sample into high and low EM performers, and compared their levels of physical activity and Framingham cardiovascular risk scores. We found no significant difference in physical activity (DyNAMiC: p =0.56, CI: -14.99 – 8.13; COBRA: p =0.29, CI: -3.54 – 1.05) or Framingham CVD risk score (DyNAMiC: p =0.11, CI: -1.08 – 10.72; COBRA: p =0.41, CI: -1.86 – 4.58) between high and low EM performers. Given the significant difference in physical activity and Framingham CVD risk score between positive and negative BCG groups, our results support that BCG provides unique information, beyond the cognitive measure, regarding factors that contribute to cognitive resilience. These results have been added to Section 2.4, and Figure 3 has been updated.
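
      The dependence check discussed in this response, and the kind of correction the reviewer refers to, can be sketched as follows. The sign convention (predicted minus observed) and the linear residualization are assumptions of this example; the residualization illustrates the correction commonly applied to the brain-age gap and is not a step reported in the manuscript.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def brain_cognition_gap(predicted, observed):
    """Gap per participant; predicted minus observed is an assumed convention."""
    return [p - o for p, o in zip(predicted, observed)]


def residualize(gap, observed):
    """Remove the linear dependence of the gap on the observed score."""
    n = len(gap)
    mg, mo = sum(gap) / n, sum(observed) / n
    slope = (sum((o - mo) * (g - mg) for o, g in zip(observed, gap))
             / sum((o - mo) ** 2 for o in observed))
    return [g - (mg + slope * (o - mo)) for g, o in zip(gap, observed)]
```

      In the degenerate case described by Tetereva et al., where a model predicts the same value for everyone, `pearson_r(gap, observed)` equals −1 and the residualized gap is zero for all participants. A correlation near zero between the gap and the observed score indicates that little such dependence is present.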

      (15) Related to the brain age gap, the brain cognition gap is actually just another way to quantify how generalizable models are to another sample, similar to MAE or MSE. If the models built from DyNAMiC don't fit well with samples from COBRA, you will get a higher (i.e., wider) brain cognition gap, which means a poor fit. The authors should discuss this interpretation - should your biomarker's performance be due to a fit of the model?

      We appreciate this insightful comment. We agree that BCG can be interpreted not only as a marker of individual differences and resilience factors but also as a measure of model fit, analogous to error metrics, such as MAE or MSE. A higher gap may, in part, reflect poorer generalizability of models across samples. We have now revised the Discussion to explicitly acknowledge this alternative interpretation and to emphasize that BCG should be viewed both as a candidate biomarker and as a reflection of model performance.

      We added the following paragraph in the discussion:

      “An important caveat is that BCG can also be conceptualized as an error metric, similar to mean absolute error or mean square error, reflecting the extent to which models trained in one sample generalize to another. From this perspective, a larger gap may not only indicate individual differences related to resilience factors and dopaminergic function, but also reduced model fit or generalizability across datasets. Thus, BCG likely reflects a combination of meaningful biological variability and methodological variance.”

      (16) It is unclear why the authors binarized the brain cognition gap when predicting physical activity, education, and cardiovascular risk, and not doing so with the striatal D1DR. It is rarely a good idea to binarize a continuous variable (see 10.1136/bmj.332.7549.1080). In this case, people who had a bigger negative brain cognition gap were treated equally to people who had a smaller negative brain cognition gap. I also do not think it is necessary to separately analyze positive and negative gaps. Perhaps the authors should correlate the corrected brain cognition gap with physical activity, education, and cardiovascular risk and provide scatter plots and effect sizes.

      Following the reviewer’s suggestion, we directly correlated BCG with physical activity and cardiovascular risk. Our results confirmed our initial analysis that individuals with a negative gap exhibited lower physical activity and higher Framingham CVD risk across both COBRA and DyNAMiC datasets. We have reported these results on page 10.

      Author response image 5.

      (17) Given that the motivation is to move away from brain age, the authors should benchmark the corrected brain cognition gap against the corrected brain age gap, as well as against the performance when directly predicting physical activity, education, and cardiovascular risk from the functional connectivity metrics.

      Author response image 6.

      We agree that benchmarking BCG against BAG in predicting lifestyle and vascular risk factors would be valuable. We have calculated adjusted BAG and related it to lifestyle and vascular risk factors. Interestingly, we did not find any significant association, suggesting that BCG might be more sensitive to cognitive resilience. However, this investigation was beyond the scope of the present study. Our aim was not to compare BCG with BAG, but rather to examine whether BCG provides information beyond cognition itself. We also note that introducing BAG would open a separate line of investigation, namely, which cognitive state (rest, movie-watching, n-back) best estimates biological age. While this is an interesting question in its own right, addressing it here would considerably broaden the scope and complexity of an already dense manuscript. To prevent misunderstanding, we have clarified this point in the Discussion and added a caveat noting that future work should explicitly benchmark these approaches. That said, if the Reviewer and/or the Editor would like us to add these additional findings to the manuscript, we are open to doing so in a revision.

      We have added the following sentence to the Discussion.

      “While our focus was to investigate whether the brain–cognition gap provides information about factors contributing to cognitive resilience, we acknowledge that benchmarking BCG against the brain-age gap in predicting lifestyle and vascular risk factors would be valuable. However, addressing this question lies beyond the scope of the present study, and future work should systematically compare these approaches.”

      (18) Why was only the working memory score used to create brain cognition, and not episodic memory as well? Including both could provide a more comprehensive measure.

      We initially attempted to predict both episodic memory (EM) and working memory (WM). However, EM prediction was only reliable within and across samples for the resting state, whereas WM prediction generalized most strongly from the movie-watching condition. Because COBRA does not include a movie-watching paradigm, we could not evaluate WM prediction across datasets. For this reason, we focused on EM when examining the brain–cognition gap.

      (19) The PET mediation analysis seemed to come out of the blue. Is there existing literature showing the relationship between striatal D1DR and cognition? If so, did the authors find a similar relationship in the current data? I also suggest rewriting this section to strengthen the justification for the PET mediation analysis.

      We have previously conducted studies in which DA was found to be associated with memory (Johansson et al., 2023; Nyberg et al., 2016).

      The third aim of our study was to examine whether DA integrity is implicated in brain–cognition gaps (BCG), which we propose as a marker of cognitive resilience. In line with this aim, we found that lower DA receptor availability was associated with larger BCGs (Figure 4). We then asked whether this relationship is mediated by functional signal variability, such that lower DA is linked to reduced signal-to-noise ratio (i.e., greater entropy in functional connectivity), which in turn contributes to less reliable prediction of cognition and, consequently, larger BCGs. Our mediation analysis supports this pathway (see also our reply to Reviewer 1, Comment 6).

      Thus, our mediation was not designed to test whether DA predicts episodic memory performance directly, nor whether BCG mediates such a relationship. Instead, we specifically investigated whether the effect of DA on BCG operates through functional variability. We agree that future work could extend our approach by directly examining whether BCG mediates the link between DA and cognitive outcomes. However, in the present study, our primary focus was on testing the mechanistic pathway of DA → entropy → BCG.
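
      The DA → entropy → BCG pathway corresponds to a standard single-mediator model, which can be sketched with the product-of-coefficients approach below. This is an illustrative assumption: the response does not state which estimation procedure was used, and the function names are hypothetical. The coefficient of the mediator in the two-predictor regression is obtained by partialling the predictor out of both the mediator and the outcome (the Frisch-Waugh-Lovell result).

```python
def _slope(x, y):
    """OLS slope of y regressed on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


def _residuals(x, y):
    """Residuals of y after regressing out x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = _slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def indirect_effect(x, m, y):
    """Product-of-coefficients estimate for the path X -> M -> Y.

    a: slope of M on X; b: slope of Y on M controlling for X.
    Here X would be D1DR availability, M entropy, and Y the BCG.
    """
    a = _slope(x, m)
    b = _slope(_residuals(x, m), _residuals(x, y))
    return a, b, a * b
```

      In practice, the uncertainty of the indirect effect `a * b` is usually assessed by resampling participants with replacement and recomputing it, since its sampling distribution is not normal.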

      Minor recommendations:

      (1) Task-based connections are not truly task-based, as they are around 70-80% related to the resting state, capturing non-task-specific functional connectivity. Task-based connections should refer to techniques that derive task-related connectivity, such as psychophysiological interaction and beta-series correlation. Perhaps use terms like "functional connectivity during tasks."

      Thank you. This has been corrected throughout the manuscript.

      (2) Are there really two studies? The same MRI was used with the same configurations, and participants were from the same city. The only difference is the age range. It may be more appropriate to refer to this as "across age groups" rather than "cross-datasets."

      Thank you for this comment. While the two samples share some similarities, there are also several marked differences beyond age range. For example, movie-watching was administered in DyNAMiC but not in COBRA. The resting-state fMRI sequence was 12 minutes in DyNAMiC but only 6 minutes in COBRA. Moreover, DyNAMiC included dopamine D1-receptor PET, whereas COBRA assessed dopamine D2-receptor availability. Even the questionnaires used to measure physical activity differed between the two studies. Given these methodological and measurement differences, we believe that referring to them as “cross-datasets” rather than “across age groups” more accurately captures the distinction.

      (3) What kind of movie is "Cockpit"? Can you explain? Different movies may elicit different patterns of connectivity.

      We apologize for not providing information about the movie, which has been presented in our recent work (Johansson et al., 2023).

      The participants’ reactions to the content of the movie were not monitored, but the clips were selected to be as neutral in their content as possible. The content of the movie: Following his termination as a pilot and the end of his marriage, Valle embarks on a quest to secure new employment. Faced with desperation in the job market, he resorts to disguising himself as a woman with the intention of obtaining a position at a company specifically seeking a female pilot.

      This information is added to the method section.

      “During the fMRI session, participants viewed a 12-minute segment from the Swedish comedy film Cockpit (2012). We did not monitor participants’ responses to the movie, and the chosen clips were selected to be relatively neutral in emotional content. The storyline follows Valle, a recently fired pilot whose marriage has ended, as he struggles to find new employment. In a desperate attempt to secure a job at an airline specifically recruiting a female pilot, he presents himself as a woman.”

      (4) There is a typo in the equation numbering (i.e., two equations are designated as #1).

      We have now corrected the typo.

      (5) From the discussion: "Importantly, this prediction generalizes across conditions." This is not surprising given the similarity between conditions, with around 70-80% variance.

      We agree with the reviewer that the high similarity of FC across states likely increases the chance of cross-condition generalizability. However, this generalization is not guaranteed for all models. For example, the model trained on FC during movie-watching successfully predicted episodic memory during rest, but it did not generalize to episodic memory during the n-back condition, although movie-watching and n-back FC patterns are themselves highly correlated. Thus, the observed generalization is meaningful in demonstrating that not all models transfer equally well across states.

      That said, we have added the following sentence to the Discussion:

      “Importantly, this prediction generalizes across conditions and datasets, suggesting that features derived from resting state FC serve as a relatively stable marker of individual differences in EM, though with reduced strength in COBRA. While such generalization is partly facilitated by the similarity of functional connectivity across states, it is not a trivial outcome. For instance, the model trained on movie-watching data generalized to EM prediction during rest but failed to do so for the n-back condition, even though movie-watching and n-back connectivity patterns are themselves highly correlated. This indicates that successful generalization depends not only on shared variance across states but also on the cognitive processes most relevant to the target behavior.”

      (6) It might be helpful to include some figures for the cognitive tasks used. The description is a bit hard to follow without visual aids.

      Thanks for the comment. We have had a figure describing this in the initial paper about DyNAMiC (Nordin et al., 2022). We have added the Supplementary Figure (Fig S3) in the manuscript.

      Fig S3. Overview of the cognitive tests included in the DyNAMiC study. Adapted from Nordin et al. with permission.

      (7) It may not be appropriate to use the term "cross-validation" here, as one dataset was used for testing and the other for training, but not vice versa (so no "cross" per se).

      We thank the reviewer for pointing this out. We agree that the term “cross-validation” is not precise in this context, since we trained the model in one dataset and tested it in another without performing the reverse. We have revised the manuscript to use the term “external validation” instead of “cross-validation” to more accurately describe our cross-dataset approach.

      (8) I don't have access to the supplementary materials or code/data, so all of the comments here are based on the main text.

      We have added the supplementary materials and inserted the GitHub link to the code.

      Reviewer #3 (Recommendations for the authors):

      I suggest benchmarking against other simpler algorithms and controlling for memory in the brain cognition gap analyses.

      The authors might also want to simplify some aspects of the paper. There is a lot going on, which leaves less space to go into enough details for some analyses to warrant claims in the discussion. For example, the authors only compare the deep net to CPM and kernel ridge based on the literature. Direct comparisons would be needed.

      Thanks for the comment. We have made an attempt to address the concerns outlined in the public recommendation. Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back), with the aim of identifying the states that best capture individual differences across domains. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach. We have revised the manuscript text to make this focus clearer and to avoid any misinterpretation of our aims. Specifically, we removed statements in the Discussion that could be read as suggesting that our deep learning approach outperforms prior machine learning methods. While we compared our model with the connectome predictive modeling (CPM) approach and observed better performance with our deep learning framework, we did not conduct a comprehensive benchmark across all available machine learning methods, nor was this the aim of the present study. Accordingly, we have adjusted the text to avoid implying methodological superiority beyond the scope of our analyses. Furthermore, we have controlled for memory as suggested by the reviewer and outlined in response to reviewer 1.

    1. eLife Assessment

      This important study used whole-genome data to investigate Beefalo ancestry for the first time, providing insight into the genetics of Beefalo cattle and challenging the long-held claim of 37.5% bison ancestry reported by the American Beefalo Association. Despite some limitations regarding sequencing depth and sampling, the expert use of a comprehensive set of population-genomic methods allowed the authors to demonstrate convincingly that Beefalo and bison hybrid ancestry profiles are consistent with repeated backcrossing to either parental species. The work will be of significant interest to evolutionary biologists, population geneticists, animal breeders, and those involved in the conservation genetics of bovine species.

    2. Reviewer #1 (Public review):

      Summary:

      This study used whole genome data to investigate Beefalo ancestry for the first time, filling the gap in the field of Beefalo ancestry. The authors used preserved semen samples to generate genomic data on 47 registered Beefalo and 3 bison hybrids, further questioning the ABA's stated goal of ⅜ bison ancestry. In addition, the authors also show that ancestry profiles of Beefalo and bison hybrid genomes are consistent with repeated backcrossing to either parental species, demonstrating the value of genomic information in examining gene flow between species in the genus Bison. Overall, these data thus demonstrate the utility of genomic information in validating specific breeding claims for a more complete understanding of gene flow and genetic variation among bovine species. This is an interesting study, but some major weaknesses remain.

      Strengths:

      Numerous genetic analysis methods such as PCA, ADMIXTURE, F4 ratios, and local ancestry inference techniques revealed that no single Beefalo set meets the ancestry requirements set by the American Beefalo Association (ABA) and some beefalo had detectable indicine cattle ancestry.

      Comments on revised version:

      The authors have made further revisions in the revised manuscript, and these revisions have undoubtedly helped improve the article. No further comments.

    3. Reviewer #2 (Public review):

      Summary:

      Shapiro et al. set out to verify the American Beefalo Association's claim that Beefalo cattle possess 37.5% bison ancestry. They employ a comprehensive range of well-established population genomics methods to estimate ancestry in these hybrid populations, including PCA, ADMIXTURE, D and F statistics, and local ancestry inference. Their findings conclusively demonstrate that most Beefalo lack the claimed bison ancestry, with only 8 out of 47 samples showing any detectable bison ancestry, ranging from 2-18%.

      Strengths:

      The primary strength of this analysis lies in the comprehensive dataset available to the authors, which includes important foundational Beefalo individuals and various reference populations. The rigorous and multi-faceted methodological approach employs several well-established techniques in population genomics for detecting and measuring admixture. Each method used has a firm basis in the field, providing consistent and robust results. The authors' approach of using PCA to initially assess the data within a global context, followed by more specific analyses using ADMIXTURE and D-statistics, provides a clear and logical progression of evidence. The presentation of these results in figures is particularly effective, clearly illustrating the key findings of the study. Additionally, the examination of both autosomal and sex chromosome ancestry offers a more complete understanding of Beefalo genetic composition and the mechanics of bison-cattle hybridisation.

      Weaknesses:

      One limitation of this analysis is the relatively low coverage (~2x) of many Beefalo samples. However, the authors have taken steps to mitigate biases that may arise from this, and their downsampling experiment demonstrates that this level of coverage is appropriate for summarising species-level ancestry across Bos. Another potential weakness is the limited sampling of contemporary Beefalo populations, as the study focuses primarily on historical samples. The authors have justified this choice on the grounds that contemporary Beefalo breeding involves no further bison input, so founder-era individuals are the most informative samples for addressing the study's central question.

      Appraisal:

      The authors have clearly achieved their primary aim using a rigorous and comprehensive methodology. Their extensive dataset and multi-faceted analytical approach provide strong support for their conclusions. The study not only addresses its main research question but also reveals unexpected insights into Beefalo genetics, particularly the presence of zebu ancestry, predominantly from Brahman cattle.

      Discussion:

      This study is valuable for several reasons beyond its primary findings. First, it definitively addresses and refutes the claim of 37.5% bison ancestry in Beefalo, providing crucial information for those studying these interspecies hybrids and the viability of their offspring. Second, it reveals the unexpected presence of zebu ancestry, predominantly from Brahman cattle, in many Beefalo, raising intriguing questions about the breed's development and the potential role of zebu cattle in achieving desired traits. This finding suggests that the distinctive appearance of Beefalo may be due in part to zebu admixture rather than bison ancestry. Third, the study highlights the significant barriers to admixture between bison and cattle, both in controlled breeding programs and potentially in wild populations. This has important implications for conservation genetics and our understanding of gene flow between these species. Lastly, the study demonstrates the power of genomic analysis in verifying breed claims and understanding the complex history of domestic animal breeds. These findings open new avenues for research in bovine genomics, breed development, and the dynamics of interspecies hybridisation.

      Comments on revised version:

      Thanks for the responses, which address my comments in full. I have no further concerns.

    4. Reviewer #3 (Public review):

      Summary:

      The American beefalo cattle breed was developed as a mixture of 5/8 domestic cattle and 3/8 (or 37.5%) bison ancestry. The authors sequenced 50 genomes from bison and hybrids (historical and present-day). They found that most animals did not carry any detectable bison ancestry, with only a few between 2-18%, while other beefalo had taurine/zebu cattle ancestry, which may explain morphological traits. Breeding likely involved backcrossing to a parental species each time rather than crossing among admixed animals.

      The authors utilize whole genome sequence data to explore the ancestry of beefalo with respect to expected and possible contributions from cattle lineages. Using molecular and analytical methods central to questions exploring genomic ancestry and identity, the authors very nicely show evidence that calls into question the ability of ancestry to be deduced from breed club documentation without considering reproductive challenges that are known in hybridization between cattle lineages.

      Comments on revised version:

      The authors have addressed all my comments to help improve presentation of specific details, results, and readability. Thank you!

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study used whole genome data to investigate Beefalo ancestry for the first time, filling the gap in the field of Beefalo ancestry. The authors used preserved semen samples to generate genomic data on 47 registered Beefalo and 3 bison hybrids, further questioning the ABA's stated goal of ⅜ bison ancestry. In addition, the authors also show that ancestry profiles of Beefalo and bison hybrid genomes are consistent with repeated backcrossing to either parental species, demonstrating the value of genomic information in examining gene flow between species in the genus Bison. This is an interesting study that still has some major weaknesses, but overall, the work demonstrates the utility of genomic information in validating specific breeding claims for a more complete understanding of gene flow and genetic variation among bovine species.

      We thank the reviewer for their thoughtful assessment of our work.

      Strengths:

      Numerous genetic analysis methods such as PCA, ADMIXTURE, F4 ratios, and local ancestry inference techniques revealed that no single Beefalo set meets the ancestry requirements set by the American Beefalo Association (ABA) and some beefalo had detectable indicine cattle ancestry.

      Weaknesses:

      While this study contributes to our knowledge of Beefalo ancestry, there are some key issues that need to be addressed in terms of analysing the specific results as well as writing the article.

      We have followed the reviewer’s suggestions for improving our study in detail (specified below), and appreciate their close reading of the manuscript.

      Reviewer #2 (Public review):

      Summary:

      Shapiro et al. set out to verify the American Beefalo Association's claim that Beefalo cattle possess 37.5% bison ancestry. They employ a comprehensive range of well-established population genomics methods to estimate ancestry in these hybrid populations, including PCA, ADMIXTURE, D and F statistics, and local ancestry inference. Their findings conclusively demonstrate that most Beefalo lack the claimed bison ancestry, with only 8 out of 47 samples showing any detectable bison ancestry, ranging from 2 - 18%.

      We thank the reviewer for their thoughtful assessment of our work.

      Strengths:

      The primary strength of this analysis lies in the comprehensive dataset available to the authors, which includes important foundational Beefalo individuals and various reference populations. The rigorous and multi-faceted methodological approach employs several well-established techniques in population genomics for detecting and measuring admixture. Each method used has a firm basis in the field, providing consistent and robust results. The authors' approach of using PCA to initially assess the data within a global context, followed by more specific analyses using ADMIXTURE and D-statistics, provides a clear and logical progression of evidence. The presentation of these results in figures is particularly effective, clearly illustrating the key findings of the study. Additionally, the examination of both autosomal and sex chromosome ancestry offers a more complete understanding of Beefalo genetic composition and the mechanics of bison-cattle hybridisation.

      Weaknesses:

      One limitation of this analysis is the relatively low coverage (~2x) of many Beefalo samples. However, the authors have taken steps to mitigate biases that may arise from this. Another weakness is the limited sampling of contemporary Beefalo populations, as the study focuses primarily on historical samples. This may limit our understanding of how Beefalo genetics may have changed over time.

      The reviewer is correct that the low coverage obtained for many Beefalo is one potential limitation, although we believe that the downsampling experiment we performed (Fig. S4) shows that this level of coverage is appropriate for summarizing species-level ancestry across Bos, as the reviewer notes.

      Sampling contemporary Beefalo individuals would be valuable, though as the focus of our study was to understand the origins of bison ancestry in Beefalo, we prioritized sampling individuals which played an important role in establishing the breed. We also note that contemporary Beefalo breeding involves crossing between Beefalo individuals or backcrossing to cattle, with no additional bison ancestry input since the formation of the Beefalo. As such, sampling individuals that existed close to the breed’s founding should provide the most insight into bison ancestry in Beefalo.

      Appraisal:

      The authors have clearly achieved their primary aim using a rigorous and comprehensive methodology. Their extensive dataset and multi-faceted analytical approach provide strong support for their conclusions. The study not only addresses its main research question but also reveals unexpected insights into Beefalo genetics, particularly the presence of zebu ancestry.

      Discussion:

      This study is valuable for several reasons beyond its primary findings. First, it definitively addresses and refutes the claim of 37.5% bison ancestry in Beefalo, providing crucial information for those studying these interspecies hybrids and the viability of their offspring. Second, it reveals the unexpected presence of zebu ancestry in many Beefalo, raising intriguing questions about the breed's development and the potential role of zebu cattle in achieving desired traits. This finding suggests that the distinctive appearance of Beefalo may be due in part to zebu admixture rather than bison ancestry. Third, the study highlights the significant barriers to admixture between bison and cattle, both in controlled breeding programs and potentially in wild populations. This has important implications for conservation genetics and our understanding of gene flow between these species. Lastly, the study demonstrates the power of genomic analysis in verifying breed claims and understanding the complex history of domestic animal breeds. These findings open new avenues for research in bovine genomics, breed development, and the dynamics of interspecies hybridisation.

      Reviewer #3 (Public review):

      Summary:

      I really like this topic and study. But I think much can be more focused and tightened up. All the components are here - just some more refining to really make the storyline clear, the journey of discovery, and the impact of such knowledge.

      We thank the reviewer for their thoughtful assessment of our work.

      Strengths:

      The authors dive directly into the question of genomic ancestry as compared to the breed club's reported ancestry with heavy, quantitative data and critical analytical methods. The questioning line is direct and does not meander. The reader learns about the challenges of breeding associations and the value of understanding ancestry, and the study presents a clear need to re-evaluate the breed standards and expectations of beefalo (if ancestry is indeed the primary goal instead of a phenotype-driven breed mission).

      Weaknesses:

      Much of the quantitative results are only referred to in the main text with qualitative language. Please incorporate more written quantitative results to highlight evidence that underlines the study narrative because it is quite an interesting study!

      The reviewer highlights an important point, and we agree that the qualitative language used to describe the results was generally lacking. We have now described the results quantitatively throughout the manuscript where possible.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) This study is not the first to question claims surrounding bison ancestry in the breed and is the sample size too small to be representative of the entire genetic structure of Beefalo?

      The reviewer correctly points out that this study is not the first to address uncertainty in the amount of bison ancestry present across beefalo. All earlier studies, to our knowledge, have been highlighted in the introduction and discussion (Lenoir and Lichtenberger, 1978 and Stormont et al, 1986). However, these studies examined a narrow range of Beefalo sources and used older methods (karyotyping and blood typing), such that comprehensive statements about the proportion of bison ancestry in Beefalo could not be made.

      We also agree that an appropriate sampling scheme is crucial for making definitive statements about Beefalo ancestry across the breed. As Beefalo breeding typically involves breeding select “full-blood” individuals with cattle, the ancestry across contemporary Beefalo is likely complex, with the cattle component coming from a wide range of breeds. Therefore, our sampling emphasized “full-blood” representatives, especially those that were involved in the founding of the breed and from which later Beefalo descend. This involved an exhaustive survey of the Beefalo individuals contained within the USDA’s National Animal Germplasm Program. Although we did not extensively evaluate current Beefalo diversity, we believe this approach is most suited for characterizing bison ancestry within Beefalo, as bison ancestry is maintained primarily through the continued use of genetic material from these “full-blood” individuals rather than repeated hybridization between bison and cattle.

      (2) Although genomic information is important for breeding research, this requires quality of data. The coverage of the data used in this study was mainly ~2X, and although multiple methods of analysis gave similar results, the ability to identify rare variants (e.g. insertions or deletions of long segments of the genome) may be limited at low coverage, affecting the confidence of the results.

      This is an important consideration, and we agree with the reviewer that the sequencing depth obtained for most individuals in our study precludes accurate genotype calling. Therefore, we did not attempt to perform traditional genotype calling. Rather, we used a pseudohaploid calling approach in which a random base was selected to represent the genotype at each position for each individual, using a pre-ascertained set of variants discovered in gaur, a closely related outgroup to bison and cattle. This pseudohaploid approach is common in other situations where coverage is low, for example in analyzing ancient DNA.
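      The pseudohaploid approach described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the function name, the pileup representation, and the example sites are hypothetical. The idea is that, at each pre-ascertained biallelic site, a single read base is drawn at random to stand in for the genotype, so no diploid genotype has to be inferred from low-coverage data.

```python
import random

def pseudohaploid_call(read_bases, ref, alt, rng):
    """Pick one random read base at a pre-ascertained biallelic site.

    Returns 0 for the reference allele, 1 for the alternate allele, or
    None when no read carries either allele (treated as missing data).
    """
    informative = [b for b in read_bases if b in (ref, alt)]
    if not informative:
        return None
    return 0 if rng.choice(informative) == ref else 1

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical pileups at three ascertained sites: (read bases, ref, alt).
sites = [("AAG", "A", "G"), ("T", "C", "T"), ("", "G", "A")]
calls = [pseudohaploid_call(bases, ref, alt, rng) for bases, ref, alt in sites]
print(calls)
```

      Because every individual contributes exactly one sampled allele per site regardless of depth, heterozygous sites are not systematically miscalled as homozygous in low-coverage samples relative to high-coverage ones.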

      Furthermore, our ancestry analyses focused on biallelic SNPs which were discovered in gaur and we did not attempt to call structural variants, given the limitations in coverage. As this outgroup ascertainment approach seeks to target SNPs which were polymorphic in the ancestor of both bison and cattle, which should yield unbiased results in population genetic analyses, we were less interested in discovering rare variation within the species and populations we examined here.

      Finally, we performed downsampling experiments comparing low coverage read data to genotypes called from high coverage data, and obtained consistent results between low and high coverage analyses using read-level data and called genotypes (Fig. S7).

      (3) Missing from the conclusions is the very important presentation of the results of genomic calling, the basics of what these data look like, coverage histograms, number of SNPs, categorization, annotations, and so on. These are necessary prerequisites for subsequent population analysis.

      The reference to “5.29M” on page 14 has been replaced with the exact number of SNPs used in analyses (5,291,534). The average sequencing depth for each sample is also included in Table S1.

      (4) The manuscript mentions "most" in a number of places, but can the authors give an accurate number based on the current data? "Most" is not a rigorous description. Based on the simulations of genomic data, how many Beefalo cattle were not detected as hybridized? This may be related to both sample size and where the authors sampled.

      We thank the reviewer for this important suggestion. We have now replaced vague summaries of results with precise numbers. However, we are unsure what “simulations” means in this context, as all results were obtained by analyzing empirical data from Beefalo, bison, cattle, and other bovines, rather than simulations.

      (5) The information in the third and fourth paragraphs of the Introduction is not sufficiently coherent and could be further consolidated into a more logical presentation.

      We have now condensed these paragraphs and edited them for clarity.

      (6) "For some analyses we also incorporated published genomes from outgroups". The description here is unclear as to what criteria were used to select these data, and it is possible that the choice of outgroups could lead to different conclusions from the analyses. In addition, ancient DNA data from cattle may be useful for this study and the authors are encouraged to consider it.

      Outgroup choice can certainly have a large impact on population genetic analyses. For the species examined in our study, we considered other Bos species, including yak, gaur, and banteng, as suitable outgroups, along with water buffalo, which is the closest outgroup outside of Bos. We have added comparisons of D-statistics using yak as an outgroup as a supplementary figure (Fig. S4), in addition to those using water buffalo as the outgroup which were presented in Figure 2.

      As we were examining species-level ancestry, and given the high level of divergence between bison and cattle, relative to that between published ancient and modern cattle genomes, we believed that it was most appropriate to use high quality modern cattle data, rather than poorer quality ancient cattle genomes, for analyses. Additionally, as any hybridization which took place between bison and cattle in the formation of Beefalo would have occurred within the past ~50 years, modern cattle are likely to be the most appropriate proxy for the cattle ancestry in Beefalo, especially given the lack of published historical North American cattle genomes.
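      To make the role of the outgroup concrete, the standard ABBA-BABA form of the D-statistic can be sketched as follows. The population ordering (P1 cattle, P2 Beefalo, P3 bison, outgroup) and the allele data are assumptions for illustration only; the response does not spell out the exact configurations tested, and a real analysis would also estimate a standard error, for example with a block jackknife.

```python
def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA).

    Each site is a tuple of pseudohaploid alleles (p1, p2, p3, outgroup);
    the outgroup allele defines the ancestral state "A". Sites with any
    missing call (None) are skipped.
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if None in (p1, p2, p3, out):
            continue
        if p3 == out:
            continue  # P3 carries the ancestral allele: uninformative
        if p1 == out and p2 != out:
            abba += 1  # P2 shares the derived allele with P3
        elif p1 != out and p2 == out:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)

# Hypothetical alleles ordered (cattle, Beefalo, bison, outgroup).
sites = [
    (0, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0), (0, 0, 1, 0),
    (1, 1, 1, 0), (0, 1, 1, 0), (None, 1, 1, 0),
]
d = d_statistic(sites)
print(d)
```

      Under this ordering, a positive D indicates excess allele sharing between P2 and P3. Swapping the outgroup changes which allele is treated as ancestral at each site, which is why repeating the test with a second outgroup is a useful robustness check.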

      (7) The coordinates of the PCA plot need to be further supported by providing values.

      We have now updated axis labels for the PCA in Fig. 1A to include the proportion of variance explained for the first two components.
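      As a small illustration of how such axis labels are derived, the proportion of variance explained by a component is its eigenvalue divided by the sum of all eigenvalues, not only those of the plotted components. The eigenvalues and the helper name below are hypothetical.

```python
def variance_explained_labels(eigenvalues, n_axes=2):
    """Build axis labels such as 'PC1 (50.0%)' from PCA eigenvalues."""
    total = sum(eigenvalues)  # sum over ALL components, not just plotted ones
    return [
        f"PC{i + 1} ({100 * ev / total:.1f}%)"
        for i, ev in enumerate(eigenvalues[:n_axes])
    ]

eigenvalues = [5.0, 3.0, 1.0, 1.0]  # hypothetical values for illustration
labels = variance_explained_labels(eigenvalues)
print(labels)
```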

      (8) In Figure 1, Beefalo has one individual, NAGP9109, which belongs exclusively to the indicine group. For this individual, wouldn't it be nicer to label it separately in the PCA and ADMIXTURE plots, like Joe's Pride (JP), to make the presentation of the results clearer?

      This individual was one which was determined to be mislabeled as Beefalo within the NAGP and is actually a Brahman cattle. Therefore, we have relabeled it as zebu, rather than Beefalo, throughout the figures.

      (9) As the sex chromosome data do not fully support the authors' claims, some caution may be needed in describing the results.

      We interpret the sex chromosomal results as being fully consistent with patterns seen in the autosomes. However, they do shed some light on the dynamics of bison-cattle hybridization, and suggest male-mediated gene flow in which bison ancestry in Beefalo was introduced primarily through bison bulls.

      (10) Would it be appropriate to analyse the results at K = 3 only? The admixture analysis of all bison, cattle, bison hybrids, and buffalo individuals at different K values should further refine the results.

      We now also show ADMIXTURE results at K=2 and K=4 (Fig. S2) and present the cross-validation results from ADMIXTURE (Fig. S3).

      (11) The conclusions of this article about bison ancestry in Beefalo individuals are completely inconsistent with the American Beefalo Association, and should a description of possible reasons for this discrepancy be added to the discussion?

      Our analyses make it clear that there was much less hybridization between bison and cattle leading to the formation of the Beefalo than was previously believed. As the genetic data does not provide insight into exactly why this might be the case, we can only speculate on the precise reasons bison-cattle hybridization did not take place, which we have avoided here.

      Reviewer #2 (Recommendations for the authors):

      The manuscript is well written, the figures are easily understandable, and the claims made are justified by the results obtained.

      There is a need to clarify cattle breeding terminology, particularly concerning breeds like the Brahman. While often described as zebu-taurine hybrids, Brahman cattle typically show over 90% zebu ancestry when analysed using ADMIXTURE against panels including European Bos taurus, African Bos taurus, and Bos indicus animals. This context would help explain why "NAGP9109" clusters with the Zebu group.

      We thank the reviewer for this useful context, and agree that most Brahman cattle have a high proportion of zebu ancestry. In fact, the zebu group we included primarily consists of Brahman individuals, which we have now clarified in the text, which now reads:

      “The reported pedigree in the NAGP for this animal lists its composition as 1/2 Brahman, 1/4 Charolais, 1/8 bison, 1/16th Hereford, and 1/16th Shorthorn, but the American Brahman Breeders Association records this animal (#309519) as purebred Brahman, which is a zebu breed (5 of the other 6 zebu individuals analyzed here are Brahman cattle).”
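      The pedigree arithmetic in this passage can be checked with exact fractions. The sketch below uses the fractions quoted above and adds one assumption: that an offspring's expected ancestry is the mean of its parents' expected ancestries. The crossing scheme shown for reaching 3/8 is only one possible route, given for illustration, and is not taken from breed association records.

```python
from fractions import Fraction

# Reported NAGP pedigree composition for the animal, as quoted above.
reported = {
    "Brahman": Fraction(1, 2),
    "Charolais": Fraction(1, 4),
    "bison": Fraction(1, 8),
    "Hereford": Fraction(1, 16),
    "Shorthorn": Fraction(1, 16),
}
pedigree_total = sum(reported.values())

def offspring(parent_a, parent_b):
    """Expected bison ancestry of an offspring: the mean of its parents."""
    return (parent_a + parent_b) / 2

bison, cattle = Fraction(1), Fraction(0)
f1 = offspring(bison, cattle)         # first-generation hybrid: 1/2
backcross = offspring(f1, cattle)     # F1 backcrossed to cattle: 1/4
target = offspring(f1, backcross)     # one hypothetical route to 3/8

print(pedigree_total, reported["bison"], target)
```

      The reported fractions sum to 1, so the pedigree is internally consistent on paper, with an expected bison share of 1/8 (12.5%), below the 3/8 (37.5%) breed target.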

      I suggest three other improvements:

      (1) Standardise terminology: The manuscript alternates between "zebu" and "indicine" when referring to these cattle. While both terms are correctly defined in the introduction as "indicine (zebu; Bos indicus)" using one term consistently throughout would improve readability. I prefer "zebu" but leave this choice to the authors.

      We agree that this mixed terminology was confusing and have replaced all instances of “indicine” with “zebu.”

      (2) Add PCA metrics: including the percentage of variance explained by each principal component would demonstrate the genetic distinctiveness between bison and cattle, and between taurine and zebu cattle. This would also support the selection of K=3 for the ADMIXTURE analysis.

      The axis labels for the PCA have been updated to include the proportion of variance explained for each component. We now also show ADMIXTURE results at K=2 and K=4 (Fig. S2) and present the cross-validation results from ADMIXTURE (Fig. S3).

      (3) Improve quantitative precision: The authors could improve precision by replacing qualitative statements with exact counts. For example "39 of 47 Beefalo showed no detectable bison ancestry." The same suggestion applies when describing how many Beefalo had zebu ancestry.

      We thank the reviewer for this useful suggestion, and agree that the manuscript used imprecise language in describing the results of certain analyses. We have now added quantitative detail throughout the Results section.

      Reviewer #3 (Recommendations for the authors):

      (1) Introduction

      The introduction sets a tone that is heavily focused on the genetic revelation that the economics of beefalo are somewhat of a facade. Beefalo are indeed not part-buffalo (bison). It is unclear to me if the introduction also could benefit from motivating this with more of a theoretical framework based on evolution, inheritance, or trait transmission. If this is really meant to be an economics-focused article, then lean more heavily into that. As it stands, it straddles a bit of economics, a bit of legacies that appear false (beefalo are not part bison at all!), and a bit of admixture genetics theory.

      We intended the focus of this study to be on documenting the species-level ancestry of Beefalo, and concentrated the information presented in the Introduction on this topic. Given that less hybridization between bison and cattle appears to have taken place to form the Beefalo breed than was previously described, we believe that broader theoretical statements about admixture are less relevant here, beyond highlighting examples of successful and failed interspecies hybridization in Bos. We also avoided speculating on the history of the establishment of the breed beyond what could be understood from the genetic data.

      Can the authors give a bit more details about beefalo breeding? Did the breeders select for any quantitative traits and is there a targeted phenotype for beefalo they used as a standard?

      Limited information exists about the precise origins of Beefalo, which were never publicly shared—possibly in part for reasons this manuscript addresses. The only criterion defining Beefalo is the proportion of bison ancestry, and so no quantitative traits or specific phenotypes are related to the breed standard.

      Can the authors provide a few examples of what is known about the incompatibilities and reproductive challenges? What is known from past research or from the Beefalo Association documenting the breeding history?

      We provided a general summary of hybridization and incompatibility across Bos, but unfortunately cannot provide details about incompatibilities in Beefalo specifically. Though there is a long history of challenges interbreeding bison and cattle (referenced in the third paragraph of the Introduction), to our knowledge no examination has been carried out of Beefalo specifically and little is known about Beefalo pedigrees (again, perhaps for reasons related to information presented in this study).

      (2) Results Section Sequencing Beefalo genomes

      Please report the number of polymorphic sites to accompany the genomic read depth averages. It seems the authors could include a larger summary of the genomic data that was used for downstream analyses (like the PCA in the next section). Also, does this dataset include the sex chromosomes? How many variants that are retained for analyses are autosomal, sex-linked, or haploid? Please provide more characteristics of the data that was generated after QC and filtering.

      We have now replaced “5.29M” on page 14 with the exact number of SNPs (5,291,534) and added a description of genotype calling to the Results section. We have also included the number of SNPs used for sex chromosomal analyses.

      (3) Results section Estimating bison ancestry in beefalo

      What is a "foundational" individual? Is this a beefalo pedigree founder, a common sire, or an individual with remarkably high bison content? I see in the introduction Joe's Pride was the "most expensive cattle" but there are surely other aspects of "foundational" that the reader should understand as the results are presented.

      We agree that this terminology was imprecise, and have now clarified that we use foundational to mean an early individual that was important in the founding of the Beefalo breed, such as those that were first bred by Bud Basolo.

      The significance of the sentence "The reported pedigree in the NAGP for this animal [NAGP9109] lists its composition as 1⁄2 Brahman, 1⁄4 Charolais, 1⁄8 bison, 1/16th Hereford, and 1/16th Shorthorn, but the American Brahman Breeders Association records this animal (#309519) as purebred Brahman." is difficult for a reader with limited cattle breed knowledge to infer. What is the origin of the Brahman breed? Does Brahman ancestry come from another mixed origin that could explain this discrepancy? Does the PCA have references to resolve the origin of Brahman? I realize this may sound extraneous, but if the animal belongs to a breed that was recently formed from several other lineages or breeds, could you be seeing the deeper components that compose Brahman cattle? How could one validate that the contributors erroneously labeled this individual as a beefalo?

      We have now noted that the Brahman breed has primarily zebu ancestry. The placement of this individual in the PCA supports the American Brahman Breeders Association metadata, and suggests that the NAGP labeling is incorrect:

      “The reported pedigree in the NAGP for this animal lists its composition as 1/2 Brahman, 1/4 Charolais, 1/8 bison, 1/16th Hereford, and 1/16th Shorthorn, but the American Brahman Breeders Association records this animal (#309519) as purebred Brahman, which is a zebu breed (5 of the other 6 zebu individuals analyzed here are Brahman cattle). We believe NAGP9109 was erroneously labeled as Beefalo by the contributors.”

      Figure 1A: Please add % explained by each PC.

      We have now updated axis labels for the PCA to include the proportion of variance explained for each component.

      Figures 1B and 1C are identical except for the Y axis. Please combine them into a graph with 2 Y-axes (one for PC1 and one for ADMIXTURE). Also, please include the bison in this panel as well.

      We have now updated these panels to include bison, although have kept the labeling so that they may be referenced separately in the text.

      I see that the authors did both unsupervised and supervised analyses. Can the main text have the supervised graphical result instead of the unsupervised? That is more relevant for ancestry proportions via an assignment probability to ancestry groups. Or, if possible, could the authors consider STRUCTURE to also obtain the probability of assignment to a previously defined parental group up to 2 generations back? This is by far the best way to leverage the ancestry information of the cattle and bison parental references in addition to the known F1/bison hybrids. Swap the Supplementary Figure 1 with Figure 1D!

      The supervised and unsupervised ADMIXTURE results are highly consistent, as could be expected given the high levels of divergence between species. We prefer to show the unsupervised results in the main text, as this makes the fewest assumptions about the ancestry of the examined individuals, and so also shows that the panels used to represent each species (taurine cattle, zebu cattle, and bison) do not contain individuals which were themselves highly admixed, which could have influenced the supervised ADMIXTURE analyses.

      For the unsupervised ADMIXTURE analyses, what were the cross-validation values per K value tested? How did the authors decide that K=3 was the best one to show?

      We now also show ADMIXTURE results at K=2 and K=4 (Fig. S2) and present the cross-validation results from ADMIXTURE (Fig. S3).

      Regarding "D-statistics ..... are consistent with 0 for most individual Beefalos....", I have two comments. First, by "consistent with", do you mean "are not significantly different from 0", indicating that (explain what this means in your words). Next, "most individual beefalos" means how many? Please provide numbers and values to highlight points or specific findings.

      The interpretation of the D-statistics has been clarified and Z-scores and numbers of individuals to quantitatively describe these results have been added. The text now reads:

      “D-statistics of the form D (taurus, Beefalo; bison, water buffalo), which test whether Beefalo share more alleles with bison than taurine cattle, again show 39 Beefalo have no excess affinity with bison compared to taurine cattle (-13.04 < Z < 3.14), although the same eight Beefalo identified in PCA and ADMIXTURE as having bison ancestry also have an excess of bison alleles (6.16 < Z < 34.86), confirming their bison ancestry (Fig. 2A).”
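
      The logic of the D-statistic quoted above can be illustrated with a minimal sketch. It assumes per-site derived-allele frequencies for the four groups and uses one common ABBA-BABA formulation with a delete-one-block jackknife for the Z-score. The function names, the equal-size blocking, and the sign convention (D is positive when the second population shares excess alleles with the third) are illustrative assumptions rather than the authors' pipeline, and software packages differ in sign convention.

```python
import math

def d_statistic(sites):
    """ABBA-BABA D from per-site derived-allele frequencies (p1, p2, p3, p4).

    Hypothetical ordering: p1 = taurine cattle, p2 = Beefalo, p3 = bison,
    p4 = outgroup (water buffalo). Positive D means p2 shares more derived
    alleles with p3 than p1 does.
    """
    num = den = 0.0
    for p1, p2, p3, p4 in sites:
        abba = (1 - p1) * p2 * p3 * (1 - p4)
        baba = p1 * (1 - p2) * p3 * (1 - p4)
        num += abba - baba
        den += abba + baba
    return num / den if den > 0 else float("nan")

def block_jackknife_z(sites, n_blocks):
    """Return (D, standard error, Z) using a delete-one-block jackknife.

    Sites are split into n_blocks contiguous, equal-size blocks (an
    assumption; any leftover sites at the end are ignored).
    """
    size = len(sites) // n_blocks
    blocks = [sites[i * size:(i + 1) * size] for i in range(n_blocks)]
    d_full = d_statistic([s for b in blocks for s in b])
    # Recompute D with each block left out in turn.
    loo = [
        d_statistic([s for k, b in enumerate(blocks) if k != j for s in b])
        for j in range(n_blocks)
    ]
    mean_loo = sum(loo) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((d - mean_loo) ** 2 for d in loo)
    se = math.sqrt(var)
    z = d_full / se if se > 0 else float("nan")
    return d_full, se, z
```

      Under this convention, a Z-score far above zero corresponds to the excess of bison alleles reported for the eight Beefalo, while values near zero indicate no detectable excess. Blocking contiguous sites is meant to account for linkage between neighbouring SNPs.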

      "In Beefalo with bison ancestry, that ancestry tends to be present in large contiguous blocks, often tens of megabases in size, indicative of recent admixture (Figure 3A, B)". Please display the quantitative results (mean, max, range, standard deviation, etc.) in the main text and point the reader to the table that contains the values for each individual. The rest of this paragraph also uses the words "most' or "always" - please provide numbers. Is most 30/46 beefalo? Is it always exactly all 47 beefalo? Readers want to see numbers!

      The reviewer is correct that this section lacked specificity. We have now provided the exact number of individuals identified with bison and zebu ancestry.

      The section starting "Several lines of evidence attest to the efficacy of using these source panels..." could realistically come first in the Results section and before beefalo results are presented. This would build confidence for the reader that this panel of samples passes a QC and will indeed be able to resolve ancestry-based questions.

      This section specifically refers to the local ancestry analyses, which we have now clarified in the text.

      Figure 3A-C: Please include on each of these figure panels the documented (breeder association) ancestry percentage and the percentage of bison ancestry you obtained from your genomic analyses. Moving it from the legend to the figure is more immediately powerful for the reader. If the authors dated the admixture events as well, please include the meta-data of the association pedigree reporting when bison entered the target individual's genome versus the genome-estimated number of generations since admixture.

      Figure 3 has now been updated to include the reported bison ancestry. No attempt was made to date the admixture event or compare with reported pedigrees, as documented Beefalo pedigrees are typically very sparse (and may be unreliable, as our results suggest).

      Figure 3 legend: Move the following text from the figure legend to the Results section: "Three bison hybrids are inferred to have ~75% bison ancestry, while eight Beefalo have detectable bison ancestry, ranging from 2-18%. Indicine ancestry is detected in most Beefalo at variable levels, ranging from 2-38%, with most Beefalo having between 2-18%.".

      This sentence has been removed from the legend and is now worked into the main text. The corresponding paragraph in the results now reads:

      “Local ancestry inference across individual Beefalo and bison-cattle hybrid genomes provides similar estimates of overall Beefalo ancestry, inferring an absence of bison ancestry across the 37 Beefalo that lacked evidence for such ancestry in previous analyses (Fig. 3). Three bison hybrids are inferred to have ~75% bison ancestry, while eight Beefalo have detectable bison ancestry, ranging from 2-18%. Zebu ancestry is detected in 38 Beefalo at variable levels, ranging from 2-38%, with all but two Beefalo having between 2-18%.”

      (4) Results section Beefalo sex chromosome ancestry

      Check that the authors do not reference Figure 4B before Figure 4A.

      We thank the reviewer for noticing this; it has now been corrected.

      Figure 4A: Could this panel be considered to merge with the autosomal admixture plot? It helps with comparison. Not a firm request - but it is nice to see what is consistent versus what is discordant.

      To avoid cluttering the figure with two highly similar plots, we preferred to separate the autosomal and sex chromosomal results.

      Figure 4C: Could this panel be merged with the autosomal ancestry bar graph to help the reader with visual comparisons?

      We thank the reviewer for this suggestion, but do not understand exactly which figures they are suggesting to be merged.

      (5) Materials and Methods: Modeling Beefalo ancestry:

      The language used in this sentence "This approach allows for directly understanding the ancestry of Beefalo individuals relative to these three groups while mitigating the effects of the low sequencing depth obtained for many Beefalo." conflicts with a sentence later in this paragraph which called PCA a model-free analysis. Please correct.

      Unfortunately, we are unsure what the reviewer refers to here and believe that this sentence does not conflict with the characterization of PCA as a model-free analytical approach.

    1. eLife Assessment

      This study provides a detailed anatomical and functional framework for understanding CO₂ processing and behavioral flexibility in Drosophila. The significance of the work is important, as it identifies how specific neural circuits, such as LN23, modulate innately aversive signals across different contexts. The strength of the evidence is convincing, supported by a robust combination of connectomics, anatomical reconstructions, and targeted behavioral manipulations.

    2. Reviewer #1 (Public review):

      Summary:

      The authors set out to better understand how Drosophila responses to CO2 can be aversive or attractive depending on context (especially presence of food odors, temperature, humidity). While some aspects of this circuit had been previously identified, the authors uncovered additional, critical aspects of the circuit to more fully explain these phenomena. One important discovery was the identification of the LN23 interneuron, which receives input from the V glomerulus. LN23 relays sensory input via an extraglomerular CO2 pathway, and manipulation of LN23 activity revealed a dominant role in CO2-induced avoidance behavior.

      Through a careful series of experiments, the authors demonstrate important aspects of these parallel (and sometimes converging) circuits - differential sensitivity to CO2 concentration changes, synaptic plasticity, circuit connectivity, developmental origins, and the effect of chemo and optogenetic manipulations on behavior. Together, they piece together a complex and interconnected circuit diagram for CO2-dependent behaviors that can be modulated by external factors. This finding will be impactful not only for the fly olfactory/gustatory field but also for many others in the sensory neuroscience community who are very interested in understanding state-dependent modulation of sensory circuits.

      Strengths:

      The experiments were well described and controlled. The addition of the developmental trajectory of the LN23 neurons was interesting. The inclusion of multiple levels of analysis from synaptic contacts and activity-dependent labeling of synapses, circuit analysis guided by connectomes, and detailed behavior analysis for each part of the circuit were all strengths.

      Weaknesses:

      The circuit is very complex and interconnected. This is important for its function, but it makes reading through the manuscript a challenge. The diagrams are helpful, but still somewhat confusing, and some of the experimental findings do not completely support the model outlined in the final figure.

      The main difficulty is visualizing the "default/predominant aversive" LN23 circuit - in the final diagram, there is no "stop" sign on that side, although it's depicted as an inhibition of a "go". Also, importantly, the findings shown in Figure 5 demonstrate pretty convincingly that LN23 inhibition reduces CO2 avoidance "almost entirely". Also supporting a central role for LN23 is the opposite effect of silencing LN23, with chronic CO2 inducing attraction. If this is the case, then where is the contribution of the other canonical aversive pathway? How does the silencing of LN23 override the PNvbi/uni pathways to aversion? Incorporating this into the figure more prominently would improve the understanding of this contribution to the circuit.

      A minor weakness is that CO2 levels were not reduced below ambient air. For the first part of the paper addressing the activation of these circuits, there seemed to be a ceiling effect for the LN23 neurons at ambient CO2 levels. It would be interesting to see if there would be some change to the activity labeling experiments if CO2 were reduced or eliminated from the air.

    3. Reviewer #2 (Public review):

      Summary

      The authors investigate how parallel olfactory pathways contribute to CO₂ valence processing in Drosophila. By combining multiple approaches, the study identifies LN23 as a previously unrecognized component of the CO₂ circuit and proposes a model in which distinct downstream pathways contribute to aversive and attractive behavioral responses. More broadly, the work aims to connect circuit organization with context-dependent sensory processing and behavioral valence.

      Strengths

      A major strength of the study is the integration of multiple complementary approaches spanning anatomy, circuit analysis, and behavior. This combination provides a rich and valuable framework for understanding how CO₂ information may be processed across different levels of the olfactory system. The identification of LN23 as an important component of the CO₂ pathway is particularly interesting and will likely be useful for future studies investigating olfactory processing, behavioral state modulation, and valence coding. The connectomic and anatomical analyses also provide a valuable resource for the community.

      Another strength of the manuscript is its conceptual ambition. The work moves beyond a simple labeled-line view of olfactory processing and proposes that flexible behavioral responses may emerge from interactions between parallel downstream pathways and multimodal integration centers. The behavioral manipulations further support an important role for LN23 in CO₂-related behaviors.

      Weaknesses

      Several aspects of the conceptual interpretation would benefit from additional clarification or more cautious framing relative to the current experimental evidence. In particular, the distinction between atmospheric versus experimentally elevated CO₂ conditions, as well as the interpretation of chronic exposure in terms of habituation, remains somewhat unclear throughout the manuscript.

      Some conclusions regarding valence coding and multimodal integration also appear more inferential than directly demonstrated experimentally, especially when moving from anatomical connectivity to functional interpretation.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Javorski and colleagues investigate how CO2 valence is processed in the Drosophila olfactory system. Although CO2 is classically associated with an aversive labeled‑line pathway, its behavioral significance can be modulated by environmental context, such as the presence of food‑related cues. The circuit‑level mechanisms underlying this flexibility remain incompletely understood. The authors address this gap by examining how CO2 sensory information diverges at early stages of olfactory processing and how distinct neural pathways contribute to opposing behavioral outcomes. By identifying the local interneuron LN23 as a relay for CO2‑induced aversion, the study suggests that CO2 valence processing may begin to diverge at the level of the antennal lobe, prior to synaptic integration in higher‑order brain regions such as the lateral horn.

      Strengths:

      A major strength of this study is its comprehensive, multi-level experimental design that effectively links neuronal identity, synaptic organization, and behavior. The authors combine calcium‑based anatomical mapping, activity‑dependent reporters, optogenetic and thermogenetic manipulations, and connectomic analyses with behavioral readouts under genetically defined neuronal activation or silencing conditions. Specifically, the identification of LN23 as a component of the CO2 avoidance pathway is supported by anatomical, genetic, and behavioral evidence. Both silencing and activation experiments indicate that LN23 plays an important role in mediating CO2‑induced aversive responses. In contrast, manipulation of the projection neurons (PNv bi and PNv uni) produces more modest behavioral effects, suggesting a degree of specificity for LN23‑associated circuitry within the avoidance pathway. Moreover, the use of a previously reported connectome to identify downstream third‑order neurons strengthens the proposed circuit model and provides anatomical support for early divergence of CO2 valence processing.

      Weaknesses:

      While the study provides a strong mechanistic framework for CO2 aversion, some aspects of context‑dependent valence modulation are less directly addressed and may benefit from further experimental exploration.

    1. eLife Assessment

      This study presents analyses of single neuron activity in the subthalamic nucleus (STN) of monkeys performing a decision-making task that manipulates both perceptual evidence and reward. The study shows convincing evidence of distinct subpopulations of neurons in STN that differ in their representations of key quantities related to decision formation. These findings reveal important functional heterogeneity within the STN that helps provide new insights into its contributions to decision processing.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript offers a careful and technically impressive dissection of how subpopulations within the subthalamic nucleus (STN) support reward-biased perceptual decision-making. The authors recorded STN neurons in monkeys performing an asymmetric-reward visual motion discrimination task, then combined single-unit analyses, regression modeling, and drift-diffusion model (DDM) fitting to identify functionally distinct neuronal clusters. Each subpopulation shows unique relationships to computational decision variables - evidence accumulation rate, decision bound, and non-decision time - as well as to post-decision evaluative signals including choice accuracy and reward expectation. The revised manuscript substantially strengthens the original submission by improving both the objectivity of neuron selection and the robustness of the clustering solution.

      Strengths:

      The asymmetric-reward paradigm cleanly separates perceptual and motivational contributions to STN activity, allowing the authors to characterize how neurons blend these distinct sources of information. The dataset is extensive and well-controlled, and the behavioral and neural analyses are tightly integrated. Relating cluster-specific activity to DDM parameters provides an interpretable computational link between population signals and behavior. The clustering solution is now validated across two algorithms, two monkeys, and subsets of trials - establishing that the three-cluster structure is robust. The new Figure 9 offers a conceptually useful, if necessarily speculative, synthesis connecting the identified subpopulations to distinct basal-ganglia pathways (hyperdirect versus indirect). The new Figure 8 documenting the anatomical intermingling of subpopulations is also important, as it directly informs the interpretation of prior and future STN stimulation studies.

      Weaknesses:

      The inferred relationships between neural clusters and DDM parameters remain correlational - the authors now appropriately flag this throughout, and the causal inference gap is acknowledged in the Discussion with concrete proposals for future targeted perturbation strategies. While a generative multi-cluster model would further strengthen mechanistic interpretation, the conceptual framework in Figure 9 provides a reasonable intermediate step given the scope of the study and the absence of simultaneous population recordings, which preclude direct inter-cluster covariation analyses. These remaining limitations are inherent to the experimental design rather than analytical oversights.

    3. Reviewer #2 (Public review):

      This study uses monkey single-unit recordings to examine the role of the STN in combining noisy sensory information with reward bias during decision-making between saccade directions. Using multiple linear regressions and clustering approaches, the authors overall show that a highly heterogeneous activity in the STN reflects almost all aspects of the task, including choice direction, stimulus coherence, reward context and expectation, choice evaluation, and their interactions. The authors report in particular how three classes of neurons map to different decision processes evaluated via the fitting of a drift-diffusion model. Overall, the study provides evidence for functionally diverse and anatomically intermingled populations of STN neurons, supporting multiple roles in perceptual and reward-based decision-making.

      This study follows up on work conducted in previous years by the same team and complements it. Extracellular recordings in monkeys trained to perform a complex decision-making task remain a remarkable achievement, particularly in brain structures that are difficult to target, such as the subthalamic nucleus. The authors conducted numerous analyses of STN activity, using sophisticated statistical approaches and functional computational modeling.

      One criticism that I would still make of the revised version of the paper concerns the description of the behavior of the two monkeys, which remains minimal. The authors acknowledge differences in the monkeys' choice and RT performance that reflect "individual differences in sensitivity to motion stimulus and a common heuristic-based satisficing strategy", but this sentence is not clear to me. Moreover, the potential consequences of these differences for neuronal activity are considered only in the cluster analysis performed separately for each of the two animals, which turns out to show no notable difference.

      Compared to the first version of the paper, the cluster analysis in this revised version yields three distinct populations instead of the previous four. While the authors suggest that these subpopulations play important roles in encoding different aspects of decision-making, the identification of three rather than four subpopulations seems to me an important update that warrants discussion.

      Finally, I think it would have been interesting to identify the level of collinearity in the model proposed by the authors (equation 7). Indeed, one can expect significant collinearity between some of the proposed explanatory factors of neuronal activity, such as choice and coherence level, for example. Similarly, for the analysis relating neuron activity to decision evaluation signals (p 16), firing rates calculated using sliding averages with 1-ms steps are compared, but the method does not specify controls for multiple comparisons or for non-independent data.
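
      The collinearity concern raised above can be made concrete with a minimal sketch. It assumes just two regressors, in which case the variance inflation factor (VIF) reduces to 1 / (1 - r²), where r is their Pearson correlation. The regressor names in the usage note are hypothetical stand-ins and are not the terms of the authors' equation 7, which is not reproduced in this review.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2).

    With more than two predictors, r^2 would be replaced by the R^2 from
    regressing one predictor on all of the others.
    """
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)
```

      For example, `vif_two_predictors(choice, signed_coherence)` on hypothetical per-trial vectors would return a value near 1 when the two regressors are close to orthogonal and grow without bound as they become collinear. What counts as a problematic VIF is a matter of convention.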

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The inferred relationships between neural clusters and specific drift‑diffusion parameters (e.g., bound height, scaling factor, non‑decision time) are intriguing but inherently correlational. The authors should clarify that these associations do not necessarily establish distinct computational mechanisms.

      We agree and have revised the text to avoid any mention of a causal relationship.

      (2) While the k‑means approach is well described, it remains somewhat heuristic. Including additional cross‑validation (e.g., cluster reproducibility across monkeys or sessions) would strengthen confidence in the four‑cluster interpretation.

      We took several steps to increase our confidence in the clustering results. First, we made improvements in how we used the k-means method, primarily by using activity vectors with finer time resolution and filtering out “outlier” neurons (details in Methods) that were dissimilar to other neurons to reduce spurious clustering results. Second, we performed a new set of clustering procedures based on the linkage method, in addition to the k-means method that we originally used. The two clustering methods generated very similar neuron groupings, with a Rand index of 0.93. We now present k-means results in the main figures and linkage results as supplements (e.g., compare Fig 5 and Fig 5-S2). Third, following the reviewer’s suggestion, we performed clustering based on the two monkeys’ data both combined and separately (new Fig 5-S3). Clustering of data from both monkeys combined, compared to each monkey considered separately, had Rand index values of 0.94 and 1 for monkeys C and F, respectively (i.e., neurons from one monkey tended to be assigned to the same cluster regardless of whether the clustering was based on data from that monkey alone or both monkeys together), indicating comparable cluster boundaries for the two monkeys. Lastly, we performed clustering based on pseudo-vectors derived from sampling a subset of trials for each neuron and found that the clustering results were stable and robust based on as few as 40% of the trials (new Fig 5-S4).

      Because most neurons were recorded in separate sessions, we cannot perform session-based cross validation.
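
      The Rand index used in the response above to compare clusterings can be sketched as follows. This is a minimal illustration of the standard (unadjusted) definition, the fraction of neuron pairs on which two labelings agree. The function name and toy labels are assumptions, and the response does not state whether the authors used the unadjusted or the chance-adjusted variant.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by two clusterings.

    A pair "agrees" if both clusterings put the two items together, or both
    put them apart. Cluster label names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

      A value of 1 means the two methods produced identical groupings up to relabeling, as reported for monkey F; values slightly below 1, such as 0.93, mean that a small fraction of neuron pairs were grouped differently.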

      (3) The functional dissociations across clusters are clearly described, but how these subgroups interact within the STN or through downstream basal‑ganglia circuits remains speculative.

      We agree and have made sure any speculative claims we make are clearly described as such.

      (4) A natural next step would be to construct a generative multi‑cluster model of STN activity, in which each cluster is treated as a computational node (e.g., evidence integrator, bound controller, urgency or evaluative signal).

      (5) Such a low‑dimensional, coupled model could reproduce the observed diversity of firing patterns and predict how interactions among clusters shape decision variables and behavior.

      (6) Population‑level modeling of this kind would move the interpretation beyond correlational mapping and serve as an intermediate framework between single‑unit analysis and in‑vivo perturbation.

      We agree that such a model would be extremely useful. However, given that designing, implementing, and testing a model like that would require a good deal of speculation about functional and anatomical interactions that we did not measure, it is also well outside the scope of the current study.

      That said, we appreciate the suggestions, which spurred us to go further in terms of providing a summary of our findings (new Figure 9) with a bit of informed speculation about how the different functionally defined subgroups of STN neurons that we characterized might relate to not only different computations but also different pathways through the basal ganglia (i.e., the hyperdirect versus indirect pathway, both of which include the STN). We hope that this summary, along with our more detailed findings, will inform new modeling studies by us and others.

      (7) Causal inference gap - Without perturbation data, it is difficult to determine whether the identified neural modulations are necessary or sufficient for the observed behavioral effects. A brief discussion of this limitation - and how future causal manipulations could test these cluster functions - would be valuable.

      As suggested, we have added the following to the Discussion (line 365): “The exact contributions of these subpopulations are challenging to elucidate, as their intermingled localization makes common perturbation techniques, such as electrical microstimulation or optogenetic manipulations, unsuitable. It would be interesting to examine if these subpopulations differ in molecular or connectivity properties (e.g., as we speculated above) that can be capitalized on to precisely target each subpopulation.”

      Reviewer #1 (Recommendations for the authors):

      (1) Develop or outline a generative multi‑cluster model:

      Consider constructing, even at a conceptual level, a generative network model in which the identified STN clusters serve as interacting computational nodes (e.g., evidence integration, bound modulation, urgency, or evaluative nodes).

      Such a framework could reproduce the simultaneous presence of ramping, transient, and context-sensitive activity patterns observed across clusters.

      Even a simulated or schematic implementation - showing how parameter coupling among these clusters gives rise to the reported firing diversity and behavioral effects - would help clarify the mechanistic implications of your findings.

      As noted above, we believe that a full modeling study is well outside the scope of the present work. However, we have provided a conceptual framework, shown in Figure 9, summarizing our findings and providing some informed speculation about how different subgroups of STN neurons could provide different functions along distinct anatomical pathways.

      (2) Strengthen the link between cluster activity and computation:

      Use cross‑validated or hierarchical regression models to verify the robustness of correlations between cluster‑specific firing measures and fitted drift‑diffusion parameters. This would make the mapping between neural activity and model components more statistically grounded.

      We appreciate the suggestion and thought hard about how we might implement it but ultimately decided our approach is most appropriate, given the strengths and limitations of our dataset. The fundamental issue is that it takes many trials to obtain reliable estimates of DDM parameters. Our approach of creating twelve “pseudo-sessions” for each neuron (half of those for trials with high firing rates, half for trials with low firing rates) balances our ability to obtain those estimates while testing for relationships with firing rate. Any further subdivision of the data for cross validation yields unreliable parameter estimates (i.e., with big error bars). We also chose not to use a hierarchical model and instead took a more unbiased approach by considering how all of the DDM parameters relate to firing rate.

      Despite the simplicity of our approach, we believe that these results are statistically grounded. It is possible that more complex regression models may reveal additional (e.g., non-linear) relationships, but those results would also be less intuitive to interpret. We therefore decided to retain our analysis choice.

      (3) Assess cluster reproducibility:

      Report or include in the supplement the degree of correspondence of cluster identities across monkeys or across independent subsets of trials. Cluster stability metrics (e.g., bootstrap or split‑half analysis) would reassure readers that cluster structure is not dataset‑specific.

      Please see our response above to the main comment #2 regarding the robustness and stability of clustering results.

      (4) Explore population interactions directly:

      You could analyze pairwise or population‑level covariations (e.g., principal components or canonical correlation analysis) to test whether inter‑cluster interactions correspond to model‑predicted dynamics such as competition or normalization.

      Because most of the neurons were recorded in separate sessions and not simultaneously, the suggested population analyses are not feasible.

      Discuss briefly how the proposed generative or dynamical multi‑cluster model could be empirically tested, e.g., using selective perturbation (microstimulation, optogenetic, or pharmacological) in future studies, to evaluate interactions inferred from the current dataset. If feasible, mention how this framework might generalize to other decision contexts beyond oculomotor tasks, such as effort‑reward tradeoffs or inhibitory control, reinforcing the broad relevance of STN computations.

      As suggested, we have added the following to the Discussion (line 366): “The exact contributions of these subpopulations are challenging to elucidate, as their intermingled localization makes common perturbation techniques, such as electrical microstimulation or optogenetic manipulations, unsuitable. It would be interesting to examine if these subpopulations differ in molecular or connectivity properties (e.g., as we speculated above) that can be capitalized on to precisely target each subpopulation.”

      Reviewer #2 (Public review):

      One criticism I would make is that the authors sometimes seem to assume that readers are familiar with their previous work. Indeed, the motivation and choices behind some analyses are not clearly explained. It might be interesting to provide a little more context and insight into these methodological choices. The same is true for the description of certain results, such as the behavioral results, which I find insufficiently detailed, especially since the two animals do not perform exactly the same way in the task.

      We apologize for the lack of detail regarding the behavioral results and analysis choices. To address this issue, we substantially revised the text, particularly in Results and Methods.

      The differences in behavior for the two monkeys were the subject of an entire published study (Fan Y, Gold JI, Ding L, 2018, Ongoing, rational calibration of reward-driven perceptual biases. Elife 7: e36018.). That study showed that these differences most likely arose from the monkeys’ individual sensitivity to the motion stimulus, combined with a heuristic-based strategy to gain satisficing rewards that they all seem to use. We revised the text to acknowledge the individual differences and refer readers to our previous study (line 78): “Both monkeys showed consistent biases toward the large-reward choice (Figure 1B, C). The individual differences in their choice and RT performance reflect individual differences in sensitivity to motion stimulus and a common heuristic-based satisficing strategy, as we demonstrated in a previous study (Fan et al., 2018).”

      Another criticism is the difficulty in following and absorbing all the presented results, given their heterogeneity. This heterogeneity stems from analytical choices that include defining multiple time windows over which activities are studied, multiple task-related or monkey behavioral factors that can influence them, multiple parameters underlying the decision-making phenomena to be captured, and all this without any a priori hypotheses. The overall impression is of an exploratory description that is sometimes difficult to digest, from which it is hard to extract precise information beyond the very general message that multiple subpopulations of neurons exist and therefore that the STN is probably involved in multiple roles during decision-making.

      In response to the three reviewers’ comments on data inclusion and the clustering analysis we presented, we have substantially improved the objectivity and robustness of our approaches, by: 1) applying a data-driven criterion for identifying neurons with robust task-relevant modulation (Figure 4C), 2) removing “outlier” neurons that appear not to share activity profiles with any other neurons in our sample (note that these outlier neurons would be at the outskirts in the cluster space instead of between clusters), 3) increasing the temporal resolution for generating firing rate vectors, and 4) comparing clustering results based on two methods (k-means and linkage). These improvements both sharpened the cluster boundaries and allowed us to observe more robust and distinctive subpopulation-specific relationships between neural activity and computational components in the DDM framework (new Figures 5–7 and their supplementary figures). We believe these updated results clearly demonstrate that: 1) there are different STN subpopulations, and 2) each of the subpopulations encodes a distinct set of functions.

      It would also have been interesting to have information regarding the location of the different identified subpopulations of neurons in the STN and their level of segregation within this nucleus. Indeed, since the STN is one of the preferred targets of electrical stimulation aimed at improving the condition of patients suffering from various neurological disorders, it would be interesting to know whether a particular stimulation location could preferentially affect a specific subpopulation of neurons, with the associated specific behavioral consequences.

      We have added a new Figure 8 to show the localization of neurons with and without task modulation and of neurons from different subpopulations. Consistent with our previous demonstration of intermingled distribution of STN subpopulations, we did not observe any activity pattern-based segregation.

      To relate the activity patterns to previously reported stimulation effects, we added the following to the Discussion (line 307): “This functional diversity, along with a lack of clear anatomical organization, is consistent with the multiple effects of STN stimulation in patient populations on decision-making and our previous results in monkeys, including reductions in response times, a weaker dependence on evidence, and changes in the maximal value and trajectories of the decision bound (Frank et al., 2007; Cavanagh et al., 2011; Coulthard et al., 2012; Green et al., 2013; Zavala et al., 2014; Herz et al., 2016; Pote et al., 2016; Branam et al., 2024).”

      Therefore, this paper is interesting because it complements other work from the same team and other studies that demonstrate the likely important role of the STN in decision-making. This will be of interest to the decision-making neuroscience community, but it may leave a sense of incompleteness due to the difficulty in connecting the conclusions of these different studies. For example, in the discussion section, the authors attempt to relate the different neuronal populations identified in their study and describe some relatively consistent results, but others less so.

      We hope that our revised Results and Discussion clarify the conclusions that can be drawn from this and other related studies.

      Reviewer #2 (Recommendations for the authors):

      (1) Introduction, l. 47-48: It would be interesting to provide more details on these three populations in order to better understand why we need additional experiments to more comprehensively define their roles.

      We now give more details in the Introduction about the remaining questions we aimed to address in this study (line 50): “However, the specific computational roles that these different subpopulations play in decision-making and other cognitive functions remain not well understood. For example, two of the subpopulations had overall activity patterns that were consistent with two different models in which the STN modulated the decision bound (Ratcliff and Frank, 2012; Wei et al., 2015), but the exact nature of this modulation is not known. The other subpopulation’s general activity patterns were consistent with a model of STN mediating evidence accumulation (Bogacz and Gurney, 2007), but it is unclear if and how this activity contributes to how evidence is weighed, biased, or accumulated.

      Our previous attempt to distinguish these alternatives using electrical microstimulation was unsuccessful because that manipulation likely affected highly intermingled subpopulations with different functions.”

      (2) Results, l. 71-73: A slightly more detailed description of the behavioral results would be appreciated, especially since the two monkeys do not behave exactly the same way in the task, particularly in terms of reaction times (Figure 1B top-right versus bottom-right).

      We revised the text to acknowledge the individual differences and refer readers to our previous study (line 78): “Both monkeys showed consistent biases toward the large-reward choice (Figure 1B, C). The individual differences in their choice and RT performance reflect individual differences in sensitivity to motion stimulus and a common heuristic-based satisficing strategy, as we demonstrated in a previous study (Fan et al., 2018).”

      (3) Figure 2G-I: Were the multiple linear regressions performed only in the asymmetric reward condition?

      Yes. We added in Methods (line 487): “All analyses were performed on activity from the asymmetric-reward task.”

      (4) Very often in the text, the authors use terms that refer to concepts or methods that are difficult to grasp on the first reading, especially if we are not familiar with the team's previous publications. This is the case, for example, with "joint modulation," "reward context," "reward expectation," "k-means clustering," "tSNE," "Silhouette score for neurons," "Rand index," etc. All the explanations are minimal, and it would be helpful to clearly define these terms and provide some justification and insight to support the use of the analyses and the resulting variables, all of which would facilitate the reading of the manuscript.

      We now define these terms explicitly in the text (emphasis added here for clarity):

      (Results, line 129): “Using a previous definition of “joint modulation” (Doi et al., 2020), including modulation separately by motion coherence and reward context or reward size and modulation by the interaction of motion coherence and reward size, we found that ~40% of the neurons showed joint modulation during motion viewing.”

      (Results, line 71): “… for which we separately manipulated the noisy evidence (motion direction and strength) and reward context (a larger juice reward for a correct choice associated with one of the two directions).”

      (Results, line 250): “Choice accuracy describes the probability that a choice is correct given the evidence. Reward expectation describes the expected reward given a choice.”

      (Methods, line 550): “To quantify the consistency between two runs of clustering, we computed the Rand index as the number of neuron pairs with consistent grouping (i.e., they were placed in the same cluster for both runs or they were placed in different clusters for both runs), normalized by the total number of possible neuron pairs. A value of 1 indicates that the two clustering runs produce identical results, and a value of 0 indicates that the two runs do not agree on any pairs of neurons.”
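      As an illustration of the definition quoted above (a minimal sketch, not the analysis code used in the study), the Rand index can be computed by counting the neuron pairs that two clustering runs treat the same way. The function name `rand_index` and the toy label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that two clustering runs group consistently.

    A pair is consistent if both runs put the two items in the same
    cluster, or both runs put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items to form a pair")
    agree = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agree += 1
    return agree / len(pairs)


# Toy example with six hypothetical neurons.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping, only the label names differ
run_3 = [0, 0, 1, 2, 2, 2]  # one neuron moved to another cluster

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))  # 0.8
```

      Because the measure depends only on pairwise co-membership, relabeling the clusters does not change it, which is what makes it suitable for comparing k-means and linkage results.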

      To quantify the separation of clusters, we computed silhouette scores as the difference between mean intra-cluster distance and the mean nearest-cluster distance, normalized by the maximum of the two values. A positive score indicates that the member is closer to its same-cluster neighbors than different-cluster neighbors. Clustering runs with high mean silhouette score were considered to have better cluster separation.
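      The silhouette definition above can be sketched as follows. This is a hedged example only: it assumes Euclidean distance between activity vectors (the distance metric is not stated here), takes the sign convention implied by the text (nearest-cluster distance minus intra-cluster distance, so that positive means well clustered), and assigns 0 to members of single-item clusters. The function name and toy points are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Per-item silhouette score: (b - a) / max(a, b).

    a = mean distance to the other members of the item's own cluster
    b = mean distance to the members of the nearest other cluster
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two well-separated toy groups in a 2-D activity space.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_scores(points, [0, 1, 0, 1])   # labels cut across the groups
print(sum(good) / len(good), sum(bad) / len(bad))
```

      With labels that match the two groups every score is positive, whereas labels that cut across the groups give negative scores, matching the interpretation given in the text.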

      We no longer use tSNE visualization.

      (5) Figure 5A, caption: A quick description of the parameters would be useful.

      We added the description of DDM parameters in the caption of new Figure 4.

      (6) Results l. 222: Why does the analysis only concern epoch 5? I suggest justifying this choice. Also, the text indicates a "trend" but Figure 5C shows a significant result (p=0.0129).

      These statements have been removed from the updated manuscript.

      (7) Methods, l. 443: The authors should report more details about how they decided that neurons were task-related or not. "Visual inspection" sounds like a very vague and subjective criterion.

      We now apply a more objective criterion for identifying neurons with task-relevant modulation:

      (Results, line 145): “To focus on neurons with the most robust task-relevant activity, we measured firing rates during a baseline period (300 ms before motion onset) and sliding 100 ms windows from motion onset to 150 ms after saccade onset in 50 ms steps. We identified the maximal and minimal z-scores, representing the peak activation and suppression, respectively, for each neuron across all trial conditions (Figure 4C). We applied a threshold of z-score >1.5 for either activation or suppression and focused further analyses on the 87 neurons that met this selection criterion (n = 62 and 25 for monkeys C and F, respectively).”
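      The selection criterion quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: z-scores are assumed to be computed relative to the mean and standard deviation of the baseline firing rate, and the function names, inputs, and toy numbers are hypothetical.

```python
from statistics import mean, stdev


def window_edges(start_ms, stop_ms, width_ms=100, step_ms=50):
    """Sliding windows [t, t + width) that fit between start and stop."""
    edges = []
    t = start_ms
    while t + width_ms <= stop_ms:
        edges.append((t, t + width_ms))
        t += step_ms
    return edges


def peak_z_scores(baseline_rates, window_rates):
    """Maximal and minimal z-score of windowed rates relative to baseline.

    Assumption: z = (rate - baseline mean) / baseline standard deviation.
    """
    mu = mean(baseline_rates)
    sd = stdev(baseline_rates)
    if sd == 0:
        raise ValueError("baseline rates have zero variance")
    z = [(r - mu) / sd for r in window_rates]
    return max(z), min(z)


def is_task_modulated(baseline_rates, window_rates, threshold=1.5):
    """True if peak activation or peak suppression exceeds the threshold."""
    z_max, z_min = peak_z_scores(baseline_rates, window_rates)
    return z_max > threshold or z_min < -threshold


# Toy data: baseline rates (spikes/s) and rates in successive windows.
baseline = [10, 12, 8, 11, 9]
windows = window_edges(0, 300)            # 100 ms windows in 50 ms steps
activated = [10, 11, 14, 9, 10]           # one window well above baseline
suppressed = [10, 9, 6, 10, 10]           # one window well below baseline
flat = [10, 11, 9.5, 10, 10]              # stays near baseline

print(len(windows), is_task_modulated(baseline, activated),
      is_task_modulated(baseline, suppressed),
      is_task_modulated(baseline, flat))
```

      In practice the window end would be set per trial to 150 ms after saccade onset, and the maximum and minimum would be taken across all trial conditions, as described in the quoted text.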

      (8) A map of the location of the different STN neuron clusters found in this study within the structure would be very interesting.

      We have added a new Figure 8 to show the localization of neurons with/without task modulation and of neurons from different subpopulations.

      (9) Unless I am mistaken, there is no mention of data availability in this manuscript.

      The data availability statement was/is on the submission form.

      Data Availability: All electrophysiological data and the code for the analyses presented in the paper will be deposited in a publicly accessible domain when the paper is published.

      Previously Published Datasets: Fan, Doi, Gold, Ding, 2020, source data for Figure 3-S2 in eLife paper https://doi.org/10.7554/eLife.60535:

      https://cdn.elifesciences.org/articles/60535/elife-60535-fig3-data1-v1.csv

      Reviewer #3 (Public review):

      The primary weakness of the paper lies in the claim that STN contains multiple sub-populations with distinct involvements in decision making, which is inadequately supported by the paper's methods and analyses.

      First, while it is clear that the ~150 recorded neurons across 2 monkeys (91, 59 respectively) display substantial heterogeneity in their activity profiles across time and across stimulus/reward conditions, the claim of sub-populations largely rests on clustering a *subset of less than half the population - 66 neurons (48, 15 respectively) - chosen manually by visual inspection*. The full population seems to contain far more decision-modulated neurons, whose response profiles seem to interpolate between clusters. Moreover, it is unclear if the 4 clusters hold for each of the 2 monkeys, and the choice of 4-5 clusters does not seem well supported by metrics such as silhouette score, etc, that peak at 3 (1 or 2 were not attempted). From the data, it is easier to draw the conclusion that the STN population contains neurons with heterogeneous response profiles that smoothly vary in their tuning to different decision variables, rather than distinct sub-populations.

      In response to the three reviewers’ comments on data inclusion and the clustering analysis we presented, we have substantially improved the objectivity and robustness of our approaches, by: 1) applying a data-driven criterion for identifying neurons with robust task-relevant modulation (Figure 4C), 2) removing “outlier” neurons that appear not to share activity profiles with any other neurons in our sample (note that these outlier neurons would be at the outskirts in the cluster space instead of between clusters), 3) increasing the temporal resolution for generating firing rate vectors, and 4) comparing clustering results based on two methods (k-means and linkage). These improvements both sharpened the cluster boundaries and allowed us to observe more robust and distinctive subpopulation-specific relationships between neural activity and computational components in the DDM framework (new Figures 5–7 and their supplementary figures). We believe these updated results clearly demonstrate that: 1) there are different STN subpopulations, and 2) each of the subpopulations encodes a distinct set of functions.

      We performed additional analyses to assess the robustness of the clustering results. First, following the reviewer’s suggestion, we performed clustering based on the two monkeys’ data both combined and separately (new Fig 5-S3). Clustering of data from both monkeys combined, compared to each monkey considered separately, had Rand index values of 0.94 and 1 for monkeys C and F, respectively (i.e., neurons from one monkey were assigned to the same cluster regardless of whether the clustering was based on data from that monkey alone or both monkeys together), indicating comparable cluster boundaries for the two monkeys. Second, we performed clustering based on pseudo-vectors derived from sampling a subset of trials for each neuron and found that the clustering results were stable and robust when based on as few as 40% of the trials (new Fig 5-S4). Third, we generated a new figure (Figure 5-S1), using dendrograms to visualize how the neurons relate to each other. The dendrogram in Figure 5-S2 is more consistent with (at least) three distinct subpopulations of neurons than with the null hypothesis of a continuous distribution with smoothly-varying response profiles.

      Second, assuming the existence of sub-populations, it is unclear how their time- and condition-varying relationship with DDM parameters is to be interpreted. These relationships are inferred by splitting trials based on individual neurons' firing rates in different task epochs and reward contexts, and regressing onto the parameters of separate DDMs fit to those subsets of trials. The result is that different sub-populations show heterogeneous relationships to different DDM parameters over time - a result that, while interesting, leaves the computational involvement of these sub-populations/implementation of the decision process unclear.

      The improvements we made to the clustering procedure both sharpened the cluster boundaries and allowed us to observe more robust and distinctive subpopulation-specific relationships between neural activity and computational components in the DDM framework (new Figures 5-7 and their supplementary figures). These updated results demonstrate that: 1) there are different STN subpopulations, and 2) each of the subpopulations encodes a particular set of functions.

    1. eLife Assessment

      This manuscript represents a valuable contribution to understanding motion processing in the visual cortex. Based on a heterogeneous collection of previous empirical findings, the authors show that the diversity of tuning curves in the middle temporal (MT) area, in response to moving center-surround images, can be explained by Bayesian inference combined with neural sampling. The model rests on strong and solid assumptions about the prior and likelihood; independent evidence that neither of these factors is misspecified would strengthen the work.

    2. Joint Public Review:

      Summary:

      Lengyel et al. present a normative model of single-neuron activity in area MT, which is known for its role in processing visual motion. The authors focus on responses to a center and a surround that move at different velocities. Both the center and surround are rigid: picture a set of dots all moving at the same velocity. The center dots are arranged in a disc; the surround dots in an annulus, and in both cases, the velocity of each is time-varying.

      The core proposal is that the brain does not process motion in a fixed coordinate system, but instead infers a latent reference frame, and that MT neurons encode motion either in retinal coordinates or relative to this inferred reference frame. The model is meant to overcome a challenge in the existing literature on area MT: on the one hand, experimental findings are heterogeneous, including both surround suppression and surround facilitation of neural responses; on the other, existing models are either designed ad hoc to capture specific phenomena or they are somewhat general (e.g., divisive normalization), but in either case they can't explain the full range of responses. This manuscript proposes that the full range of responses in MT is explained as Bayesian inference over the reference frame in which center motion speed and direction should be estimated. The model extends one introduced in a previous publication from the same lab (Shivkumar et al. 2025). That publication focused on human perception of motion; this one makes predictions about MT mean responses and across-trial variability.

      Strengths:

      Processing visual motion is important for normal visual function, including for the integration and segmentation of visual objects. This manuscript presents a normative theory, supported by recent human perceptual data, and extends it to make predictions about neural firing rate and variability in area MT. The theory is well motivated and supported by the simulation analysis and comparison to data. It provides new insight into how causal inference of relative motion reference frames can modulate neural activity in MT. The richness of the theory's prediction can guide future experiments. In particular, the theory explains both center-surround suppression and facilitation, unifying disparate empirical observations in MT for which no unified explanation had been proposed. The manuscript also demonstrates a new method to map ideal observer predictions (posterior distributions over speed and direction, which are dependent on the posterior inference over reference frames) onto predicted neural activity for center-surround stimuli, by only considering basic tuning curves measured in the center-alone condition. This is a useful methodological contribution. The manuscript offers a thorough review of CS modulation studies in MT.

      Weaknesses:

      We found this paper difficult to read for two reasons. First, math is generally explained in words. This made it extremely difficult (impossible for some reviewers) to understand the details of the model, which are important. We're not against words, but it's critical that they be accompanied by equations.

      Second, the manuscript is not self-contained in the sense that many of the motivations, assumptions, and limitations of the approach are only evident if one carefully reads the group's prior work, Shivkumar et al. (2025). Following up on previous work isn't necessarily a flaw, but the introduction of the paper is written from a very broad perspective that does not effectively summarize the prior work and lay out the specific questions that motivate the current study. For example, it is not clear from the introduction whether the authors believe this framework can explain all sorts of center-surround interactions (including in non-motion stimuli and in other areas like the retina), or if the focus is only on area MT.

      Finally, the connection to neural data is confusing and mostly qualitative. The authors create a library of "hypothetical but plausible tuning curves" and show that their modeling framework is flexible enough to capture a variety of center-surround interactions. Although they do state that their model can't explain all possible tuning curves, it's still hard to tell whether they have particularly strong evidence for the Bayesian causal inference hypothesis.

      We also have several technical, but potentially important, comments.

      Line 427: 'Our framework not only reinterprets past findings but also generates new, testable predictions. The model makes directly testable predictions for surround modulation. Facilitation, for instance, is predicted for neurons encoding retinal-centric motion (v_center) under high sensory uncertainty. In contrast, suppression is the hallmark of neurons encoding relative motion (v^relative_center) with respect to a surround-influenced reference frame.' It seems that to test the predictions of the model, one would need to first determine if a neuron encodes retinal or relative motion, without relying on the patterns predicted by this model, and then test if the two types of neurons behave as predicted. It is unclear how one can obtain this labeling of neurons independently of the model predictions.

      Line 492: 'This offers a principled account of how the same population of neurons can support both perceptual states (integration and segmentation)'. However, because the theory assumes each neuron encodes either center velocity or center velocity relative to a moving reference frame, but not both, it does not explain how the same neuron could shift from suppression to facilitation. It may be worth considering another possibility, using V1 surround modulation as an analogy. Different neuron types are required to implement the surround computation: in mouse V1, SST interneurons are surround-facilitated, and they are necessary to implement surround suppression of pyramidal neurons (https://pmc.ncbi.nlm.nih.gov/articles/PMC3621107), but their (SST) outputs are not communicated to downstream targets. In that view, facilitation is therefore not a signature of some neurons encoding a type of latent variable; it is only there as an intermediate step in the computation of the other latents (those that require suppression).

      Misspecification of either the prior or likelihood can be a problem for Bayesian inference. Discussion of this point -- and in particular evidence (say from analysis of natural scene statistics in the case of the prior) that both are well-specified -- would strengthen the manuscript.

    1. eLife Assessment

      This valuable study raises the intriguing possibility that crickets use bat-associated odors as cues of predation risk, extending the classic bat-insect arms race beyond its usual acoustic framework. The authors combine fecal metabarcoding, behavioral assays, electrophysiology, chemical analyses, and field observations to show that Loxoblemmus equestris avoids the odor of the insectivorous bat Scotophilus kuhlii, and that synthetic (-)-limonene can elicit antennal responses, avoidance in the laboratory, and reduced calling activity in the field. However, the evidence is currently incomplete because the identity, biological source, natural concentration, and ecological specificity of limonene as a bat-derived predator cue require stronger support, including clearer quantification, contamination controls, individual-level odor data, and evidence that crickets can distinguish bat-associated limonene from common environmental sources. The work will be of interest to researchers in sensory ecology, chemical ecology, predator-prey interactions, and bat-insect coevolution.

    2. Reviewer #1 (Public review):

      The manuscript examines whether insects can use bat odor as a cue of predation risk. The authors focus on the insectivorous bat Scotophilus kuhlii and the cricket Loxoblemmus equestris. They first use fecal DNA metabarcoding to show that crickets are part of the bat's diet, and field surveys to show that L. equestris is abundant at local foraging sites. In laboratory Y-tube assays, the authors show that crickets strongly avoid air carrying bat body odor. Gas chromatography coupled with electroantennographic detection showed that cricket antennae respond to components of bat odor. Chemical analyses identified several volatile compounds, with 2,2-dimethylheptane and (−)-limonene associated with antennal responses. Further analyses suggested that snout secretions are likely to contribute to the bat's body odor. The authors then tested individual compounds. Among the commercially available candidates, (−)-limonene elicited a strong antennal response and was sufficient to cause avoidance in the olfactometer. In field plots, spraying (−)-limonene reduced cricket calling activity relative to pre-exposure levels, whereas calling increased in control plots treated with hexane. Overall, the study argues that crickets can detect a vertebrate predator through olfactory cues and that a single bat-associated volatile can trigger antipredator behavior.

      This is an interesting and enjoyable study that addresses an understudied aspect of predator-prey interactions. The manuscript is clearly written, the experiments are presented in a logical sequence, and the figures are crisp and easy to follow. I really appreciated the combination of behavioral assays, electrophysiology, chemical analysis, and field observations.

      My main issue concerns the identity and biological origin of the proposed bat odor cue, (−)-limonene. Limonene seems like an unusual compound to be emitted endogenously by a mammal, particularly by an insectivorous bat. It would be helpful if the authors could clarify whether mammals are known to synthesize this compound de novo, and, if not, what the likely source of this plant-associated terpene would be in S. kuhlii. Possible sources could include environmental exposure, diet, roosting material, handling, or temporary housing conditions.

      I do not doubt that crickets avoid synthetic (−)-limonene. Indeed, this result is quite plausible given that limonene is widely used in insect repellent or repellent-associated fragrance products. However, this also makes contamination an important issue to address explicitly. How did the authors exclude the possibility that limonene entered the samples from human-associated sources, such as insect repellents, soaps, cleaning products, field equipment, cloth bags, cages, gloves, or other materials used while handling wild-caught bats? It would strengthen the manuscript to report limonene levels for individual bat odor collections, all relevant blanks, and any handling or housing controls.

      More broadly, given the common occurrence of limonene in plants and human-associated products, I am not yet convinced that it would function as a reliable "keystone kairomone" as suggested around line 253. How would crickets distinguish bat-associated limonene from limonene emitted by a mint leaf, citrus peel, pine material, or other non-threatening environmental sources? The authors may wish to soften this interpretation or provide additional evidence that crickets respond to limonene in a bat-specific context, perhaps through concentration, temporal patterning, co-occurring volatiles, or enantiomeric composition.

    3. Reviewer #2 (Public review):

      Summary:

      Many insects possess extremely sensitive olfactory systems that can detect chemical signals from distances of several kilometers. For decades, the arms race between bats and insects has served as a prime example of acoustic co-evolution. The auditory adaptations of insects to echolocation have been well documented. Crickets have a multi-sensory predator recognition system with keen olfactory, tactile, and auditory senses. However, whether crickets can use the scent of bats to avoid them remains unknown at present. The authors hypothesized that cricket prey (Loxoblemmus equestris) might eavesdrop on predator bat (Scotophilus kuhlii) VOCs as an early warning. L. equestris is one of the prey species of S. kuhlii, and the authors demonstrated that the body odor of the insectivorous bat S. kuhlii triggers robust avoidance and electrophysiological responses in the cricket L. equestris, and that a single compound, (-)-limonene, is sufficient to elicit this avoidance in the laboratory and suppress calling in the field. Overall, this paper has a complete chain of evidence and should be a highly praised study.

      Comments:

      (1) Olfactory eavesdropping can transcend the evolutionary divide between vertebrate predators and invertebrate prey, enabling invertebrates to trigger defensive avoidance behaviors in response to predator-derived volatile odors. This phenomenon is empirically well-documented and requires no excessive emphasis.

      (2) Without quantitative analysis, and without knowing the relative content of the key substance limonene, I do not understand how the concentration of the limonene standard for EAD, or the concentration used in the field experiments, was determined. How was the limonene concentration for field spraying chosen, and does it reflect concentrations in the wild?

      (3) Figures 1C and D should compare the GC-EAD response of L. equestris to the odor of bat body and the odor of bat nasal secretions. It should not be compared with the air control group. Figure 1D has the same problem.

    4. Author response:

      eLife Assessment

      This valuable study raises the intriguing possibility that crickets use bat-associated odors as cues of predation risk, extending the classic bat-insect arms race beyond its usual acoustic framework. The authors combine fecal metabarcoding, behavioral assays, electrophysiology, chemical analyses, and field observations to show that Loxoblemmus equestris avoids the odor of the insectivorous bat Scotophilus kuhlii, and that synthetic (-)-limonene can elicit antennal responses, avoidance in the laboratory, and reduced calling activity in the field. However, the evidence is currently incomplete because the identity, biological source, natural concentration, and ecological specificity of limonene as a bat-derived predator cue require stronger support, including clearer quantification, contamination controls, individual-level odor data, and evidence that crickets can distinguish bat-associated limonene from common environmental sources. The work will be of interest to researchers in sensory ecology, chemical ecology, predator-prey interactions, and bat-insect coevolution.

      We thank the editors for recognizing the novelty and significance of our work.

      The central aim and contribution of this study is to provide direct evidence that an insect can detect a phylogenetically distant vertebrate predator, an insectivorous bat, via olfaction and initiate avoidance behavior. Our dietary analysis of bats and survey of potential prey in foraging habitats established a predator–prey relationship between the Asiatic lesser yellow house bat (Scotophilus kuhlii) and the cricket Loxoblemmus equestris. In addition, behavioral assays showed that the crickets strongly avoid air carrying bat body odor, and electrophysiological recordings using GC-EAD confirmed that volatiles from S. kuhlii body odor are detected by L. equestris antennae. Together, these results provide strong evidence that this insect can perceive and avoid the body odor of its predator, S. kuhlii. We are grateful that the editors and reviewers acknowledged the main conclusions.

      We also investigated the sources of bat body odor, its major volatile components, and the behavioral, physiological, and ecological responses of the crickets to limonene. The purpose of these studies was to test the hypothesis that elemental perception—detection of a single compound—could serve as a mechanism by which crickets perceive bat odor. We found that limonene was present in bat odor, elicited antennal responses in crickets, induced avoidance behavior in olfactometer assays, and reduced calling activity in the field. Together, these results support the idea that elemental perception is a plausible and efficient strategy for initiating anti-predator behavior against bats.

      We appreciate the editors’ constructive comments. The editors and Reviewer #1 suggested that limonene, as a bat-derived predator cue, requires more evidence, mainly for two reasons:

      (1) Limonene is common in plants but rare in mammals; the reviewer raised the possibility that the limonene identified in our study may have been introduced as exogenous contamination during handling.

      (2) The ability of crickets to distinguish bat-associated limonene from limonene originating from common environmental sources (e.g., plants) remains unclear.

      Below we address these points and describe the revisions we will make to strengthen the manuscript.

      On the potential contamination origin of limonene

      We agree that limonene is common in plants and human-associated products, and we carefully considered this possibility. However, several lines of evidence suggest that contamination is highly unlikely. First, we followed strict experimental protocols: all instruments were cleaned with ethanol and oven-dried before each use; bats were held in stainless-steel cages and in cloth bags made of degreased bleached cotton washed with purified water. Second, limonene was not detected in any blank controls (neither in the empty-chamber air samples for whole-body odor collection nor in the blank cotton swabs for secretion analysis), whereas it was consistently identified in multiple bat snout-secretion samples. Third, previous studies have independently reported limonene in the secretions of several bat species (Faulkes et al., 2019, PeerJ; Zhang et al., 2022, Ann. N.Y. Acad. Sci.). Moreover, recent work suggests that skin-associated microorganisms can contribute to bat volatile profiles (Sun et al., 2026, BMC Biology), and some microbes possess enzymes involved in limonene biosynthesis. Therefore, we are confident that the limonene we detected originates from the bats themselves (either endogenously or via their microbiota), not from exogenous contamination.

      On how crickets might distinguish bat-derived limonene from environmental sources

      This is an insightful question. As discussed in our original manuscript (Discussion section), crickets may not rely exclusively on limonene as a standalone cue. First, our GC-EAD analyses showed that cricket antennae respond to multiple bat volatiles beyond limonene, suggesting that additional compounds, either alone or in synergistic blends, may contribute to predator recognition. Elemental perception via (–)-limonene therefore likely represents one effective strategy within a broader olfactory toolkit, rather than the sole mechanism. Second, under natural conditions, crickets could also integrate olfactory information with non-chemical ecological signals, such as temporal patterning (bats are active at night) and spatial patterning (specific foraging habitats), to further reduce false alarms.

      However, fully testing these hypotheses would require substantial additional work. It would be necessary to quantify natural limonene concentrations in bat odors versus various plant sources, conduct choice experiments with ecologically relevant concentrations and blends, and perhaps manipulate the olfactory landscape in the field. It would also be necessary to examine how other volatile compounds in bat body odor interact with limonene to shape cricket recognition. Bat body odor contains dozens of compounds, and it is challenging to determine the necessity and sufficiency of each. These difficulties are not unique to our study: the question of how animals distinguish identical compounds from different biological sources is widespread in chemical ecology and is rarely resolved in a single study. These lines of investigation are important, but they are not the focus of the present study. We have therefore put them forward as an important direction for future research.

      Revisions we will make:

      (1) In the Methods section, we will add detailed descriptions of contamination controls and report blank-control results to demonstrate that exogenous contamination is very unlikely.

      (2) In the Discussion section, we will expand the discussion of the possible biological sources of limonene (including microbiota) in light of our results and the literature.

      (3) In the Discussion and Conclusion, we will state more cautiously the role of limonene as a bat-derived cue, acknowledging that while it is sufficient to trigger avoidance, additional work is needed to establish its ecological specificity.

      We believe these revisions will address the editors’ and the reviewers’ concerns while preserving the main conclusion that olfaction can mediate bat detection by crickets.

      Reviewer #1 (Public review):

      The manuscript examines whether insects can use bat odor as a cue of predation risk. The authors focus on the insectivorous bat Scotophilus kuhlii and the cricket Loxoblemmus equestris. They first use fecal DNA metabarcoding to show that crickets are part of the bat's diet, and field surveys to show that L. equestris is abundant at local foraging sites. In laboratory Y-tube assays, the authors show that crickets strongly avoid air carrying bat body odor. Gas chromatography coupled with electroantennographic detection showed that cricket antennae respond to components of bat odor. Chemical analyses identified several volatile compounds, with 2,2-dimethylheptane and (−)-limonene associated with antennal responses. Further analyses suggested that snout secretions are likely to contribute to the bat's body odor. The authors then tested individual compounds. Among the commercially available candidates, (−)-limonene elicited a strong antennal response and was sufficient to cause avoidance in the olfactometer. In field plots, spraying (−)-limonene reduced cricket calling activity relative to pre-exposure levels, whereas calling increased in control plots treated with hexane. Overall, the study argues that crickets can detect a vertebrate predator through olfactory cues and that a single bat-associated volatile can trigger antipredator behavior.

      This is an interesting and enjoyable study that addresses an understudied aspect of predator-prey interactions. The manuscript is clearly written, the experiments are presented in a logical sequence, and the figures are crisp and easy to follow. I really appreciated the combination of behavioral assays, electrophysiology, chemical analysis, and field observations.

      My main issue concerns the identity and biological origin of the proposed bat odor cue, (−)-limonene. Limonene seems like an unusual compound to be emitted endogenously by a mammal, particularly by an insectivorous bat. It would be helpful if the authors could clarify whether mammals are known to synthesize this compound de novo, and, if not, what the likely source of this plant-associated terpene would be in S. kuhlii. Possible sources could include environmental exposure, diet, roosting material, handling, or temporary housing conditions.

      I do not doubt that crickets avoid synthetic (−)-limonene. Indeed, this result is quite plausible given that limonene is widely used in insect repellent or repellent-associated fragrance products. However, this also makes contamination an important issue to address explicitly. How did the authors exclude the possibility that limonene entered the samples from human-associated sources, such as insect repellents, soaps, cleaning products, field equipment, cloth bags, cages, gloves, or other materials used while handling wild-caught bats? It would strengthen the manuscript to report limonene levels for individual bat odor collections, all relevant blanks, and any handling or housing controls.

      More broadly, given the common occurrence of limonene in plants and human-associated products, I am not yet convinced that it would function as a reliable "keystone kairomone" as suggested around line 253. How would crickets distinguish bat-associated limonene from limonene emitted by a mint leaf, citrus peel, pine material, or other non-threatening environmental sources? The authors may wish to soften this interpretation or provide additional evidence that crickets respond to limonene in a bat-specific context, perhaps through concentration, temporal patterning, co-occurring volatiles, or enantiomeric composition.

      We thank Reviewer #1 for the positive evaluation and for recognizing the study as “interesting and enjoyable.” We greatly appreciate the endorsement of our integrative approach combining behavioral assays, electrophysiology, chemical analysis, and field observations. The core conclusion that crickets can detect and avoid bats via olfaction is well supported by our data, and we are pleased that the reviewer has recognized this central finding.

      We are grateful for the reviewer’s constructive comments on the biological source and ecological specificity of limonene. We addressed both aspects in detail in our response to the editors above; here we briefly restate the key points.

      On the biological origin of limonene and potential contamination

      We agree that limonene is common in plants and human-made products, but relatively unusual for a mammal to emit endogenously. We have carefully examined the possibility of contamination and believe it is highly unlikely for the following reasons:

      (1) Strict experimental protocols: All experiments were conducted in a dedicated space. Instruments were cleaned with ethanol and oven-dried before and after each use. Cloth bags used to hold bats were made of degreased bleached cotton and washed with purified water; holding cages were stainless steel.

      (2) Blank controls: Limonene was not detected in any blank control samples, neither in the empty-chamber air controls for whole-body odor collection nor in the blank cotton swabs used for secretion analysis. In contrast, limonene was consistently identified in multiple bat snout-secretion samples.

      (3) Independent reports: Limonene has been previously identified in the secretions of several bat species (Faulkes et al., 2019, PeerJ; Zhang et al., 2022, Ann. N.Y. Acad. Sci.), indicating that its presence is not unique to our study or handling conditions.

      (4) Potential microbial origin: Even if bats do not synthesize limonene de novo (a capacity for which there is currently no evidence), recent work shows that skin-associated microorganisms can substantially shape bat volatile odors (Sun et al., 2026, BMC Biology). Some of these microbes possess enzymes involved in limonene biosynthesis, making bat-associated microbiota a plausible biological source of this compound.

      Thus, the limonene we detected is highly likely to originate from the bats themselves (directly or via their microbes) rather than from contamination.

      On how crickets distinguish bat-associated limonene from environmental sources

      This is an excellent and important question. As we briefly discussed in the original manuscript, crickets may not rely exclusively on limonene as a bat-specific cue. Under natural field conditions, they could integrate olfactory information with other ecological cues, for example, temporal and spatial patterning (bats are active at night in specific foraging habitats), co-occurrence with other bat-specific volatiles (the full odor blend contains many compounds), or even concentration thresholds that differ between bat emissions and plant sources. We hypothesize that such context-specific integration could minimize false alarms.

      However, we also recognize that fully testing these hypotheses would require substantial additional work: quantifying natural limonene concentrations in bat odors versus various plant sources, conducting choice experiments with ecologically relevant concentrations and blends, and perhaps manipulating the olfactory landscape in the field. These are important questions, but they are not the central focus of the present study, whose primary aim is to provide evidence that olfaction, including elemental perception of a single compound, can function in this predator-prey system. We have therefore framed this as an important direction for future research.

      Revisions we will make:

      (1) In the Methods section, we will add detailed descriptions of contamination controls and present blank-control results to demonstrate that exogenous contamination is very unlikely.

      (2) In the Discussion section, we will expand the discussion of limonene’s biological sources (including microbial contributions) and explicitly acknowledge the need for future work on how crickets might discriminate bat-derived from plant-derived limonene.

      (3) In the Conclusion, we will more cautiously characterize limonene’s ecological role, emphasizing that it is sufficient to trigger avoidance but that its natural specificity requires further investigation.

      We thank the reviewer again for these insightful comments, which will help us improve the manuscript.

      Reviewer #2 (Public review):

      Summary:

      Many insects possess extremely sensitive olfactory systems that can detect chemical signals from distances of several kilometers. For decades, the arms race between bats and insects has served as a prime example of acoustic co-evolution. The auditory adaptations of insects to echolocation have been well documented. Cricket has a multi-sensory predator recognition system with keen olfactory, tactile, and auditory senses. However, whether crickets can use the scent of bats to avoid them remains unknown at present. The authors hypothesized that cricket prey (Loxoblemmus equestris) might eavesdrop on predator bat (Scotophilus kuhlii) VOCs as an early warning. L. equestris is one of the prey species of S. kuhlii, and the authors demonstrated that the body odor of the insectivorous bat S. kuhlii triggers robust avoidance and electrophysiological responses in the cricket L. equestris, and that a single compound, (-)-limonene, is sufficient to elicit this avoidance in the laboratory and suppress calling in the field. Overall, this paper has a complete chain of evidence and should be a highly praised study.

      Comments:

      (1) Olfactory eavesdropping can transcend the evolutionary divide between vertebrate predators and invertebrate prey, enabling invertebrates to trigger defensive avoidance behaviors in response to predator-derived volatile odors. This phenomenon is empirically well-documented and requires no excessive emphasis.

      (2) Without quantitative analysis, and without knowing the relative content of the key substance limonene, I do not understand how the concentration of the limonene standard for EAD, or the concentration used in the field experiments, was determined. How was the limonene concentration for field spraying chosen, and does it reflect concentrations in the wild?

      (3) Figures 1C and D should compare the GC-EAD response of L. equestris to the odor of bat body and the odor of bat nasal secretions. It should not be compared with the air control group. Figure 1D has the same problem.

      We sincerely thank Reviewer #2 for the high praise (“complete chain of evidence,” “highly praised study”) and for the constructive suggestions to further improve the manuscript.

      On the novelty of olfactory eavesdropping across the vertebrate–invertebrate divide

      We agree with the reviewer that “olfactory eavesdropping can transcend the evolutionary divide between vertebrate predators and invertebrate prey” and that such phenomena have been documented. However, we would like to note that empirical examples remain relatively scarce, especially those that combine chemical identification, electrophysiology, behavioral assays, and field validation within a confirmed predator–prey relationship. We will adjust the wording in the Introduction and Discussion to reflect the current state of knowledge more accurately, acknowledging prior work while clarifying the added value of our study.

      On quantitative analysis and concentration choices for limonene in EAG and field experiments

      EAG concentration gradients: The concentrations used in our EAG experiments (including the 1% and 10% v/v dilutions of (−)-limonene) were selected based on standard practices in insect chemical ecology and on previous studies investigating dose-dependent antennal responses to volatile compounds (e.g., Tang et al., 2024, Int. J. Biol. Macromol.). The goal was to determine whether L. equestris antennae are capable of detecting limonene across a range of concentrations, not to precisely match natural emission levels or to determine behavioral thresholds. Our data clearly show concentration-dependent antennal responses, establishing physiological sensitivity.

      Field spray concentration: We acknowledge that the concentration used in the field experiment (10% v/v limonene sprayed over 25 m²) does not represent the exact amount of limonene naturally emitted by bats. Natural odor plumes are highly complex; the diffusion, dilution, and persistence of volatiles depend on multiple factors (airflow, turbulence, temperature, humidity, vegetation structure, etc.). Accurately reconstructing such dynamics would require detailed quantitative measurements and possibly fluid-dynamic modeling, which were beyond the scope of this study. The aim of the field experiment was functional: to test whether limonene, as a single bat-associated volatile, could alter cricket calling behavior under semi-natural conditions, not to establish the concentration threshold for this effect. Therefore, we did not design experiments to determine the exact concentration at which crickets begin to respond. The positive result supports the ecological relevance of limonene as an avoidance cue, but we do not claim that the applied concentration matches natural levels. We will clarify this point in the revised Methods and Discussion sections and acknowledge that quantitative characterization of natural bat-odor compositions and their diffusion dynamics is an important direction for future research.

      On Figures 1C and 1D comparing bat body odor with air control rather than with snout secretions

      We thank the reviewer for this suggestion. The comparison between bat body odor and snout secretions is indeed novel and informative, and we agree that it could help identify anatomical sources of active volatiles. However, the purpose of Figures 1C and 1D in the current manuscript is to answer a more fundamental question: whether bat body odor (as a whole) contains volatile components that elicit antennal responses in crickets, compared to an odor-free control. This establishes the basic phenomenon of olfactory detection. The identification of snout secretions as the primary source of body odor is addressed separately in Figure 2, using HS-SPME-GC-MS and PCA. In the revised manuscript, we will clarify this rationale in the Methods and Results sections to avoid confusion. We also note that the reviewer’s idea, directly comparing GC-EAD responses to snout secretions versus whole-body odor, is an excellent suggestion for future experiments and would further strengthen the source attribution.

      Revisions we will make:

      (1) In the Introduction and Discussion, we will adjust the wording to more accurately reflect the current state of knowledge on olfactory eavesdropping across the vertebrate-invertebrate divide, acknowledging prior work while clarifying the added value of our study.

      (2) In the Methods and Discussion, we will clarify the rationale for our concentration choices in the EAG and field experiments, acknowledging that our aim was functional (testing sufficiency) rather than determining quantitative thresholds.

      (3) In the Methods and Results, we will clarify the rationale for comparing bat body odor with air controls in Figures 1C and 1D, and note that the reviewer’s suggestion of comparing with snout secretions is an excellent direction for future work.

      We thank Reviewer #2 again for the thoughtful comments, which have helped us improve the manuscript.

    1. eLife Assessment

      This important study demonstrates that extrachromosomal circular DNA and chromatin-associated proteins are components of stress granules. The data from a range of cellular and microscopy approaches are convincing, but the main conclusions would be further strengthened by demonstrating functional relevance and by extending the analysis to additional cell types. This paper will be of broad interest to cell biologists and those studying stress granule formation.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Demeshkina and Ferré-D'Amaré showed that extrachromosomal circular DNA (eccDNA) and chromatin-associated proteins are present in stress granules, based on proteomic and sequencing analyses. Using HCR-FISH combined with imaging, the authors showed the colocalization of eccDNA with stress granule proteins. Furthermore, they found that CRISPR machinery targeting the eccDNA component of stress granules disrupts stress granule assembly, and that this effect is largely independent of Cas9 endonuclease activity. Notably, expression of cytoplasmic chromatin factors restores stress granule formation in the presence of CRISPR machinery in yeast. This also rescues the growth defect caused by hypoxic stress, which correlates with impaired stress granule formation. Together, this manuscript provides insight into the presence of eccDNA in cytoplasmic membraneless organelles, specifically stress granules, and suggests a functional role for eccDNA within these structures under stress conditions.

      Strengths:

      The authors used a panel of nucleases to demonstrate that stress granule cores isolated from yeast and HEK293 cells are resistant to plasmid-safe DNase, an enzyme that does not degrade circular double-stranded DNA. To further support the presence of extrachromosomal circular DNA (eccDNA) in stress granules, they performed Circle-Seq on stress granule cores. The gel electrophoresis and sequencing experiments complement each other well, providing consistent evidence for eccDNA within these granules. Overall, this study provides insight into potential cytoplasmic roles for eccDNA, an area that remains largely unexplored.

      Weaknesses:

      (1) Figure 1F suggests that stress granule cores are susceptible to DNase I but not to plasmid-safe DNase (psDNase). However, the smearing pattern in the psDNase condition appears similar to that in the DNase I treatment shown in Figure 1E, although psDNase produces more discrete bands. The authors should comment on these differences between Figures 1E and 1F, or consider revising Figure 1F to improve consistency with Figures 1E and 1D.

      (2) The authors should clearly define "colocalization". Does it refer to complete spatial overlap between two signals (e.g., VCP and T30), or partial overlap (e.g., AHNAK DNA and G3BP)? Figure 3 and the associated text are descriptive. Quantitative analysis would strengthen the conclusions. For example, the authors could analyze the fraction of molecules localized to stress granules or provide Pearson's correlation coefficient or similar measurements.

      (3) The authors used a CRISPR-based approach to target the Ty1 LTR retrotransposon, an abundant stress granule eccDNA, and they observed a loss of stress granule formation. However, this phenotype may be specific to Ty1 eccDNA rather than representative of all eccDNA species present in granules. In particular, the title "Cytoplasmic circular DNA is a key constituent of stress granules" implies a broader role. To support this claim, the authors should consider approaches that more globally deplete eccDNA rather than targeting a single eccDNA.

      (4) The authors should provide additional experimental evidence to support the claim that eccDNA is packaged in a chromatin-like state. The rescue of stress granule formation by ectopic expression of modified chromatin-associated proteins (CHD1NES and GCN5NES) following CRISPR treatment does not necessarily demonstrate that eccDNA is packaged like chromatin under basal conditions.

    3. Reviewer #2 (Public review):

      Summary:

      The authors report the presence of extrachromosomal circular DNAs (eccDNAs) within the core of stress granules purified from both yeast and mammalian cells.

      Strengths:

      This study is important for understanding the molecular mechanisms underlying stress granules containing eccDNAs and is likely to have a major impact on future research. A major strength of the study is the extensive experimental validation performed in yeast cells. In particular, cytoplasmic CRISPR-mediated targeting of eccDNAs suppresses stress granule formation and impairs recovery from hypoxic stress in yeast cells.

      Weaknesses:

      The conclusions would be further strengthened by validating the functional findings in an additional model system, such as mammalian cells.

      Comments:

      (1) Section: "Stress granule cores contain eccDNA"

      a) The presence of eccDNAs would be more convincingly demonstrated using an orthogonal validation approach, such as DNA FISH targeting MYC and Centromere 8 (CEN8) on metaphase spreads from HEK293T cells (as performed in PMID: 34819668).

      b) The study would also benefit from assessing the presence of eccDNAs in the extracellular medium. For example, DNA could be extracted from conditioned media and analyzed by PCR using primers spanning eccDNA breakpoint junctions (as performed in PMID: 40074906; PMID: 36123406).

      (2) Section: "eccDNA-CRISPR abrogates stress granules"

      These findings should be further validated under additional stress conditions, such as drug-induced stress (like methotrexate) or nutrient deprivation in the cell medium. In addition, the same set of experiments should be performed in HEK293T cells to support the broader relevance of the observations.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Demeshkina and Ferré-D'Amaré showed that extrachromosomal circular DNA (eccDNA) and chromatin-associated proteins are present in stress granules, based on proteomic and sequencing analyses. Using HCR-FISH combined with imaging, the authors showed the colocalization of eccDNA with stress granule proteins. Furthermore, they found that CRISPR machinery targeting the eccDNA component of stress granules disrupts stress granule assembly, and that this effect is largely independent of Cas9 endonuclease activity. Notably, expression of cytoplasmic chromatin factors restores stress granule formation in the presence of CRISPR machinery in yeasts. This also rescues the growth defect caused by hypoxic stress, which correlates with impaired stress granule formation. Together, this manuscript provides insight into the presence of eccDNA in cytoplasmic membraneless organelles, specifically stress granules, and suggests a functional role for eccDNA within these structures under stress conditions.

      Strengths:

      The authors used a panel of ribonucleases to demonstrate that stress granule cores isolated from yeast and HEK293 cells are resistant to plasmid-safe DNase, an enzyme that does not degrade circular double-stranded DNA. To further support the presence of extrachromosomal circular DNA (eccDNA) in stress granules, they performed Circle-Seq on stress granule cores. The gel electrophoresis and sequencing experiments complement each other well, providing consistent evidence for eccDNA within these granules. Overall, this study provides insight into potential cytoplasmic roles for eccDNA, an area that remains largely unexplored.

      Weaknesses:

      (1) Figure 1F suggests that stress granule cores are susceptible to DNase I but not to plasmid-safe DNase (psDNase). However, its smearing pattern in the psDNase condition appears similar to that in the DNase I treatment shown in Figure 1E, although psDNase produces more discrete bands. The authors should comment on these differences between Figures 1E and 1F, or consider revising Figure 1F to improve consistency with Figures 1E and 1D.

      We suggest that the appropriate comparisons are between the DNase I and psDNase treatments within each figure panel, and not between panels (e.g., Figures 1E vs. 1F). The electrophoretic gels in the different panels were run for different lengths of time, and therefore the comparison between gels would be spurious. In Figure 1E, electrophoresis after DNase I treatment results in a characteristic smear, while electrophoresis after psDNase treatment yields discrete bands (lanes 2–3 vs. 4–5). Electrophoretic conditions for this figure were optimized to minimize diffusion and allow quantitative evaluation. The electrophoresis shown in Figure 1F, which compares yeast and mammalian stress granule core nucleic acids, was run for a longer period (as evidenced by the greater migration distance from the loading wells), yet still clearly shows the same qualitative difference between DNase I (smear, lane 3) and psDNase (discrete bands, lanes 1–2) treatments for the yeast samples. The apparent discrepancy noted by the referee therefore simply reflects the difference in electrophoretic conditions between the gels shown in the two separate figure panels.

      (2) The authors should clearly define "colocalization". Does it refer to complete spatial overlap between two signals (i.e., VCP and T30), or partial overlap (i.e., AHNAK DNA and G3BP)? Figure 3 and the associated text are descriptive. Quantitative analysis would strengthen the conclusions. For example, the authors could analyze the fraction of molecules localized to stress granules or provide Pearson's correlation coefficient or similar measurements.

      In our considered opinion, categorizing colocalization as either "partial" or "complete" implies a level of molecular precision that is physically unattainable at the resolution limits of any current light microscopy modality, and would therefore be misleading. Our approach employs super-resolution confocal laser scanning microscopy (Airyscan) with hybridization chain reaction fluorescence in situ hybridization (HCR-FISH) or with immunofluorescence. The detection method used offers higher spatial resolution and signal-to-noise ratio than single-point detector/physical pinhole confocal (or widefield epifluorescence) microscopy used in most prior stress granule studies. Despite these enhancements, the system retains inherent diffraction-imposed limits: a lateral (XY) resolution of ~130 nm and an axial (Z) resolution of ~350–400 nm, defining the minimum separable distance between two fluorescent signals. Structures smaller than these thresholds remain unresolved within a single point spread function (PSF) maximum – a volume sufficiently large to simultaneously accommodate multiple stress granule cores or tens of thousands of individual proteins (such as G3BP) and dozens of nucleic acid molecules several thousand nucleotides in length. Consequently, any detected fluorescence signal may represent the superimposition of a large and indeterminate number of individual molecules or particles. True molecular interaction analysis remains for future studies using technologies with angstrom resolution (e.g., cryo-electron tomography, cryo-EM, X-ray crystallography, smFRET, EPR, NMR, etc.). Metrics such as Pearson's correlation coefficient report solely on the degree of signal overlap at the PSF scale (hundreds of nanometers) and would not provide any insight beyond what is already conveyed by our data.
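
      For reference, the Pearson's correlation coefficient mentioned above is computed over paired pixel intensities from two imaging channels. The following is a minimal sketch of that calculation, not the authors' analysis; the function name and the intensity values are hypothetical and are not data from the study.

```python
import math


def pearson(channel_a, channel_b):
    """Pearson's correlation coefficient between two equal-length
    lists of pixel intensities (one list per imaging channel)."""
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (>= 2 pixels)")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical intensities for the same five pixels in two channels.
green = [10, 40, 80, 35, 12]
magenta = [12, 38, 75, 40, 15]
r = pearson(green, magenta)
```

      Each value entering the calculation is the intensity of a whole pixel, so the coefficient reports on overlap at the scale set by the optics, which is the limitation described in the response above.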

      (3) The authors used a CRISPR-based approach to target the Ty1 LTR retrotransposon, an abundant stress granule eccDNA, and they observed a loss of stress granule formation. However, this phenotype may be specific to Ty1 eccDNA rather than representative of all eccDNA species present in granules. In particular, the title "Cytoplasmic circular DNA is a key constituent of stress granules" implies a broader role. To support this claim, the authors should consider approaches that more globally deplete eccDNA rather than targeting a single eccDNA.

      We respectfully disagree with the referee that further depletion of eccDNA would alter our conclusions. A central finding of our study is that stress granules can be abrogated cytoplasmically by co-expressing a Cas9 endonuclease, active or inactivated by point mutations (D10A/H840A), and a gRNA (which is itself a fusion of the crRNA and tracrRNA, natively separate RNAs in the source bacterium). We show in Figure 4 that when the gRNA targets the Ty1 sequences, endonucleolytically active holoenzyme co-expression in the cytoplasm results in loss of the corresponding eccDNAs, as assayed by sequencing of the relevant cytoplasmic fractions. Critically, when a catalytically inactive Cas9 protein (dCas9) is co-expressed with the gRNA instead of the wild-type endonuclease, depletion of the eccDNAs containing Ty1 sequences no longer takes place (Figures 4D and 4E), but stress granule formation is still abrogated (Figure 4C).

      In our manuscript, we indicated (as "data not shown") that co-expression with Cas9 of a gRNA "targeting" a sequence that is absent from the S. cerevisiae genome still results in abrogation of stress granule formation. These data are shown in Author response image 1. The gRNA is targeted to the sequence 5’-agaatcgatgcattt, which is absent in the genome of the yeast strain used.

      Author response image 1.

      It follows from our experiments that stress granule abrogation (1) is not a result of the catalytically active Cas9 endonuclease; (2) is not a result of the presence of a gRNA-directed but catalytically inactive Cas9 holoenzyme, but (3) is the result of the presence of a CRISPR holoenzyme (as defined in Author response image 1) in the cytoplasm.

      To reiterate, abrogation of stress granules occurs when a Cas9-gRNA complex is present in the cytoplasm, regardless of whether nuclease activity is present or whether the gRNA targets a sequence that is present in the genome. Importantly, the holoenzyme is required for this phenomenon: presence of the endonuclease or the gRNA alone does not abrogate stress granule formation (Figure S5).

      It is because of this unexpected observation that we next hypothesized that activities of the Cas9-gRNA complex other than sequence-specific gRNA-targeted endonucleolytic activity are driving the suppression of stress granule formation. The best documented such activity is DNA sequence sampling (1-dimensional diffusion). We think that 1-dimensional diffusion of the Cas9-gRNA holoenzyme displaces from the cytoplasmic eccDNA the interactors whose association with the DNA is required to drive stress granule assembly. The fact that the stress-granule suppressive effect of cytoplasmic Cas9-gRNA expression can itself be suppressed by two completely unrelated proteins whose only shared feature is action on chromatin (CHD1 and GCN5) strongly supports this hypothesis (Figures 4G, 4H and S6; also response to point 4, below), in addition to confirming that cytoplasmic eccDNA is packaged by histones in a conformation that CHD1 and GCN5 can both recognize.

      (4) The authors should provide additional experimental evidence to support the claim that eccDNA is packaged in a chromatin-like state. The rescue of stress granule formation by ectopic expression of modified chromatin-associated proteins (CHD1NES and GCN5NES) following CRISPR treatment does not necessarily demonstrate that eccDNA is packaged like chromatin under basal conditions.

      We would like to reiterate the temporal order in our experimental design (detailed in full in Methods and summarized in Results). Cas9<sub>NES</sub>-gRNA and CHD1<sub>NES</sub> (or GCN5<sub>NES</sub>) were expressed simultaneously (not sequentially) in the cytoplasm. This was intentional, so as to give each player ample opportunity to engage its preferred substrate under non-stress conditions, prior to the brief oxidative stress. The referee appears to believe that cytoplasmic eccDNA was pre-exposed to Cas9<sub>NES</sub>-gRNA, and then the bound endonuclease challenged with chromatin-modifying enzymes.

      Our experimental design accounts for the contrasting substrate specificities of CRISPR and chromatin-modifying enzymes. Cas9-gRNA (holoenzyme) binds to nucleosome-free DNA with sub-nanomolar dissociation constant (Kd 0.1–1 nM) but its association with chromatinized DNA is impeded 5- to 100-fold (Isaac et al., 2016; Yarrington et al., 2018; Strohkendl et al., 2021). In contrast, CHD1 binding to DNA is strictly nucleosome-dependent: its chromodomains actively block engagement with protein-free DNA (Hauk et al., 2010), and its productive binding (Kd 10–200 nM) relies on obligate multivalent contacts with the histone octamer, H4 tail, and wrapped DNA (Farnung et al., 2017; Sundaramoorthy et al., 2018).

      Our observation that stress granule formation was unperturbed following oxidative stress is most parsimoniously interpreted as CHD1<sub>NES</sub> outcompeting the CRISPR machinery for cytoplasmic binding to eccDNA by virtue of the latter existing in a histone-bound state that is recognized as chromatin by CHD1, simultaneously favoring CHD1<sub>NES</sub> engagement and impeding Cas9 access. Thus, our experiment in effect employs stress granule formation as a readout for differential binding to chromatin or chromatin-like eccDNA.

      Farnung, L., Vos, S.M., Wigge, C., and Cramer, P. (2017). Nucleosome-Chd1 structure and implications for chromatin remodelling. Nature, 550(7677), 539–542.

      Hauk, G., McKnight, J.N., Nodelman, I.M., and Bharat, T.A.M. (2010). The chromodomains of the Chd1 chromatin remodeler regulate DNA access to the ATPase motor. Mol Cell, 39(5), 711–723.

      Isaac, R.S., Jiang, F., Doudna, J.A., Lim, W.A., Narlikar, G.J., and Bhatt, D.L. (2016). Nucleosome breathing and remodeling constrain CRISPR-Cas9 function. Nature Struct Mol Biol, 23(12), 1097–1103.

      Strohkendl, I., Saifuddin, F.A., Gibson, B.A., Bhatt, D.L., Russell, R., and Bharat, T.A.M. (2021). Inhibition of CRISPR-Cas9 by bacteriophage-encoded proteins. Mol Cell, 81(8), 1665–1679.

      Sundaramoorthy, R., Hughes, A.L., Singh, V., Wiechens, N., Ryan, D.P., El-Mkami, H., Petoukhov, M., Svergun, D.I., Treutlein, B., Sproll, P., and Owen-Hughes, T. (2018). Structural reorganization of the chromatin remodeling enzyme Chd1 upon engagement with nucleosomes. eLife, 7, e35720.

      Yarrington, R.M., Verma, S., Schwartz, S., Trautman, J.K., and Carroll, D. (2018). Nucleosomes inhibit target cleavage by CRISPR-Cas9 in vivo. PNAS, 115(38), 9450–9455.

      Reviewer #2 (Public review):

      Summary:

      The authors report the presence of extrachromosomal circular DNAs (eccDNAs) within the core of stress granules purified from both yeast and mammalian cells.

      Strengths:

      This study is important for understanding the molecular mechanisms underlying stress granules containing eccDNAs and is likely to have a major impact on future research. A major strength of the study is the extensive experimental validation performed in yeast cells. In particular, cytoplasmic CRISPR-mediated targeting of eccDNAs suppresses stress granule formation and impairs recovery from hypoxic stress in yeast cells.

      Weaknesses:

      The conclusions would be further strengthened by validating the functional findings in an additional model system, such as mammalian cells.

      Comments:

      (1) Section: "Stress granule cores contain eccDNA"

      (a) The presence of eccDNAs would be more convincingly demonstrated using an orthogonal validation approach, such as DNA FISH targeting MYC and Centromere 8 (CEN8) on metaphase spreads from HEK293T cells (as performed in PMID: 34819668).

      The relationship between eccDNA dynamics and stress granule assembly across distinct cell cycle phases remains an important and poorly explored question. To our knowledge, no published data currently describe how stress response mechanisms are regulated during mitotic division, particularly in metaphase. Our identification of eccDNA as a component of stress granule cores can provide a first tractable framework to investigate this relationship. However, a systematic and in-depth characterization of this phenomenon warrants a dedicated future investigation.

      (b) The study would also benefit from assessing the presence of eccDNAs in the extracellular medium. For example, DNA could be extracted from conditioned media and analyzed by PCR using primers spanning eccDNA breakpoint junctions (as performed in PMID: 40074906; PMID: 36123406).

      We agree with the referee that eccDNA biology represents a fascinating and rapidly evolving area of research, particularly given the emerging role of eccDNA in oncogenesis. In this context, our identification of eccDNA as a core structural component of stress granules opens a novel avenue for exploring the connection between stress-dependent translational regulation and disease-associated eccDNA dynamics. While we acknowledge the importance of this direction, a rigorous investigation of this relationship requires extensive multifaceted experimentation that falls beyond the scope of the current study.

      (2) Section: "eccDNA-CRISPR abrogates stress granules"

      These findings should be further validated under additional stress conditions, such as drug-induced stress (like methotrexate) or nutrient deprivation in the cell medium. In addition, the same set of experiments should be performed in HEK293T cells to support the broader relevance of the observations.

      We agree with the referee that characterizing the composition and dynamics of stress granules arising from different stressors is an important endeavor. However, given the range of stressors documented to result in stress granule formation, those studies fall well beyond the scope of this manuscript. We note, however, that the presence of eccDNA in stress granules of yeast and human cells is strong evidence for conservation of function(s). We think that exploration of the role of eccDNA in stress granule formation across the kingdoms of life (stress granules were first observed in heat-shocked tomato plants), cell cycle stages, stressors, etc. will be important research programs for the future.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Figures 3D and 3I: The use of magenta and red makes it difficult to distinguish between the two labeled signals. Consider using more contrasting colors to improve visual clarity.

      We appreciate the comment regarding color choices in the figures. In our view, magenta and red are sufficiently distinguishable as nucleic acid labels, particularly when combined with the green signal representing G3BP in these panels.

      (2) Figures 3F and 3G: Do the authors have an explanation for why AHNAK or MAPT DNA (white) does not colocalize with the anti-DNA immunofluorescence signal?

      Immunofluorescence (IF) is standard for detecting protein antigens but has limitations when the target is a non-protein molecule such as DNA, owing to its compacted chromatinized state. Anti-DNA antibodies can miss a significant fraction of their targets because the DNA backbone remains largely inaccessible, a limitation that DNA-FISH overcomes by directly hybridizing probes to denatured DNA sequences with high specificity. The fixation step required for both IF and FISH imaging can introduce additional steric barriers that disproportionately restrict antibody access compared to small nucleic acid probes. Even under optimized conditions, the IF signal with anti-DNA antibodies is inherently reflective of a subset of the total cellular DNA content.

      (3) Adding a subtitle on page 12 ("The abundant histones in purified stress granule...") would improve the overall structure and readability of the manuscript.

      We think that an additional subtitle would not substantially improve the readability of what is, admittedly, a very dense manuscript that employs a diversity of experimental approaches.

      (4) It would strengthen the analysis if statistical significance were included for the different time points in Figure 5C.

      We appreciate the reviewer’s suggestion. Figure 5C shows the largest difference at 40–45 hours after stress recovery, which is statistically significant between Cas9NES-gRNA (or dCas9NES-gRNA) and Cas9NES or gRNA only (two-tailed Student’s t-test, *, p ≤ 0.05). All primary experimental data are publicly available (FigShare) so further analyses can be performed by interested future parties.

    1. eLife Assessment

      The authors use convincing methodology to investigate the detachment and reattachment kinetics of kinesin-1, 2 and 3 motors against loads oriented parallel to the microtubule. The conclusions drawn from the valuable experiments as well as the overall interpretation of the results are fully supported by the presented data.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      Noell et al have presented a careful study of the dissociation kinetics of Kinesin (1,2,3) classes of motors moving in vitro on a microtubule. These motors move against the opposing force from a ~1 micron DNA strand (DNA tensiometer) that is tethered to the microtubule and also bound to the motor via specific linkages (Fig 1A). Authors compare the time for which motors remain attached to the microtubule when they are tethered to the DNA, versus when they are not. If the former is longer, the interpretation is that the force on the motor from the stretched DNA (presumed to be working solely along the length of the microtubule) causes the motor's detachment rate from the microtubule to be reduced. Thus, the specific motor exhibits "catch-bond" like behaviour.

      Strengths:

      The motivation is good - to understand how kinesin competes against dynein through the possible activation of a catch bond. Experiments are well done and there is an effort to model the results theoretically.

      Weaknesses from original round of review:

      The motivation of these studies is to understand how kinesin (1/2/3) motors would behave when they are pitted in a tug of war against dynein motors as they transport cargo in bidirectional manner on microtubules. Earlier work on dynein and kinesin motors using optical tweezers has suggested that dynein shows catch bond phenomenon, whereas such signatures were not seen for kinesin. Based on their data with DNA tensiometer, the authors would like to claim that (i) Kinesin1 and kinesin2 also show catch-bonding and (ii) The earlier results using optical traps suffer from vertical forces, which complicates the catch-bond interpretation.

    3. Reviewer #2 (Public review):

      Summary:

      To investigate the detachment and reattachment kinetics of kinesin-1, 2 and 3 motors against loads oriented parallel to the microtubule, the authors used a DNA tensiometer approach comprising a DNA entropic spring attached to the microtubule on one end and a motor on the other. They found that for kinesin-1 and kinesin-2 the dissociation rates at stall were smaller than the detachment rates during unloaded runs. With regard to the complex reattachment kinetics found in the experiments, the authors argue that these findings were consistent with a weakly-bound 'slip' state preceding motor dissociation from the microtubule. The behavior of kinesin-3 was different and (by the definition of the authors) only showed prolonged "detachment" rates when disregarding some of the slip events. The authors performed stochastic simulations which recapitulate the load-dependent detachment and reattachment kinetics for all three motors. They argue that the presented results provide insight into how kinesin-1, -2 and -3 families transport cargo in complex cellular geometries and compete against dynein during bidirectional transport.

      Strengths:

      The present study is timely, as significant concerns have been raised previously about studying motor kinetics in optical (single-bead) traps where significant vertical forces are present. Moreover, the obtained data are of high quality and the experimental procedures are clearly described.

    4. Reviewer #3 (Public review):

      Summary:

      Several recent findings indicate that forces perpendicular to the microtubule accelerate kinesin unbinding, where perpendicular and axial forces were analyzed using the geometry in a single-bead optical trapping assay (Khataee and Howard, 2019), comparison between single-bead and dumbbell assay measurements (Pyrpassopoulos et al., 2020), and comparison of single-bead optical trap measurements with and without a DNA tether (Hensley and Yildiz, 2025).

      Here, the authors devise an assay to exert forces along the microtubule axis by tethering kinesin to the microtubule via a dsDNA tether. They compared the behavior of kinesin-1, -2, and -3 when pulling against the DNA tether. In line with previous optical trapping measurements, kinesin unbinding is less sensitive to forces when the forces are aligned with the microtubule axis. Surprisingly, the authors find that both kinesin-1 and -2 detach from the microtubule more slowly when stalled against the DNA tether than in unloaded conditions, indicating that these motors act as catch bonds in response to axial loads. Axial loads accelerate kinesin-3 detachment. However, kinesin-3 reattaches quickly to maintain forces. For all three kinesins, the authors observe weakly-attached states where the motor briefly slips along the microtubule before continuing a processive run.

      Strengths:

      These observations suggest that the conventional view that kinesins act as slip bonds under load, as concluded from single-bead optical trapping measurements where perpendicular loads are present due to the force being exerted on the centroid of a large (relative to the kinesin) bead, needs to be reconsidered. Understanding the effect of force on the association kinetics of kinesin has important implications for intracellular transport, where the force-dependent detachment governs how kinesins interact with other kinesins and opposing dynein motors (Muller et al., 2008; Kunwar et al., 2011; Ohashi et al., 2018; Gicking et al., 2022) on vesicular cargoes.

    5. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      I am not fully convinced about the responses from authors, so I would like to retain my original assessment of the paper. The same may be made available for public viewing, along with the responses of the authors. Readers can go through both and form their opinion.

      Unfortunately, this response from Reviewer 1 impacted the Assessment Statement but did not provide specific points for us to address. In the first round, the concerns of Reviewer 1 were: 1) the validity of the WLC prediction; 2) the claim that catch-bond measurements are generally made with superstall loads; 3) the role of vertical forces for dynein and a question about the orientation of the forces for kinesin; and 4) a request that we repeat the study using dynein. In rereading our responses to points 2-4 following our first revision, we felt that there were no unresolved issues around those points that affect our conclusions in any way. However, for point 1 regarding the validity of the WLC prediction, we had responded only in the reviewer response letter, and both reviewer 2 and the editors felt that there were points that we had addressed in the response letter that should be incorporated into the revised manuscript. Therefore, to clarify Reviewer 1’s question, we revised the text to address why we were justified to approximate the dsDNA force-extension curve using a WLC model with a 50 nm persistence length and why the precise shape of the force-extension curve has no impact on our conclusions.

      Reviewer #2 (Public review):

      The authors extensively entered into a scientific debate with the reviewers in their Response Letter. This led to a few changes and some (limited) new data in the manuscript. This is great and did improve the manuscript.

      However, in the view of this reviewer, (i) a significant number of responses fall short of actually addressing the concerns of the three reviewers (e.g. wrt using the same kinesin-1 neck-coil domains for all motors) and/or (ii) a significant number of arguments now only occur in the response letter but not in the manuscript. The authors may check themselves critically for both. In principle, each longer discussion in the response letter warrants mentioning the appropriate facts and arguments in the main text of the manuscript.

      Based on this feedback, the first change we made was to rewrite the section justifying our choice of using a common coiled-coil dimerization domain for the three motors. Secondly, we went through our responses to all three reviewers to identify any instances where we either didn’t fully address the reviewer concerns or we provided arguments in the response letter but did not add corresponding text in the manuscript.

      Reviewer #3 (Public review):

      The authors attribute the differences in the behaviour of kinesins when pulling against a DNA tether compared to an optical trap to the differences in the perpendicular forces. However, the compliance is also much different in these two experiments. The optical trap acts like a ~ linear spring with stiffness ~ 0.05 pN/nm. The dsDNA tether is an entropic spring, with negligible stiffness at low extensions and very high compliance once the tether is extended to its contour length (Fig. 1B). The effect of the compliance on the results is not fully considered in the manuscript.

      In our first revision we added a paragraph in the ‘Geometry Calculations’ section of the Supplementary Methods addressing the dsDNA stiffness and comparing it to an optical trap. We considered moving this paragraph to the main text but decided against it because we felt it interrupted the flow of the Discussion. Instead, we expanded and clarified this paragraph to more specifically address the stiffness question. The paragraph with revised text now reads as follows:

      “Another consideration when comparing the DNA tensiometer to optical trap measurements is the relative stiffness of the trap and dsDNA. Optical trap stiffnesses are generally in the range of 0.05 pN/nm [13,14]. To calculate the predicted stiffness of the dsDNA spring, we computed the slope of the theoretical force-extension curve in Fig. 1B. The stiffness is highly nonlinear and is <0.001 pN/nm below 650 nm extension. We compare motor performance under this low stiffness regime to the unloaded case in Fig. 3. In contrast, at the predicted stall force of 6 pN (960 nm extension), the dsDNA stiffness is ~0.2 pN/nm, which is stiffer than most optical traps, but it is similar to the estimated 0.3 pN/nm stiffness of kinesin motors themselves [13,14]. An 8 nm step at the 0.2 pN/nm stiffness of the dsDNA leads to a 1.6 pN jump in force and at the 0.05 pN/nm stiffness of an optical trap leads to a 0.4 pN jump in force; this is important because it means that in both cases the motors are likely dynamically stepping at stall. Because both experimental approaches allow for dynamic stepping at stall and because the stiffnesses of the instrument in both cases are less than the motor stiffness, there is no reason to expect that differences in stiffness between optical traps and the dsDNA spring lead to different motor detachment kinetics.”
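
      The stiffness figures quoted in this paragraph can be reproduced approximately with a short calculation. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the Marko-Siggia interpolation form of the WLC model, a thermal energy of 4.1 pN·nm, and a hypothetical contour length of 1020 nm chosen so that the curve gives roughly 6 pN at 960 nm extension. Only the 50 nm persistence length, the 8 nm step, and the quoted stiffness values come from the text.

```python
KT_PN_NM = 4.1          # thermal energy near room temperature, pN*nm (assumed)
PERSISTENCE_NM = 50.0   # dsDNA persistence length quoted in the response
CONTOUR_NM = 1020.0     # hypothetical contour length (assumption, see lead-in)


def wlc_force(extension_nm):
    """Force (pN) from the Marko-Siggia WLC interpolation formula."""
    z = extension_nm / CONTOUR_NM
    return (KT_PN_NM / PERSISTENCE_NM) * (0.25 / (1.0 - z) ** 2 - 0.25 + z)


def wlc_stiffness(extension_nm):
    """Analytical slope dF/dx (pN/nm) of wlc_force at the given extension."""
    z = extension_nm / CONTOUR_NM
    prefactor = KT_PN_NM / (PERSISTENCE_NM * CONTOUR_NM)
    return prefactor * (0.5 / (1.0 - z) ** 3 + 1.0)


def force_jump(stiffness_pn_per_nm, step_nm=8.0):
    """Force change (pN) from one motor step against a locally linear spring."""
    return stiffness_pn_per_nm * step_nm


# Compare the tether at low extension, the tether near stall, and a typical trap.
low_extension_stiffness = wlc_stiffness(650.0)
stall_stiffness = wlc_stiffness(960.0)
tether_jump = force_jump(0.2)    # quoted dsDNA stiffness at stall
trap_jump = force_jump(0.05)     # quoted optical trap stiffness
```

      Under these assumptions the slope stays below 0.001 pN/nm at 650 nm and is close to 0.2 pN/nm at 960 nm, and a single 8 nm step corresponds to about 1.6 pN against the tether versus 0.4 pN against the trap, in line with the values quoted above.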

      In the main text, we now address this compliance point in the ‘Comparison to previous work’ section of the Discussion:

      “stiffness differences are an unlikely explanation because at stall the stiffness of the DNA tether (~4-fold stiffer than an optical tweezer) is still sufficiently low to allow for dynamic motor stepping at stall, and in any case it is still below the estimated motor stiffness (see Geometry Calculations in Supplementary methods).”

      There were two points the reviewer felt we had sufficiently addressed. They were presented in the second review as a reiteration of the first review comments with a sentence appended, and are reproduced here. We added no new text based on these two points:

      In the single-molecule extension traces (Fig. 1F-H; S3), the kinesin-2 traces often show jumps in position at the beginning of runs (e.g. the four runs from ~4-13 s in Fig. 1G). These jumps are not apparent in the kinesin-1 and -3 traces. What is the explanation? Is kinesin-2 binding accelerated by resisting loads more strongly than kinesin-1 and -3? In their response, the authors provide an explanation of the appearance of jumps due to limited imaging speeds. The authors state that the qualitative difference in the kinesin-2 traces compared to the kinesin-1 and -3 traces may be due to the specific rebinding kinetics of kinesin-2.

      When comparing the durations of unloaded and stall events (Fig. 2), there is a potential for bias in the measurement, where very long unloaded runs cannot be observed due to the limited length of the microtubule (Thompson, Hoeprich, and Berger, 2013), while the duration of tethered runs is only limited by photobleaching. Was the possible censoring of the results addressed in the analysis? The authors addressed this concern by applying a Markov model to estimate the duration parameter.
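
      The censoring concern can be illustrated with a standard textbook estimator. This is not the Markov model the authors report using; it is a simpler stand-in, assuming exponentially distributed run durations and right-censoring (a run that ends because the motor reaches the microtubule end or the fluorophore bleaches is recorded as censored). The function name and the example numbers are hypothetical.

```python
# Sketch: maximum-likelihood mean duration for exponential data with
# right-censoring. ASSUMPTION: durations are exponentially distributed.
# This is a generic estimator, not the authors' Markov model.

def censored_exponential_mean(durations, observed):
    """Return total observed time divided by the number of true end events.

    durations: run durations (s), including censored runs
    observed:  True if the run ended by detachment, False if it was censored
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    events = sum(1 for flag in observed if flag)
    if events == 0:
        raise ValueError("at least one uncensored run is required")
    return sum(durations) / events


if __name__ == "__main__":
    runs = [1.0, 2.0, 3.0, 4.0]              # hypothetical durations (s)
    ended = [True, True, False, False]       # last two runs were censored
    print(sum(runs) / len(runs))             # naive mean ignores censoring
    print(censored_exponential_mean(runs, ended))
```

      In this toy example the naive mean is 2.5 s while the censoring-aware estimate is 5.0 s, which shows how ignoring truncated runs biases the duration downward.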

      There was one final point from Reviewer 3 in the first round of reviews that we had addressed in the reviewer response (and that the reviewer was satisfied with), but we did not incorporate into the manuscript. Based on the suggestion from Reviewer 2 and the editors that we incorporate more from our responses to reviewers into the manuscript, we added new text on this point. That point (with the new sentence in the second review underlined), our response from first revision, and our response for this second revision are given below:

      The mathematical model is helpful in interpreting the data. To assess how the "slip" state contributes to the association kinetics, it would be helpful to compare the proposed model with a similar model with no slip state. Could the slips be explained by fast reattachments from the detached state? In their response, the authors addressed this question by explaining that a three-state model is required to model the recovery time distributions.

      In the model, the slip state and the detached states are conceptually similar; they only differ in the sequence (slip to detached) and the transition rates into and out of them. The simple answer is: yes, the slips could be explained by fast reattachments from the detached state. In that case, the slip state and recovery could be called a “detached state with fast reattachment kinetics”. However, the key data for defining the kinetics of the slip and detached states is the distribution of Recovery times shown in Fig. 4D-F, which required a triple exponential to account for all of the data. If we simplified the model by eliminating the slip state and incorporating fast reattachment from a single detached state, then the distribution of Recovery times would be a single-exponential with a time constant equivalent to t<sub>1</sub>, which would be a poor fit to the experimental distributions in Fig. 4D-F.

      Reviewer 3 noted that they were satisfied with our explanation of this point. However, based on Reviewer 2’s suggestion that we incorporate more of our responses into the text of the manuscript, we added the following clarification point in the model section of the Results:

      “We note that recapitulating the tri-exponential restart time distribution in Figure 4D-F required this slip/detached formulation and that lumping all events into a single detached state resulted in a single-exponential distribution of recovery times.”
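
      The distinction between a single-exponential and a tri-exponential recovery time distribution can be sketched as follows. On a log scale, a single-exponential survival curve is a straight line, whereas a mixture of three exponentials bends as the faster components decay. The weights and time constants below are hypothetical and are not the fitted values from Fig. 4D-F.

```python
# Sketch: survival functions for single- and multi-exponential dwell times.
# ASSUMPTION: all weights and time constants are illustrative, not fitted.
import math


def single_exp_survival(t, tau):
    """Probability that recovery takes longer than t for one exponential."""
    return math.exp(-t / tau)


def multi_exp_survival(t, weights, taus):
    """Survival function for a weighted mixture of exponentials."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * math.exp(-t / tau) for w, tau in zip(weights, taus))


def log_slope(survival, t, dt=1e-4):
    """Finite-difference slope of log(survival) at time t."""
    return (math.log(survival(t + dt)) - math.log(survival(t))) / dt


if __name__ == "__main__":
    single = lambda t: single_exp_survival(t, 2.0)
    triple = lambda t: multi_exp_survival(t, (0.6, 0.3, 0.1), (0.1, 1.0, 10.0))
    for t in (0.01, 1.0, 20.0):
        print(t, log_slope(single, t), log_slope(triple, t))
```

      The constant log-slope of the single-exponential case is what a model with one detached state would predict; a slope that changes with time indicates that more than one kinetic component is needed.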

    1. eLife Assessment

      This study characterizes a potentially targetable mechanism by which phosphate scarcity drives polymyxin B resistance in Enterobacteriaceae. The findings are important. While some aspects of the approach are very strong, particularly the diversity of techniques, it is recommended to include genetic controls and antibiotic resistance experiments in order to strengthen the evidence, which is currently solid. The clarity and presentation of the findings could also be improved.

    2. Reviewer #1 (Public review):

      This manuscript by Zhang et al. addresses how Pi scarcity/depletion drives PMB resistance in Enterobacteriaceae and proposes a pathway that is mechanistically distinct from the better-known PhoBR-linked phospholipid-remodeling responses in other Gram-negatives. The authors also suggest an intervention strategy based on Mg repletion or Fe chelation. The results are substantial and include genetic analyses, mass spectrometry, reporter assays, phospho-signaling readouts, metal quantification, and comparative analyses across enterobacterial species.

      The paper reads well with the emphasis on the Mg loss followed by Fe mobilization during Pi depletion that induces PmrAB TCS activation for lipid A modification through transcriptional activation of ugd and arn genes. However, PmrAB is a well-known TCS responsible for PMB resistance through lipid A modification in the extensive studies by the Groisman lab. PmrA is a well-known transcriptional regulator to activate the transcription of the ugd gene in Salmonella and Yersinia by Mg depletion and Fe mobilization. Therefore, the current paper should focus more on the upstream signaling to connect the dots between Pi depletion and Mg loss. This is important because Ugd gene expression is not affected by PmrAB in Pi depletion. It should also be considered that Mg loss is temporally associated with Fe mobilization, but the manuscript does not quantitatively show that Mg dissociation/redistribution is sufficient to trigger Fe mobilization in the absence of Pi depletion, considering that Mg is a macronutrient, whereas Fe is a micronutrient.

      Second, the relationship between arn and ugd regulation needs a clearer mechanistic resolution to orchestrate the synthesis of the L-Ara4N during Pi depletion, because the manuscript shows that arn activation is PmrAB-dependent, whereas ugd is only partially PhoBR-dependent and not dependent on PmrAB. Yet the current model and narrative treat the system as a unified "ugd-arn" output. This should be carefully addressed, given that Pi depletion and Mg depletion might trigger different signaling modules.

      Third, the manuscript argues that this is a "conserved" circuit in Enterobacteriaceae. The evidence for conservation is presently strongest in E. coli MG1655 and includes supportive observations in E. coli O157, one UTI strain by lipid A MS, several UTI isolates by killing assay, and S. Typhimurium for key phenotypes. No direct mechanistic validation is shown in other important genera belonging to Enterobacteriaceae, which include Klebsiella, Enterobacter, Citrobacter, Yersinia, Serratia, or other clinically important Enterobacteriaceae.

      Fourth, the reversal and translational claims are a bit stronger than the current evidence supports. The title and Abstract state that identifying and targeting the circuit reverses Pi depletion-driven PMB resistance, and the manuscript suggests a pharmacological intervention framework based on Mg supplementation or Fe chelation. The actual intervention evidence is limited to in vitro killing assays under acute Pi-depleted minimal-medium conditions in E. coli and S. Typhimurium, without in vivo testing: the experiments are performed under an acute 3-hour starvation in MOPS medium, not in host-mimicking or infection-relevant environments. The reversal needs to be shown not only at the level of survival curves, but also by quantitative MIC/MBC measurements.

      More importantly, the authors demonstrated that the signaling module upon Pi limitation in Enterobacteria differs from that in other Gram-negative bacteria such as Pseudomonads. However, they did not discuss why this difference would impact the life of Enterobacteria. The authors should consider the glycolytic pathways (i.e., EMP pathway for enterobacteria vs ED pathway for pseudomonads), in that the ED pathway requires less Pi, whereas the EMP pathway requires more Pi. It should be noted that Pi supply is highly limited in the natural environment for the free-living bacteria, rather than in the host environment for the commensals.

    3. Reviewer #2 (Public review):

      Summary:

      Using E. coli K-12 as a model system, the authors investigated how phosphate (Pi) depletion induces polymyxin resistance in Enterobacteriaceae, which notably lack the canonical phospholipid remodeling pathways commonly associated with phosphate starvation responses. They demonstrated that low-phosphate conditions promote L-Ara4N modification of lipid A, thereby enhancing polymyxin resistance. Proteomic analyses revealed significant upregulation of the arn operon and ugd under phosphate-limited conditions, and promoter activity assays further confirmed that both promoters are strongly induced during Pi depletion. Through gene deletion experiments, the authors showed that arn expression is regulated by the PmrAB two-component system, whereas ugd is controlled by PhoBR under low-phosphate conditions. Using ICP-MS analysis, they further found that phosphate limitation increases cell-associated Fe levels, and that reducing Fe availability abolishes PmrAB-dependent activation of the arn operon. Finally, the study demonstrated that Mg supplementation and Fe chelation can suppress polymyxin resistance, highlighting the critical role of metal homeostasis in phosphate depletion-induced antimicrobial resistance.

      Strengths:

      Overall, I found this study to be well conducted, with convincing results that strongly support the proposed model. Through comprehensive genetic analyses and detailed characterization of metal ion homeostasis and membrane lipid modifications, the authors uncovered a novel regulatory connection among Mg²⁺, Fe³⁺, and the PmrAB pathway, a key driver of polymyxin resistance. These findings are highly interesting and have important implications for understanding the evolution of the Fe-sensing PmrAB system, as well as the broader role of nutrient availability in shaping antibiotic resistance.

      Weaknesses:

      I did not identify any particular weaknesses.

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript examines how phosphate limitation primes E. coli and Salmonella for defense against polymyxin antibiotics. Other environmental signals, such as altered levels of extracellular Mg or Fe, were previously shown to induce polymyxin resistance in Enterobacteriaceae, and phosphate limitation was known to augment polymyxin resistance in other organisms such as A. baumannii and P. aeruginosa; however, whether phosphate limitation boosted polymyxin resistance in Enterobacteriaceae was not known. This study shows that this indeed occurs, and the mechanism is distinct from that in A. baumannii and P. aeruginosa. The model proposed is: (1) low phosphate causes bacteria to jettison Mg to balance cellular P/Mg ratio, (2) extracellular Fe3+ associates with the cell envelope to replace Mg as LPS-bridging cation, and (3) envelope Fe3+ activates PmrAB, which mediates a transcriptional response leading to L-Ara4N modification of lipid A and protection from polymyxin B. Flooding with Mg or chelating the surface Fe3+ blocks the protective response to low phosphate in E. coli and Salmonella but not in P. aeruginosa despite Fe still mobilizing in the latter. The differential response between Enterobacteriaceae and P. aeruginosa is connected to the presence/absence of Fe-sensing motifs in the PmrB periplasmic domain.

      Strengths:

      The strengths of the study are the wide array of approaches used and the thorough characterization of a novel stress-response mechanism involving metal mobilization. Combined with the analysis of multiple bacterial families, the results clarify how different strategies have evolved to defend against polymyxins during phosphate starvation.

      Weaknesses:

      Controls are needed in some of the genetic experiments, namely complementation, to verify linkage of defective survival phenotypes to the genes mutated and to rule out protein stability defects for the PmrB variants tested. In addition, the generalizability of the metal mobilization feature of the model would be strengthened by examining media with differing metal composition. Claims about antibiotic resistance would be strengthened by data examining bacterial growth in the presence of an antibiotic.

    1. eLife Assessment

      This study used pupillometry to provide an objective assessment of a form of synesthesia in which people see additional color when reading numbers. It provides convincing evidence that subjective color ratings are matched by changes in pupil size that recapitulate brightness-mediated changes when exposed to the real color. The work provides a valuable contribution to the literature on both synesthetic perception and the use of pupillometry to probe perception and related psychological processes.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      Knowing that small pupil-size variations accompany brightness variations (even when these are illusory), the authors asked whether pupil constrictions would accompany the synesthetic perception of a brighter color (compared with a darker one), induced by the presentation of a black-white character. This grapheme-colour synesthesia is only experienced by few participants, sixteen of whom were enrolled in this study. The results reliably showed that a relative pupil constriction would "betray" the perception of a brighter color in these participants, while no such effect would be observed in control participants who were asked to report a color in association with each grapheme, even though they did not perceive any.

      Strengths:

      The main strength of the study lies in its combination of psychophysics (brightness ratings) and pupillometry, which allowed for showing clear-cut results.

      Impact:

      This work is likely to improve our understanding of synesthesia, providing a new tool to quantify the subjective sensations; an interesting potential extension would be using pupillometry for tracking changes over time of the synesthetic experiences, opening up the possibility to evaluate the importance of learning for this peculiar experience.

    3. Reviewer #2 (Public review):

      Synesthesia is a neurological condition where stimulation of one sensory channel leads to involuntary, automatic, and consistent experience of another, unrelated percept. For example, Sir Francis Galton (1880, Nature) famously described the robust tendency of some individuals (synesthetes) to associate numerals with a distinct color. Ever since, synesthesia keeps attracting a broad interest in the cognitive neurosciences in light of its implications for the study of domains such as perception, consciousness, and brain connectivity, among others.

      Strauch, Leenaars, and Rouw measured pupil size in a group of 16 grapheme-color synesthetes and two matched control groups. The participants were presented with gray digits - that is, visual stimuli having identical physical properties in terms of brightness. Each participant subsequently rated the corresponding evoked color and brightness: unlike controls, synesthetes did so in a very consistent and reliable fashion. Accordingly, this was also shown in their pupils: despite the same objective luminance, digits associated with brighter percepts caused their pupils to constrict and digits associated with darker percepts caused their pupils to dilate more than controls. These results highlight how crossmodal correspondences are deeply rooted in synesthetes, and put forward pupillometry as a particularly appealing biomarker for some phenomenological experience (at least those grounded in "brightness").

      Further strengths of the technique are its temporal resolution and its responsiveness to several constructs. Across several tasks, the authors show for example that responses to synesthetic light are somewhat slower than responses to real light (i.e., they are likely mediated), but at the same time faster than responses to mental imagery. The role of mental imagery can also be reasonably dismissed when considering the second feature of pupil size: its responsiveness to mental effort and cognitive load. The pupils tend to dilate with demanding, challenging tasks, and this was the case when control participants were asked to report the color of a digit for which they did not consistently experience a synesthetic association. The same task was, instead, seemingly effortless for synesthetes, again speaking in favor of the automaticity of number-color correspondences in their case.

      Overall, the findings by Strauch, Leenaars, and Rouw are highly significant for the field and likely to be impactful. The strength of their evidence, when accounting for the relatively small sample size and the inherent variability of both phenomenology (color perception and subjective reporting) and physiology (pupil size), is adequate and sufficiently convincing.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The pupil traces in Figure 3 (main results) are heavily pre-processed (per-participant demeaned), losing any feature besides the effect of interest. As I argued in my first review, I worry that this format gives unrealistic expectations about the effect (the perception of dark/bright colors does not generate a net dilation/constriction of the pupil; perception-related modulations of pupil size are always relative and generally small compared to the numerous other effects registered in pupil size; these include a pupil dilation that is more prominent in the controls and that gets analyzed later on in the manuscript; I do not think that eliminating one of the effects of interest from a main results figure helps the reader understand the results). In the revised manuscript, the authors addressed this concern by adding a Supplementary Figure 4, where a more complete representation of the results is shown (traces from individual trials are baseline corrected and averaged, resulting in more informative timecourses). I would strongly recommend that Supplementary Figure 4 is brought to the main text (Figure 3 could be presented in Supplementary).
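
      The two pre-processing choices contrasted in this comment can be sketched in a few lines. This is an illustration only: the source does not specify the authors' pipeline, and the function names, the number of baseline samples, and the example traces are hypothetical. Per-participant demeaning subtracts the participant's average response shape from every trial, so only differences between conditions remain; baseline correction subtracts each trial's own pre-stimulus level, so the net response shape is preserved.

```python
# Sketch: per-participant demeaning versus per-trial baseline correction.
# ASSUMPTION: traces are equal-length lists of pupil samples for one
# participant; all names and numbers are illustrative.

def mean_trace(trials):
    """Sample-by-sample average across all trials of one participant."""
    n_trials = len(trials)
    return [sum(samples) / n_trials for samples in zip(*trials)]


def demean_trials(trials):
    """Subtract the participant's average response shape from each trial."""
    average = mean_trace(trials)
    return [[v - m for v, m in zip(trial, average)] for trial in trials]


def baseline_correct(trial, n_baseline):
    """Subtract the mean of the first n_baseline samples from one trial."""
    baseline = sum(trial[:n_baseline]) / n_baseline
    return [v - baseline for v in trial]


if __name__ == "__main__":
    trials = [[1.0, 2.0, 4.0], [3.0, 4.0, 4.0]]   # two hypothetical trials
    print(demean_trials(trials))                  # shared shape removed
    print([baseline_correct(t, 1) for t in trials])  # net change kept
```

      After demeaning, the traces sum to zero at every sample, which is why a demeaned figure shows only the relative effect and not any dilation or constriction common to all trials.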

      We agree that it is important to counter unrealistic interpretations of the effect. However, figures in the main article are the ones that depict the effects. Instead, it seems that additional clarification on these effects is needed. First and foremost, Figure 3 in the main manuscript visualizes the core effect: pupil size reveals that synesthesia is a sensory process and the phenomenology of the synesthetic experience can be measured physiologically. Secondly, this allows us to advance synesthesia (and phenomenology) research as a new and powerful method.

      No doubt, our effect is relative in nature (as almost any pupillometry, fMRI, EEG effect etc.). Including variation that is unrelated to the effect would increase rather than decrease confusion, as individual differences (i.e., how the pupil of an individual responds irrespective of the synesthetic experience) are not meaningful to the question we set out to answer. Individual variations in pupil response shape irrespective of synesthetic color brightness are removed in Figure 3 but still present in Supplementary Figure 4. Thus, Figure 3 is better suited to illustrate our core effect than Supplementary Figure 4, as individual average responses (illustrated on the right) cannot be meaningfully related to the core effect anymore, only the difference can be.

      At the same time, the reviewer is correct that this may, not so much among researchers as among a general audience, create the expectation that the pupil will always net dilate when experiencing a dark synesthetic percept. This is clearly not the case, but only over its counterfactual (i.e., not seeing that dark synesthetic percept). We now counter such an unrealistic expectation:

      “Note that the effects here are visualized as counterfactuals. So while the pupil dilated for dark relative to bright experienced colors in synesthetes, this does not mean that the pupil net dilates and constricts to dark and bright experienced colors relative to baseline, but only relative to the counterfactual (see Supplementary Figure 4 for net pupil size changes).”

      We updated the caption of Supplementary Figure 4 as follows:

      "Supplementary Figure 4: Pupil size change to graphemes, split by 0.5 reported color lightness (dark gray = low lightness; light gray = high lightness) without demeaning (i.e., removing the average pupil response shape in the 4s stimulus interval per individual irrespective of brightness perception). (…)"

      Responses to physical brightness modulations were only measured in the synesthete group, not in controls. The authors point out that pupillary light responses have been thoroughly characterized in previous studies, and conclude that synesthetes' responses were in line with the expectations both in terms of amplitude and latency. However, as we are not dealing with standardized measurements, subtle differences in pupil reactivity across the two populations remain a possibility. I recommend that this possibility is mentioned in the discussion.

      We agree with the reviewer: if there were any differences in the PLR between the two groups, they must be minor given that the responses follow those reported in the literature so closely. Yet, subtle differences cannot be ruled out fully unless tested, and it does not hurt to mention this in the discussion, which we now do as follows:

      Finally, pupil light responses in Block 2 were only assessed in synesthetes. While these closely match those of control populations [50,51], subtle between-group differences cannot be excluded and could ideally be assessed in future and replication work.

      Reviewer #2 (Public review):

      Synesthesia is a neurological condition where stimulation of one sensory channel leads to involuntary, automatic, and consistent experience of another, unrelated percept. For example, Sir Francis Galton (1880, Nature) famously described the robust tendency of some individuals (synesthetes) to associate numerals with a distinct color. Ever since, synesthesia keeps attracting a broad interest in the cognitive neurosciences in light of its implications for the study of domains such as perception, consciousness, and brain connectivity, among others.

      Strauch, Leenaars, and Rouw measured pupil size in a group of 16 grapheme-color synesthetes and two matched control groups. The participants were presented with gray digits - that is, visual stimuli having identical physical properties in terms of brightness. Each participant subsequently rated the corresponding evoked color and brightness: unlike controls, synesthetes did so in a very consistent and reliable fashion. Accordingly, this was also shown in their pupils: despite the same objective luminance, digits associated with brighter percepts caused their pupils to constrict and digits associated with darker percepts caused their pupils to dilate more than controls. These results highlight how crossmodal correspondences are deeply rooted in synesthetes, and put forward pupillometry as a particularly appealing biomarker for some phenomenological experience (at least those grounded in "brightness").

      Further strengths of the technique are its temporal resolution and its responsiveness to several constructs. Across several tasks, the authors show for example that responses to synesthetic light are somewhat slower than responses to real light (i.e., they are likely mediated), but at the same time faster than responses to mental imagery. The role of mental imagery can also be reasonably dismissed when considering the second feature of pupil size: its responsiveness to mental effort and cognitive load. The pupils tend to dilate with demanding, challenging tasks, and this was the case when control participants were asked to report the color of a digit for which they did not consistently experience a synesthetic association. The same task was, instead, seemingly effortless for synesthetes, again speaking in favor of the automaticity of number-color correspondences in their case.

      Overall, the findings by Strauch, Leenaars, and Rouw are highly significant for the field and likely to be impactful. The strength of their evidence, when accounting for the relatively small sample size and the inherent variability of both phenomenology (color perception and subjective reporting) and physiology (pupil size), is adequate and sufficiently convincing.

      Comments on revisions:

      I thank the authors for addressing all my comments in a satisfactory way. I think that the paper has improved, especially in terms of transparency of the reporting and clarity of the results.

      We thank R1, R2, and R3 for their very useful input to improve our manuscript.

    1. eLife Assessment

      This study presents valuable findings on the high prevalence of pain in women with polycystic ovary syndrome and its association with distinct future health risks across different racial groups. The evidence supporting the conclusions is compelling, utilizing a massive global dataset and rigorous propensity score matching to identify pain as a critical, yet underexplored, clinical marker. The work will be of interest to reproductive endocrinologists, medical biologists, and clinicians involved in the diagnosis and management of polycystic ovary syndrome.

    2. Reviewer #1 (Public review):

      Summary:

      This retrospective study provides new data regarding the prevalence of pain in women with PCOS and its relationship with health outcomes. Using data from electronic health records (EHR), the authors found a significantly higher prevalence of pain among women with PCOS compared to those without the condition: 19.21% of women with PCOS versus 15.8% in non-PCOS women. The highest prevalence of pain was observed among Black or African American (32.11%) and White (30.75%) populations. Besides, women with PCOS and pain have at least a 2-fold increased prevalence of obesity (34.68%) at baseline compared to women with PCOS in general (16.11%). Also, women with PCOS had the highest risk for infertility and T2D, but women with PCOS and pain had higher risks for ovarian cysts and liver disease. Regarding these results, the authors suggested the critical need to address pain in the diagnosis and management of PCOS due to its significant impact on patient health outcomes.

      Strengths:

      The problem of pain assessment in PCOS patients is well described and the authors provided a clear rationale for selecting the retrospective design to investigate this problem.

      A large number of analyzed patient records (76,859,666 women) and their uniformity increases the power of the study. Using the Propensity Score Matching makes it possible to reduce the heterogeneity of the compared cohorts and the influence of comorbid conditions.

      Analysis in different ethnic cohorts provides actual and necessary data regarding the prevalence of pain and its relationship with different health conditions that will be helpful for clinicians to make a diagnosis and manage PCOS in women of different ethnicities.

      Assessing the risk of different health conditions, including both PCOS-associated pathology and other common groups of diseases, in women with PCOS with or without pain makes it possible to differentiate the risk of comorbid conditions depending on the presence of one symptom (pelvic or abdominal pain, dysmenorrhea).

      Weaknesses:

      The significant weakness of the study is the absence of a Latin American cohort. Probably the White cohort includes Latin Americans or others, but results of the study cannot be extrapolated to particular White ethnicities.

      Comments on revised version:

      At present, I have no questions or recommendations for the authors, as they have exhaustively addressed the previous comments and incorporated the necessary corrections.

    3. Reviewer #2 (Public review):

      Summary:

      The study offers a thorough analysis of the prevalence of pain in women with polycystic ovary syndrome (PCOS) and its associations with health outcomes across various racial groups. Furthermore, the research investigates the prevalence of PCOS and pain among different racial demographics, as well as the increased risk of developing various conditions in comparison to individuals who have PCOS alone.

      Strengths:

      The study emphasizes pain as a significant comorbidity of PCOS, an area that is critically underexplored in existing literature. The findings regarding the increased prevalence of some of the diseases in the PCOS + pain group provide valuable direction for future research and clinical care. I believe physicians should incorporate pain score assessments into their clinical practice to improve patients' quality of life and raise awareness about pain management. If future research focuses on the mechanisms of pain, it would provide a better understanding of pain and allow for a focus on the underlying causes rather than just symptomatic management. The study also highlights the association between PCOS+pain and various comorbidities, such as obesity, hypertension, and type 2 diabetes, as well as conditions like infertility and ovarian cysts, offering a holistic view of the burden of PCOS.

      Weaknesses:

      Due to the nature of the retrospective design, some data may not be readily available in the EHR system. Diagnosis of PCOS and pain is based on ICD codes, which may lead to misclassification and may not capture symptom severity or patient-reported experiences.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This retrospective study provides new data regarding the prevalence of pain in women with PCOS and its relationship with health outcomes. Using data from electronic health records (EHR), the authors found a significantly higher prevalence of pain among women with PCOS compared to those without the condition: 19.21% of women with PCOS versus 15.8% in non-PCOS women. The highest prevalence of pain was observed among Black or African American (32.11%) and White (30.75%) populations. Besides, women with PCOS and pain have at least a 2-fold increased prevalence of obesity (34.68%) at baseline compared to women with PCOS in general (16.11%). Also, women with PCOS had the highest risk for infertility and T2D, but women with PCOS and pain had higher risks for ovarian cysts and liver disease. Regarding these results, the authors suggested the critical need to address pain in the diagnosis and management of PCOS due to its significant impact on patient health outcomes.

      Strengths:

      (1) The problem of pain assessment in PCOS patients is well described and the authors provided a clear rationale for selecting the retrospective design to investigate this problem.

      (2) The large number of analyzed patient records (76,859,666 women) and their uniformity increase the power of the study. Using propensity score matching makes it possible to reduce the heterogeneity of the compared cohorts and the influence of comorbid conditions.

      (3) Analysis in different ethnic cohorts provides actual and necessary data regarding the prevalence of pain and its relationship with different health conditions that will be helpful for clinicians to make a diagnosis and manage PCOS in women of different ethnicities.

      (4) Assessment of the risk of different health conditions, including PCOS-associated pathology as well as other common groups of diseases, in PCOS women with or without pain allows the risk of comorbid conditions to be differentiated depending on the presence of one symptom (pelvic or abdominal pain, dysmenorrhea).

      We would like to thank the Reviewer for their positive feedback on this manuscript. Pain assessment in women with PCOS is of paramount interest and because of a gap in this research area, we are trying to address it.

      Weaknesses:

      (1) Although the paper has strengths in methodology and data analysis, it also has some weaknesses. The lack of a hypothesis doesn't allow us to evaluate the aim and significance of this study.

      We would like to thank the Reviewer for their valuable feedback regarding the hypothesis of this study. We understand that the hypothesis may not have been written clearly under the objectives and we have corrected this in the formal revision.

      The primary hypothesis of this study is that women with PCOS experience a higher prevalence of pain (including dysmenorrhea, abdominal pain and pelvic pain) compared to women without PCOS, and this prevalence varies by racial groups. Our hypothesis aims to explore the relationship between PCOS and pain, the associated health risks, and the potential racial disparities in pain prevalence and long-term health outcomes. Additionally, we seek to assess the effect of treatment on reducing pain symptoms in women with PCOS. This study not only examines the immediate burden of pain but also investigates its long-term consequences, including risks of infertility, obesity, and type 2 diabetes.

      To enhance clarity for readers, we explicitly stated this hypothesis in the revised manuscript and have ensured that its connection to the study’s objectives is clearly articulated. We appreciate the Reviewer’s insights and have incorporated these refinements to strengthen the manuscript.

      (2) The exclusion criteria don't include conditions that can lead to symptoms similar to PCOS: thyroid diseases, hyperprolactinemia, and congenital adrenal hyperplasia. Thyroid status is not taken into account in the matching criteria. All of these conditions could affect both the prevalence results and the risk assessment.

      We would like to thank the Reviewer for highlighting the need to include these additional conditions that mimic PCOS. After excluding hypothyroidism, hyperprolactinemia, and adrenal hyperplasia from the PCOS and the PCOS and Pain cohorts, we observed that 7,690 patients (1.65%) with PCOS and 1,854 patients (1.36%) with PCOS and Pain were removed. Based on this observation, we added these three conditions to our exclusion criteria and reran all of our disease analyses for our resubmission. The manuscript, figures, and tables have been updated to reflect these exclusions. Additionally, we have added the rationale for excluding these conditions to the Discussion. With these major changes to the analysis, we aim to improve transparency and provide more accurate results and precise interpretations of our findings to the field.

      (3) The significant weakness of the study is the absence of a Latin American cohort. Probably the White cohort includes Latin Americans or others, but the results of the study cannot be extrapolated to particular White ethnicities.

      We appreciate the Reviewer’s suggestion to include Latin American cohorts in this study. The TriNetX platform has both self-reported race and ethnicity demographic information. In Table 3 - Figure Supplement 5 and Table 4 - Figure Supplement 6 we include baseline demographic information for both race (Asian, Black or African American, Native Hawaiian or Other Pacific Islander, Other, White, and Unknown Race) and ethnicity (Not Hispanic or Latino, Unknown, and Hispanic or Latino). In this paper we focused our future health outcome sub-analysis on four self-reported race groups: Asian, Black or African American, Other (Native Hawaiian or Other Pacific Islander, Other, Unknown Race), and White. We agree that including Latin American cohorts in the analysis is essential to better understand the health disparities affecting this population. Future work to better define Latin American cohorts in EHR data would significantly aid our ability to investigate this further.

      (4) The authors didn't provide sufficient rationale for future health outcomes and this list didn't include diseases of the digestive system or disorders of thyroid glands, which can also cause abdominal pain.

      We appreciate the Reviewer's comment and concern regarding additional rationale for future health outcomes. We originally chose to investigate general future health outcomes such as diseases of the digestive system, circulatory system, etc. These disease groups were selected because they are general and highly prevalent as future health outcomes for patients with PCOS and Pain.

      Our initial results highlight the prevalence of disorders of the digestive system (Figure 2). However, after considering the Reviewer's comments and to further strengthen our analysis, we included the most prevalent digestive system disorder in our relative risk (RR) analysis. Gastro-esophageal reflux disease (GERD) was identified as the most prevalent future digestive condition for women with PCOS and Pain (13.5%). There was also a 10.5% prevalence in women with PCOS overall.

      We were not able to include the same analysis for thyroid dysfunctions as this condition is part of our exclusion criteria. These updates have been incorporated into the revised manuscript to ensure clarity and completeness.
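
      As an illustration of the arithmetic behind the GERD figures quoted above, the following is a minimal sketch of a crude prevalence ratio. It is not the propensity-matched relative risk reported in the manuscript; the function name is hypothetical and the calculation ignores matching, follow-up time, and confidence intervals.

```python
def prevalence_ratio(exposed_pct: float, reference_pct: float) -> float:
    """Crude ratio of two prevalences, both given as percentages.

    Assumption: a simple unadjusted ratio, not a matched relative risk.
    """
    if reference_pct <= 0:
        raise ValueError("reference prevalence must be positive")
    return exposed_pct / reference_pct


# GERD prevalences quoted in the response: 13.5% (PCOS and Pain), 10.5% (PCOS overall).
gerd_ratio = prevalence_ratio(13.5, 10.5)
print(round(gerd_ratio, 2))  # 1.29
```

      On these two numbers alone, GERD is roughly 1.3 times as prevalent in the PCOS and Pain cohort as in the PCOS cohort overall.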

      Reviewer #2 (Public review):

      Summary:

      The study offers a thorough analysis of the prevalence of pain in women with polycystic ovary syndrome (PCOS) and its associations with health outcomes across various racial groups. Furthermore, the research investigates the prevalence of PCOS and pain among different racial demographics, as well as the increased risk of developing various conditions in comparison to individuals who have PCOS alone.

      Strengths:

      The study emphasizes pain as a significant comorbidity of PCOS, an area that is critically underexplored in existing literature. The findings regarding the increased prevalence of some of the diseases in the PCOS + pain group provide valuable direction for future research and clinical care. I believe physicians should incorporate pain score assessments into their clinical practice to improve patients' quality of life and raise awareness about pain management. If future research focuses on the mechanisms of pain, it would provide a better understanding of pain and allow for a focus on the underlying causes rather than just symptomatic management. The study also highlights the association between PCOS+pain and various comorbidities, such as obesity, hypertension, and type 2 diabetes, as well as conditions like infertility and ovarian cysts, offering a holistic view of the burden of PCOS.

      We sincerely appreciate the Reviewer’s insightful comments. We hope that our findings will encourage further research on the occurrence of pain in women with PCOS and that others will replicate our results to strengthen the evidence in this area. As noted in our introduction, there are currently no standardized abdominal pain score assessments specifically for women with PCOS. We hope that the findings from this study will contribute to efforts toward developing a standardized pain assessment for the PCOS community. In the meantime, further research across more diverse populations will be essential to build a more comprehensive understanding of this issue.

      Weaknesses:

      Due to the nature of the retrospective study, some data may not be readily available in the system. Instead of simply categorizing participants based on whether they experience pain, it would be more useful to employ a pain scale or questionnaire to better understand the severity and type of patients' pain. This approach would allow for a more thorough analysis of pain improvement following treatment with the three widely used medications for PCOS. Additionally, it would be beneficial for the authors to specify subtypes of the disease rather than generalizing conditions, such as mentioning specific digestive system disorders or mental health disorders. The lack of detailed analysis of specific disorders limits the depth of the findings and may lead the authors to draw incorrect conclusions.

      We thank the Reviewer for highlighting the importance of categorizing pain levels experienced by women with PCOS. However, there is currently no standardized pain assessment for abdominal pain, and therefore more research is required before such a classification can be made. Additionally, the electronic health record data we leveraged via the TriNetX platform does not include any pain scale data from unstructured notes. Despite these limitations, this study is an important step toward recognizing abdominal and pelvic pain in women with PCOS. Our findings indicate that women with PCOS report abdominal pain independent of digestive conditions such as irritable bowel syndrome, a condition often associated with pain in this population.

      We would like to thank the Reviewer for their thoughtful comment with respect to subtyping future health outcomes. To identify the most impactful future health outcomes affecting women with PCOS and Pain, we have included the top 5 most prevalent health outcomes associated with PCOS and Pain. Specifically, we included analysis for anxiety disorder, depressive episodes, essential hypertension, gastro-esophageal reflux disease (GERD), and acute pharyngitis. We observed that 17.1%, 11.5%, 10.5%, 10.0% of patients with PCOS and 20.1%, 13.7%, 13.5%, 13.3% of patients with PCOS and Pain were at risk of developing anxiety, depression, GERD, and acute pharyngitis, respectively. For our revision, we have included these 5 conditions in our PCOS, PCOS and Pain, and self-reported race-stratified future health outcome relative risk (RR) analyses. The revised manuscript, figures, and tables all reflect these changes.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I highly recommend checking all papers and supplements for misprints. There are a lot of missing spaces in the Introduction.

      We would like to thank the Reviewer for bringing this to our attention. We have carefully reviewed the manuscript and all supplementary materials and corrected formatting issues, including missing spaces and typographical errors throughout the Introduction and the rest of the document.

      (2) Supplementary Table 3: numbers from the first line in "%No PCOS" should be in "No PCOS"?

      We thank the Reviewer for bringing this error to our attention. We have identified the source of the problem and values have been added to the appropriate column.

      (3) Why do the authors use categorical data for overweight/obesity in the matching rather than the actual values? There are different stages of obesity that can be predominant in different cohorts and contribute to the results.

      We would like to thank the Reviewer for their insightful question. While TriNetX does have some BMI values for patient participants, this data is not included for all patients. For example, only 29-30% of women in the PCOS control and case cohorts have BMI recorded. Therefore, we focused on ICD codes for obesity instead to include as much data as possible.

      (4) What criteria were being used to determine hyperlipidemia and obesity? Were these criteria equal for all patients, or did they depend on ethnicity?

      We would like to apologize to the Reviewer for any confusion. The criteria to determine hyperlipidemia and obesity are ICD-10-CM codes as recorded in the TriNetX platform. The ICD-10-CM codes for obesity are E65-E68 and the ICD-10-CM code for hyperlipidemia is E78.5. Please also see the Methods section of this manuscript where all the ICD-10-CM codes are described.
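
      The code-based definition above can be sketched in a few lines. This is only an illustration of category matching on the codes quoted in the response (E65-E68 and E78.5); the function and constant names are hypothetical, and it does not reproduce how the TriNetX platform evaluates codes.

```python
from typing import Optional

# ICD-10-CM categories quoted in the response (assumed to be matched by 3-character category).
OBESITY_CATEGORIES = {"E65", "E66", "E67", "E68"}
HYPERLIPIDEMIA_CODE = "E78.5"


def classify_icd10(code: str) -> Optional[str]:
    """Return 'obesity', 'hyperlipidemia', or None for a single ICD-10-CM code."""
    normalized = code.strip().upper()
    if normalized[:3] in OBESITY_CATEGORIES:
        return "obesity"
    if normalized == HYPERLIPIDEMIA_CODE:
        return "hyperlipidemia"
    return None


print(classify_icd10("E66.9"))  # obesity
print(classify_icd10("E78.5"))  # hyperlipidemia
print(classify_icd10("E11.9"))  # None
```

      Because the definition depends only on recorded codes, the same rule applies to every patient regardless of ethnicity, which is the point made in the response.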

      (5) The Materials and Methods section should provide information regarding quality assurance checks and any steps taken to eliminate data suspected to be unreliable or invalid, to process missing data, and to handle data or claim duplicates. If quality assurance of the data was not conducted, this should be noted in the study limitations.

      We thank the Reviewer for this suggestion. We have revised the Methods section to explicitly describe the data quality assurance procedures inherent to the TriNetX platform. Specifically, we clarified that TriNetX applies standardized data mapping to controlled clinical terminologies (ICD, CPT, RxNorm), performs automated quality checks and excludes records that do not meet platform-defined standards.

      (6) It's not clear why the authors didn't include in the analysis the information regarding taking painkillers or anti-inflammatory drugs by patients. Maybe there is no such data in EHR. However, if the patient has some chronic inflammatory or autoimmune disease, she should be prescribed medication. I recommend specifying this issue in the section Material and Methods and/or study limitations.

      We would like to thank the Reviewer for this important suggestion. We have now clarified this point in the limitations section of the discussion. Specifically, we added text explaining that over-the-counter analgesics and anti-inflammatory medications are not reliably captured by EHR or within the TriNetX platform and therefore could not be evaluated in our analysis.

      (7) The authors should provide the Table or complete Supplementary Tables 2 and 3 with the parameters of patients used for matching.

      We apologize to the Reviewer for any confusion. The parameters used for propensity score matching are described fully in the Methods section of the paper. Table 2 – Figure Supplement 5 and Table 3 – Figure Supplement 6 display baseline characteristics for patients before and after the 1:1 propensity score matching using these parameters. We have now also added the propensity score matching parameters to the table descriptions for further clarification.

      (8) The authors found out that women with PCOS and pain have higher RR for ovary cysts and liver diseases compared to women with PCOS who have higher RR for infertility, obesity, and T2D. Discussion includes thoughts regarding a higher risk of ovary cysts and liver disease in women with PCOS and pain, but there is not any suggestion as to why women with PCOS and without pain have a higher risk of infertility, obesity, and T2D. If there is no data explaining this phenomenon, I recommend noting the need for additional research.

      We would like to thank the Reviewer for this helpful feedback. The Discussion section now includes deeper insights into the pathophysiology behind the two distinct PCOS phenotypes (PCOS overall vs. PCOS and Pain) and their differing risk profiles for future health outcomes. Specifically, we note that while women with PCOS overall may be more metabolically driven (higher risk of infertility, obesity, and T2D), women with PCOS and Pain show a higher risk of ovarian cysts and liver disease. We clarify that these findings are observational and hypothesis-generating and emphasize the need for future longitudinal and mechanistic studies.

      (9) The authors suggested that systematic contraceptives, metformin, or spironolactone reduce pain in PCOS women. The reduction is significant, but the number of patients with beneficial effects is low (2.5-7.5%). Is it enough to recommend prescribing this medication not only for PCOS treatment but also against pain?

      We thank the Reviewer for this important comment. We agree that although the reduction in pain diagnoses following treatment with COCPs, metformin, or spironolactone was statistically significant, the absolute proportion of patients experiencing benefit was modest. Our intention was not to recommend prescribing these medications solely for pain management, but rather to highlight that standard PCOS therapies may have additional benefits in reducing pain symptoms. We have clarified this point in the Discussion to emphasize that these findings are observational and hypothesis-generating, and that prospective studies are needed before these medications can be considered specifically for pain management in PCOS.

      Reviewer #2 (Recommendations for the authors):

      (1) Including a subtype analysis of specific digestive, respiratory, and mental health diseases rather than generalizing by system will enhance the content.

      We would like to thank the Reviewer for this helpful suggestion. In the revised manuscript, instead of the generalized disease systems we previously reported on, we have included analysis for the top 5 most prevalent conditions. Specifically, we included analysis for anxiety disorder, depressive episodes, essential hypertension, gastro-esophageal reflux disease (GERD), and acute pharyngitis. We observed that 17.1%, 11.5%, 10.5%, 10.0% of patients with PCOS and 20.1%, 13.7%, 13.5%, 13.3% of patients with PCOS and Pain were at risk of developing anxiety, depression, GERD, and acute pharyngitis, respectively.

      (2) Including the prevalence of dysmenorrhea among healthy populations would allow readers to better compare its impact on the lives of individuals with PCOS.

      We would like to apologize to the Reviewer for any confusion. The prevalence of dysmenorrhea for cases and control cohorts can be found in Table 2 – Figure Supplement 5 and Table 3 – Figure Supplement 6 before and after propensity score matching.

      (3) Introducing an analysis of age subgroups will provide readers with a clearer understanding of the prevalence of pain and specific diseases across different age groups.

      We would like to thank the Reviewer for this helpful suggestion. For this revision, we performed a sub-analysis to explore the prevalence of PCOS and of PCOS and Pain stratified by 10-year age groups. A bar plot of these results can be found in Figure 4 - Figure Supplement 7.
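
      A stratification of this kind can be sketched as follows. The bin boundaries (0-9, 10-19, and so on), the function names, and the toy records are all assumptions made for illustration; the response does not state how the authors defined their 10-year groups.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def age_group(age: int, width: int = 10) -> str:
    """Label the bin an age falls in, e.g. 27 -> '20-29' (assumed boundaries)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def prevalence_by_age_group(records: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Percent of patients with the condition in each age group.

    records: (age, has_condition) pairs.
    """
    totals: Dict[str, int] = defaultdict(int)
    cases: Dict[str, int] = defaultdict(int)
    for age, has_condition in records:
        group = age_group(age)
        totals[group] += 1
        if has_condition:
            cases[group] += 1
    return {group: 100.0 * cases[group] / totals[group] for group in sorted(totals)}


# Hypothetical toy records, not study data.
toy = [(22, True), (25, False), (31, True)]
print(prevalence_by_age_group(toy))  # {'20-29': 50.0, '30-39': 100.0}
```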

      Thank you again to the Reviewers for the positive and constructive feedback for this manuscript. We have made the appropriate edits and changes to the final revisions of the manuscript.

    1. eLife Assessment

      Rickert and colleagues demonstrate that the host peptidoglycan-binding protein PGLYRP1 has both beneficial and detrimental effects on Bordetella pertussis infection in mice. Using a solid array of techniques, the study provides useful insights into how the peptidoglycan fragment tracheal cytotoxin alters host immune responses, dampening inflammatory responses later in B. pertussis infection. These studies indicate that release of peptidoglycan fragments with particular structures can be used by bacteria to modulate NOD1 versus NOD2 responses to their advantage.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aim to demonstrate that PGLYRP1 plays a dual role in host responses to B. pertussis infection. PGLYRP1 signaling is known to activate bactericidal responses due to recognition of peptidoglycan. Through NOD1 activation and TREM-1 engagement, it appears PGLYRP1 also has immunomodulatory activities. The authors present mouse knockout studies and gene expression data to illustrate the role of PGLYRP1 in relation to B. pertussis peptidoglycan. Mice lacking PGLYRP1 had slightly lower pathology scores. When TCT peptidoglycan was removed from the bacteria, surprisingly IL23A, IL6, IL1B and other pro-inflammatory genes encoding cytokines increased. The relationship between TCT and PGLYRP1 suggests the pathogen uses this strategy to decrease immune activation. The authors went on to show the relationship between PGLYRP1 and TREM-1 as mediated by PGN using various versions of peptidoglycan. The study presents multiple angles of data to back up its findings and demonstrates an interesting strategy used by B. pertussis to down-regulate innate responses to its presence during infection.

      Strengths:

      Use of knockout mice of the key factor being considered paired with isogenic B. pertussis strains to reveal the mechanism of immune modulation to benefit the bacteria. The authors used in vivo gene expression paired with in vivo assays to establish each aspect of the mechanism.

      Weaknesses:

      The main focus was on innate responses, but some analysis of antigen-specific antibody responses could improve the impact of the findings.

      Comments on revised version.

      I have no further input to add.

    3. Reviewer #2 (Public review):

      Since its original discovery, the mechanistic basis for TCT-mediated pathogenesis of Bordetella pertussis has been a moving target and difficult to uncouple from confounding variables. The current study provides some exciting data that suggest PGLYRP-1 modulates host responses upon 'activation' by TCT. While there are some strengths associated with the unbiased approaches and collective data to support the claims associated with TCT and PGLYRP-1's function in this system, caution should be used when interpreting and extrapolating some of the information provided. While many of the initial concerns were addressed, one concern remains: using whole, intact PG sacculi from other species for comparative studies with a fragment of released PG (i.e., TCT).

      Comments on revised version.

      I have no further comments.

    4. Reviewer #3 (Public review):

      Summary:

      This study evaluates the contributions of the mammalian PG-binding protein PGLYRP1 to Bordetella infection. The authors find potential roles for PGLYRP1 in both bacterial killing (canonical) and regulation of inflammation (non-canonical). While these are interesting findings, as is the idea that PG fragment release has differential impacts on infection depending on fragment structure, the study is ultimately limited by the lack of connection between the in vivo and in vitro experiments, and determining the precise mechanism of how PGLYRP1 regulates host responses and bacterial fitness during infection requires further study.

      Strengths:

      (1) The combination of scRNAseq with in vitro and in vivo assays provides complementary views of PGLYRP1 function during infection.

      (2) The use of TCT-deficient B. pertussis provides a useful control and perturbation in the in vitro assays.

      Weaknesses/Areas for future study:

      (1) The study does not ultimately resolve the initial early versus late phenotype divergence. While the in vitro assays suggest explanations for their in vivo observations, further mechanistic links are lacking and necessary for the authors' conclusions throughout. To state one example, what is the early and late infection phenotype of TCT- Bp in mice lacking PGLYRP1? RNAseq data is reported from these mice but there are no burden or pathology studies. Furthermore, what are the neutrophil phenotypes (NOD-1/TREM-1 activation) in vivo? And are they dependent on PGLYRP1 and/or TCT? This will be an important topic of future study, as noted by the authors in their response.

      (2) It is unclear whether or how the NOD1 and TREM-1 pathways interact.

      (3) Many of the study's conclusions rely on the use of HEK293 reporter lines in the absence of bacterial infection, which may not be physiologically representative.

      Comments on revised version.

      The authors have responded adequately to my comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors aim to demonstrate that PGLYRP1 plays a dual role in host responses to B. pertussis infection. PGLYRP1 signaling is known to activate bactericidal responses due to recognition of peptidoglycan. Through NOD1 activation and TREM-1 engagement, it appears PGLYRP1 also has immunomodulatory activities. The authors present mouse knockout studies and gene expression data to illustrate the role of PGLYRP1 in relation to B. pertussis peptidoglycan. Mice lacking PGLYRP1 had slightly lower pathology scores. When TCT peptidoglycan was removed from the bacteria, surprisingly IL23A, IL6, IL1B, and other pro-inflammatory genes encoding cytokines increased. The relationship between TCT and PGLYRP1 suggests the pathogen uses this strategy to decrease immune activation. The authors went on to show the relationship between PGLYRP1 and TREM-1 as mediated by PGN using various versions of peptidoglycan. The study presents multiple angles of data to back up its findings and demonstrates an interesting strategy used by B. pertussis to downregulate innate responses to its presence during infection.

      Strengths:

      Use of knockout mice of the key factor being considered, paired with isogenic B. pertussis strains, to reveal the mechanism of immune modulation to benefit the bacteria. The authors used in vivo gene expression paired with in vivo assays to establish each aspect of the mechanism.

      Weaknesses:

      The main focus was on innate responses, and some analysis of antigen-specific antibody responses could improve the impact of the findings.

      The authors thank the reviewer for their careful reading of the manuscript. We agree that understanding the impact of peptidoglycan recognition in adaptive immunity, including antibody responses, would be beneficial. This is particularly apparent due to the pressing need for novel vaccination strategies for pertussis. To this end, we have modified the discussion section to highlight this and are embarking on detailed studies of the adaptive response generated with B. pertussis strains releasing alternative peptidoglycan structures.

      Reviewer #1 (Recommendations for the authors):

      (1) This reviewer is of the opinion that describing PGLYRP1 as a "bactericidal protein" seems misleading. "To determine whether PGLYRP1 has bactericidal activity against B. pertussis, we performed in vitro and ex vivo killing assays." Bactericidal activity was measured in normal or knockout neutrophils, but this seems to say that PGLYRP1 itself is an antimicrobial peptide. It clearly plays a role in the response, but it is a regulator and not a killing agent.

      We agree that ‘bactericidal’ is not the most accurate description and have revised the manuscript accordingly to be more accurate throughout results section 1.

      (2) PGN can induce IgM production. Antibody production of any type was absent from this study. Would IgA/IgM/IgG levels to B. pertussis or its TCT change due to PGLYRP1? To this reviewer, it would be good to use the serum and perform some ELISA analysis. It is also likely that T cell responses could be impaired; that may be outside the scope of this manuscript but could be considered for future studies.

      The authors thank the reviewers for this suggestion. We have added text to the discussion section to highlight the importance and potential of this suggestion.

      (3) Please include sources of mice (vendors) and strain numbers for transparency.

      The authors have added the relevant detail to the methods section to address this valid concern.

      (4) Were female or male mice or both used?

      For PGLYRP1 vs BALB/c comparisons both male and female mice were used. These are presented as combined data. No discernible differences were noted between male and female mice following infection. For single cell RNA sequencing studies only female mice were used, to be consistent with the published pertussis mouse model and avoid sex-based complications in analysis. We have clarified these details in the text.

      (5) It appears B. pertussis was cultured in SSM or BG. What condition was used for the bacteria used for the mouse challenge? SSM or BG?

      For mouse studies, bacteria were grown on BG agar supplemented with 10% defibrinated sheep blood for 48 hours, and the inoculum was prepared by suspending in PBS in accordance with our established protocols. For in vitro studies, liquid cultures were grown to mid-log in SSM. This has now been clarified in the methods section.

      (6) Are the raw RNAseq and scRNAseq reads deposited in SRA?

      Raw data have now been deposited in the Gene Expression Omnibus (GEO) under accession number GSE324217.

      (7) Is the scRNAseq data from one mouse or a pooled set of mice? If pooled, were the individual mice barcoded?

      scRNAseq data was obtained from barcoded individual samples and replicates were pooled and integrated during analysis, but the individual mouse each cell came from is still noted in the downstream analysis. This is now clarified in methods.

      (8) Why were some studies done by aerosol and others were done by intranasal delivery?

      The authors thank the reviewer for careful reading of the manuscript which erroneously listed aerosol infections. All infections in these studies were intranasal. This has now been rectified in the text.

      Reviewer #2 (Public review):

      Since its original discovery, the mechanistic basis for TCT-mediated pathogenesis of Bordetella pertussis has been a moving target and difficult to uncouple from confounding variables. The current study provides some exciting data that suggest PGLYRP-1 modulates host responses upon 'activation' by TCT. While there are some strengths associated with the unbiased approaches and collective data to support the claims associated with TCT and PGLYRP-1's function in this system, caution should be used when interpreting and extrapolating some of the information provided. For instance, the amount and purity of TCT used in the studies are unclear, and the in vitro activity of PGLYRP1 on B. pertussis is questionable. Different mouse backgrounds are used for various assays throughout, and it is known that the PRRs vary in these systems, so the confounding variables are difficult to uncouple. Additional concerns include the types of statistical tests being performed to support some of the claims and the relevance of using whole, intact PG sacculi from other species for comparative studies with a fragment of released PG (i.e., TCT).

      We thank the reviewer for their insightful suggestions to improve the standard of our manuscript and for highlighting several important considerations regarding our interpretation of TCT-mediated host responses. We have addressed the points made in the revised manuscript. In particular, we have amended the Methods section to include a description of the purification and quantification of tracheal cytotoxin. These additions clarify the dosing of TCT used throughout the manuscript. We have revised the Results and Discussion sections to avoid overstating the bactericidal activity of PGLYRP1 against B. pertussis and to more carefully describe in vitro observations. Our revised interpretation emphasizes the role of PGLYRP1 in modulating host immune responses. Additionally, we have clarified experimental design and strain usage descriptions in the Methods section. The reviewer provided valuable and insightful comments on the solubility and structure of the muropeptides studied. In response, we have revised the Results and Discussion sections to acknowledge these differences and the limitations they pose. Further, we have removed conclusions regarding the specific role of the 1,6-anhydro bond. The statistical analyses have been reviewed, validated, and clarified throughout the manuscript, and the Methods and figure legends have been updated.

      We appreciate the reviewer’s comments and believe the revisions have improved the clarity and rigor of the manuscript while maintaining the central conclusions about how peptidoglycan recognition influences host inflammatory responses during B. pertussis infection.

      Reviewer #2 (Recommendations for the authors):

      Major Points:

      (1) The concentration, purity, etc. of TCT seems like it is entirely unknown. Only a couple of experiments actually state the amount used, and it's unclear how the author determined the concentration because this is not trivial. Given the long-standing concerns with purity and co-purifying contaminants, this issue is paramount and needs to be properly addressed.

      TCT was purified by HPLC in the Goldman lab (UNC). Concentration was determined by comparing the peak area of each preparation to a purified TCT standard quantified by amino acid analysis. We have added these details to the Methods and now report concentrations throughout the manuscript.
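
      The quantification described above (comparing the peak area of each preparation to a quantified standard) can be sketched as a single-point external calibration. This is only an illustration under stated assumptions: it assumes a linear detector response passing through the origin and identical injection volumes, and the function name and all numbers are hypothetical, not values from the study.

```python
def concentration_from_peak_area(sample_area, standard_area, standard_conc):
    """Estimate sample concentration from HPLC peak areas.

    Assumes (hypothetically) a linear detector response through the origin,
    so concentration scales in proportion to peak area, and that sample and
    standard were injected at the same volume.
    """
    if standard_area <= 0:
        raise ValueError("standard peak area must be positive")
    return standard_conc * sample_area / standard_area


# Made-up example: a standard at 2.0 (arbitrary units, e.g. mM) gives area 100;
# a preparation with area 50 is then estimated at half that concentration.
estimate = concentration_from_peak_area(sample_area=50.0,
                                        standard_area=100.0,
                                        standard_conc=2.0)
print(estimate)
```

      A multi-point standard curve would be the more robust choice when the response is not known to be linear over the working range.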

      (2) Related to the effects of bacterial PG, studies performed are comparing TCT (a muropeptide) to commercially acquired, insoluble PG sacculi from B. subtilis and S. aureus. One cannot make these comparisons. There are flaws in terms of solubility (one goes into solution, the other does not), the amount used, the molar concentrations, etc. The authors also state that these are non-1,6 anhydro PG samples. That is not true. They contain plenty of 1,6 anhydroMurNAc, the moiety just exists in a different form. Finally, B. subtilis PG is not just mDAP, it's amidated, which is known to have effects on host response(s).

      We thank the reviewer for this important critique. We agree that differences in solubility and structural composition between TCT and PG sacculi limit direct comparisons. We have revised the Results and Discussion to remove statements implying direct equivalence and instead frame these experiments as highlighting how structural and physical properties of PGN fragments influence PGLYRP1-mediated activation of TREM-1. We have also removed statements regarding the 1,6-anhydro bond which were not adequately supported.

      (3) The claim that PGLYRP-1 is bactericidal in vitro is not supported by the data. Figure 1G shows that 24 hours after incubation, there is no difference. The comparison is being made to BSA, which is much higher (possibly because they're catabolizing it?) and thus entirely inappropriate. All other data in Figure 1 suggest no effect in vitro. In fact, it's this reviewer's position that none of the studies in Figures 1G, H, and I are convincing and should be entirely excluded.

      The authors agree that the language describing the bactericidal assays was not optimal and have revised it. The text in this Results section now describes the bacterial killing assays more carefully and reflects only the effects the data suggest, primarily by removing claims of bactericidal effects. BSA was chosen as a control protein (concentration-matched with PGLYRP1) based on published controls for PGLYRP bactericidal assays (Lu et al. 2006, JBC); similar results were obtained with PBS (volume-matched with PGLYRP1). The descriptions of Figures 1G, H, and I have been updated. The data in Figure 1H demonstrate that TCT release does not protect against the effects of PGLYRP1, despite free PGN inhibiting PGLYRP1 bactericidal activity in the published literature, while Figure 1I suggests that extracellular polysaccharides contribute to protection against PGLYRP1 activity, preventing a more bactericidal phenotype, which was not observed in the earlier assays when B. pertussis retained its capacity to produce bps polysaccharide.

      (4) Histology studies are unclear, and the data presented do not support the claims. Not only are the methods and results text describing the analysis contradictory, but nowhere are the actual statistical tests supporting the claims that they are different provided. This might be an oversight, but based on the variation, I would be surprised if they were statistically significantly different if proper tests are being used.

      Significance for pathology scores was initially determined using two-way ANOVA, as we had four groups (WT and KO at 4 and 7 DPI), providing p-values of 0.01 for WT vs KO at 7DPI and 0.003 at 4DPI. Following the reviewers’ suggestions, we have reanalyzed these data using a Mann-Whitney U test, which is more appropriate for comparisons between two groups. This analysis yielded p-values of 0.013 (4DPI) and 0.00316 (7DPI), respectively, confirming that the observed differences remain statistically significant. Statistical methods are now described in the Methods and figure legends.
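
      For readers unfamiliar with the Mann Whitney U test mentioned above, the following is a minimal pure-Python sketch of a two-group comparison with an exact two-sided p-value obtained by enumerating all group relabelings. The scores are invented for illustration and are not the study's pathology data; the function names are hypothetical, and in practice a statistics package would normally be used.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj; ties count 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)


def exact_two_sided_p(x, y):
    """Exact two-sided p-value by enumerating every split of the pooled data.

    Only practical for small groups, since the number of splits grows as
    C(n1 + n2, n1).
    """
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2.0  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - mean_u)
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in indices if i not in chosen_set]
        # Count splits at least as far from the null mean as the observed one.
        if abs(mann_whitney_u(gx, gy) - mean_u) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Invented scores for two hypothetical groups of five animals each.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
u_stat = mann_whitney_u(group_a, group_b)
p_value = exact_two_sided_p(group_a, group_b)
print(u_stat, p_value)
```

      With fully separated groups of five, only 2 of the 252 possible splits are as extreme as the observed one, which sets the smallest p-value this test can return at that sample size.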

      (5) The NOD reporter studies are not well controlled and should include a) mouse vs human for both NOD1 and NOD2; b) defined details in terms of how spent culture media was treated, amount of material normalized, etc., c) concentrations of all materials used.

      We appreciate the reviewer’s comments regarding the NOD reporter assays. In response: (a) we have clarified the description of the murine/human NOD reporter assays and included both human and mouse NOD1, along with controls; (b) we have expanded the descriptions of how conditioned (spent) culture media were collected, processed, and normalized in the ‘Bacterial strains and infections’ and ‘Reporter Cell Assays’ methods sections; and (c) the final concentrations of all agonists and test materials used in the reporter assays are now specified in the Methods and corresponding figure legends. Together, these additions address the requested controls and clarify the experimental conditions.

      (6) The scRNA-seq studies are provocative and informative, but the data shown are selectively included for the purposes of the paper. This is justified in terms of 'telling a story', but it's a disservice to the community not to include all the raw data attained. These should be deposited in an open-source system.

      The complete dataset has now been deposited in GEO (GSE324217) enabling full access for the community. The analyses presented in the manuscript focus on the datasets most relevant to the central conclusions.

      Minor points:

      (1) The authors refer to arthropod PGRPs but call them PGLYRPs. It is best to stick with the established nomenclature and use the proper names to distinguish each. There are a few sentences in the abstract that don't make sense as they're written.

      The authors thank the reviewer for their careful reading of the manuscript and have altered the manuscript to use PGRP for arthropod peptidoglycan recognition proteins.

      (2) The reciprocal result of bacterial burden at different time points in the context of PGLYRP-1 production in mice could be simply explained - it is bactericidal early, and the accumulation of dead/dying bacteria releases large pieces of PG that are not released during growth (anhydro) but rather lysis. It is the latter that causes the inverse relationship later.

      The authors believe this is an interesting and plausible explanation for differences in responses at different stages of disease. Further, we believe that elucidating the mechanism by which “large pieces of PG not released during growth” are recognized differently than PG from lysed bacteria is worthwhile. We speculate that the release of TCT could be a mechanism by which B. pertussis takes advantage of host differences in PG recognition. We thank the reviewers for this thought and have included this possible interpretation in the text.

      (3) The results section references Figure 1G while discussing results presented in Figure 1H.

      This has now been corrected.

      Reviewer #3 (Public review):

      Summary:

      This study evaluates the contributions of the mammalian PG-binding protein PGLYRP1 to Bordetella infection. The authors find potential roles for PGLYRP1 in both bacterial killing (canonical) and regulation of inflammation (non-canonical). While these findings, and the idea that PG fragment release has differential impacts on infection depending on fragment structure, are interesting, the study is limited by the lack of connection between the in vivo and in vitro experiments, and determining the precise mechanism of how PGLYRP1 regulates host responses and bacterial fitness during infection requires further study.

      Strengths:

      (1) The combination of scRNAseq with in vitro and in vivo assays provides complementary views of PGLYRP1 function during infection.

      (2) The use of TCT-deficient B. pertussis provides a useful control and perturbation in the in vitro assays.

      Weaknesses:

      (1) The study does not ultimately resolve the initial early versus late phenotype divergence. While the in vitro assays suggest explanations for their in vivo observations, further mechanistic links are lacking and necessary for the authors' conclusions throughout. To state one example, what is the early and late infection phenotype of TCT- Bp in mice lacking PGLYRP1? RNAseq data are reported from these mice, but there are no burden or pathology studies. Furthermore, what are the neutrophil phenotypes (NOD-1/TREM-1 activation) in vivo? And are they dependent on PGLYRP1 and/or TCT?

      (2) It is unclear whether or how the NOD1 and TREM-1 pathways interact.

      (3) Many of the study's conclusions rely on the use of HEK293 reporter lines in the absence of bacterial infection, which may not be physiologically representative.

      (4) The methods lack detail overall, and the experimental procedures should be described more concretely, especially for the scRNAseq datasets.

      We thank the reviewer for their comprehensive and fair assessment of our study and for highlighting both its strengths and areas where clarification could improve the manuscript. As noted in the review, the possibility that peptidoglycan fragment structure impacts disease pathogenesis is interesting, and the role of PGLYRP1 in regulating host and bacterial fitness during infection requires further study.

      We have addressed the points made by the reviewer in the revised manuscript. We edited the Methods section to provide additional experimental detail, particularly for the scRNA-seq analyses and reporter assays. We also clarified the experimental design and interpretation of the in vitro studies to avoid overstating mechanistic conclusions.

      The studies with TREM-1 and NOD attempt to assess multiple aspects of PGN/PGLYRP-mediated enhancement of inflammatory responses via NFkB/MAPKs. No attempts have been made to assess synergistic, overlapping, or compensatory effects between these systems. Other work from our group highlights the role of peptidoglycan in driving inflammatory responses via NOD receptors (doi: https://doi.org/10.1101/2025.08.08.669383) and TREM-1 (doi: 10.1128/IAI.00126-21). The work in this paper assesses the contribution of these pathways to the observed immune modulation by PGLYRP1.

      We have clarified figure legends and analyses, including interpretation of neutrophil transcriptional programs identified in scRNAseq datasets and comparisons to known neutrophil phenotypes.

      We appreciate the reviewer's feedback and the opportunity to improve the clarity of our manuscript and refine its conclusions and central findings.

      Reviewer #3 (Recommendations for the authors):

      (1) Please clarify in Figure 1C what the axis means, since the text refers to both uninfected and infected cells. What data allow the conclusion that PGLYRP1 expression "expanded" to other cell subsets?

      We thank the reviewers for catching this oversight. We were relying on data which we had not best represented in Figure 1C, so we updated this figure and corresponding text so that this violin plot demonstrates increased PGLYRP1 expression levels and an increasing or expanding number of cell types following infection. This is now also reflected in the text. Expression of PGLYRP1 is apparent in more cell types and to a greater extent following infection (red) with B. pertussis compared to PBS challenge (black). Expression represents normalized and transformed unique molecular identifier counts per gene per cell.

      (2) Please revise the Figure 1 legend to match the Figure panels, and mention the time point of the mPGLYRP1 killing assay in 1H/I. Were these assays performed at 6 or 24 hours? This could affect the interpretation of the data.

      This has been revised to reflect timing of data.

      (3) The text at the end of the first Results section is overstated, as the data in Figure 1 do not relate to immune-mediated clearance apart from expression levels.

      This text has been revised and the reference to immune-mediated clearance removed.

      (4) More detail is needed in the explanation of Figures 3E-G. Do the neutrophil subsets correspond to known subsets from the literature?

      When we overlaid established neutrophil signatures from the literature onto our dataset, the NOD2+ neutrophils most closely resembled inflammatory or activated neutrophil programs described previously (Xie et al. 2020, Nat. Immunol.; Veglia et al. 2021, J. Exp. Med.), specifically high Il1a, Ccl3, and Ptgs2 expression. In contrast, NOD1+ neutrophils showed greater overlap with resolving or regulatory neutrophil states, including genes associated with lipid mediator metabolism and NFkB dampening. Importantly, the clustering itself was not driven by NOD1 or NOD2 expression alone. NOD expression segregated within transcriptionally distinct neutrophil programs that are consistent with previously described inflammatory versus regulatory subsets. We have included descriptions of these inflammatory neutrophils and related them to previously identified neutrophil populations, which supports our findings and improves the presentation of the single-cell neutrophil data analysis. We deeply thank the reviewers for their help in improving this section.

      (5) The Methods section describes qPCR, but this is not presented in the Results.

      This has now been removed. We thank the reviewer for their careful and complete review of the manuscript.

    1. eLife Assessment

      This study provides a fundamental finding regarding the context-dependent roles of the JAK-STAT pathway (JSP) across different cellular compartments within the breast cancer microenvironment, supported by convincing evidence. The comments of the reviewers were sufficiently addressed.

    2. Reviewer #1 (Public review):

      Summary:

      In their manuscript, Zhou and colleagues present a detailed look at how the JSP functions differently in the various cells of a breast tumor. The authors have effectively shown that the JSP acts as a double-edged sword, as it helps T cells fight cancer but also allows tumor cells to grow and avoid ferroptosis. These findings are important because they identify a useful biomarker to predict how TNBC patients might respond to PD-1 inhibitors.

      Strengths:

      This work is important because it provides a clear explanation for the conflicting roles of the JSP in the tumor environment. The evidence is solid, as it combines data from thousands of patients with single-cell analysis and lab experiments to confirm the role of STAT4 in cancer progression and immunity.

      Comments on revised version:

      The authors made a significant effort to improve the manuscript. My comments were sufficiently addressed.

    3. Reviewer #2 (Public review):

      Summary:

      The JAK-STAT pathway (JSP) exhibits cell-type-specific functional heterogeneity in breast cancer. This study investigates the JSP in breast cancer and its response to anti-PD‑1 immunotherapy. JSP displays distinct cell‑type heterogeneity: it promotes malignant phenotypes and immunosuppression in tumor cells, while enhancing cytotoxicity and reducing exhaustion in T cells. Elevated JSP expression correlates with improved immunotherapy responses, especially in triple‑negative breast cancer. These findings highlight the paradoxical roles of JSP, indicating that broad inhibition may compromise anti‑tumor immunity.

      Strengths:

      The major strengths of this study include the comprehensive characterization of JSP heterogeneity across epithelial, tumor, and T cells in breast cancer. The identification of JSP and STAT4 as predictive biomarkers for immunotherapy response, particularly in triple‑negative breast cancer, provides clinically relevant insights for patient stratification.

      Weaknesses:

      The corresponding content has been revised.

    4. Reviewer #3 (Public review):

      Summary:

      This multi-omics study by Zhou et al elucidates the context-dependent roles of the Janus kinase-signal transducer and activator of transcription (JAK-STAT) pathway (JSP) across different cellular compartments in the breast cancer tumor microenvironment. While bulk JSP activity is associated with a favorable prognosis, single-cell analysis reveals a paradoxical landscape: high JSP in T cells drives anti-tumor cytotoxicity and reduces exhaustion, whereas high activity in tumor epithelial cells promotes malignancy and immunosuppression via the MIF-CD74 signaling axis. The JSP score (immune-related) serves as a robust predictive biomarker for response to anti-PD-1 immunotherapy, particularly in triple-negative breast cancer (TNBC). Furthermore, the study identifies the STAT4/SLC47A1 axis as a critical mechanism through which tumor cells resist ferroptosis, facilitating disease progression. These findings suggest that broad JAK-STAT inhibition may be counterproductive in cancer therapeutics; instead, therapeutic success depends on precise modulation and carefully timed interventions to preserve its T-cell-associated functions. This study may inspire future studies to explore specific factors that selectively modulate JAK-STAT activity in immune cells to achieve favorable therapeutic outcomes.

      Strengths:

      Significant therapeutic implications

      Weaknesses:

      Limited molecular mechanisms

      Comments on revised version:

      The authors have addressed my comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This multi-omics study provides a comprehensive characterization of the context-dependent roles of the JAK-STAT pathway (JSP) across different cellular compartments within the breast cancer microenvironment. The authors present convincing evidence that high JSP activity paradoxically drives anti-tumor cytotoxicity in T cells but promotes malignancy and immunosuppression in tumor epithelial cells, leading to the fundamental discovery that broad JAK-STAT inhibition could be therapeutically counterproductive. Ultimately, the identification of the immune-related JSP score and the STAT4 axis as predictive biomarkers for anti-PD-1 immunotherapy response, particularly in triple-negative breast cancer, offers critical insights for precise patient stratification and targeted therapeutic interventions.

      We greatly appreciate the editor’s insightful and comprehensive summary of our study.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their manuscript, Zhou and colleagues present a detailed look at how the JSP functions differently in the various cells of a breast tumor. The authors have effectively shown that the JSP acts as a double-edged sword, as it helps T cells fight cancer but also allows tumor cells to grow and avoid ferroptosis. These findings are important because they identify a useful biomarker to predict how TNBC patients might respond to PD-1 inhibitors.

      We highly appreciate Reviewer #1’s generous comments and thorough understanding of our study.

      Strengths:

      This work is important because it provides a clear explanation for the conflicting roles of the JSP in the tumor environment. The evidence is solid, as it combines data from thousands of patients with single-cell analysis and lab experiments to confirm the role of STAT4 in cancer progression and immunity.

      Weaknesses:

      However, there are areas for improvement in the scope of the review, the depth of analysis, and the potential for broader clinical implications. The authors are encouraged to address these issues to enhance the scientific and clinical impact of the study.

      We greatly appreciate the positive recognition and insightful comments from the reviewer. We are grateful that you acknowledge our solid evidence and the significance of clarifying the dual roles of JSP and STAT4. We will fully address your suggestions to expand the research scope, deepen the analysis, and strengthen the clinical implications in the revised manuscript.

      Major Issues:

      (1) The authors demonstrate that STAT4 upregulates SLC47A1, but this is currently supported only by expression correlation and western blot data. To confirm a direct link, the authors are encouraged to perform ChIP-qPCR or luciferase reporter assays to show that STAT4 binds directly to the SLC47A1 promoter.

      We highly appreciate this insightful and important comment. Due to time constraints, the first author has left the laboratory for clinical practice, and this manuscript is critical for fulfilling his degree requirements at Sichuan University. We are making every effort to supplement additional mechanistic experiments where feasible. In the meantime, we have performed protein–nucleic acid docking analysis between STAT4 protein and the SLC47A1 promoter region, and the corresponding results have been added to the supplementary figures.

      (2) The conclusion that the MIF-CD74 axis drives immunosuppression is based on computational inference. To support this, the authors could consider mining publicly available breast cancer spatial transcriptomics data to show the co-localization of MIF and CD74. Alternatively, performing simple dual-color immunofluorescence staining on a few clinical sections would effectively demonstrate the physical proximity of these cells.

      We sincerely appreciate your careful review and valuable suggestions. We fully agree that the conclusion regarding the MIF-CD74 axis driving immunosuppression requires further spatial evidence. Although we plan to collect additional clinical specimens for direct co-localization validation, the related ethical approval is still ongoing and cannot be completed in a short time. Therefore, we have supplemented analyses on publicly available breast cancer spatial transcriptomics datasets, which now provide solid bioinformatic evidence to support the spatial co-localization and interaction of the MIF-CD74 axis in the tumor microenvironment in the revised manuscript.

      (3) TNBC is highly heterogeneous and includes subtypes like mesenchymal and immunomodulatory groups. The authors should analyze whether the JSP score or STAT4 levels vary significantly between these subtypes, as this could further refine the selection of patients for JAK1 inhibitors.

      Thank you for this insightful suggestion. We have added analyses of the JSP score and STAT4 expression levels in two independent TNBC cohorts to explore their heterogeneity across the four TNBC subtypes (Fig. S5B-C).

      (4) While the JSP score works well in the current datasets, the authors should consider validating its predictive accuracy in additional independent immunotherapy cohorts, such as the TONIC trial, to ensure the biomarker is robust across different treatment settings.

      We sincerely appreciate this valuable suggestion regarding the validation of the JSP score in independent cohorts. To address your concern about the robustness of our biomarker across different treatment settings, we would like to provide the following clarification and updates:

      Status of TONIC-trial Data Access:

      We fully recognize the significance of validating the JSP score in the TONIC-trial (Nat Med 2019; https://www.nature.com/articles/s41591-019-0432-4), a seminal study exploring immune induction strategies for PD-1 blockade in metastatic TNBC. We have made persistent efforts to obtain these data. However, our previous application to the Data Access Committee (DAC) of the European Genome-phenome Archive (EGA, Study ID: EGAS00001003535) was declined. The official reason provided was a restriction on data sharing imposed by the US Department of Justice, related to Executive Order 14117, which prohibits the transfer of bulk sensitive personal data to certain countries.

      Compensatory Validation in Available Anti-PD-1 cohorts:

      Despite the limitation on the TONIC-trial data, we have rigorously evaluated the predictive accuracy of the JSP score in two additional, independent, and publicly available breast cancer cohorts treated with anti-PD-1 or anti-PD-L1 therapy to demonstrate its generalizability (Fig. S5A):

      GSE194040 (I-SPY2-990, Pembrolizumab, anti-PD-1): A cohort investigating anti-PD-1 therapy in metastatic breast cancer.

      GSE173839 (I-SPY2 trial, Durvalumab, anti-PD-L1): A cohort evaluating neoadjuvant anti-PD-L1 therapy in TNBC.

      We believe these additional validations adequately address your comment.

      Minor Issue:

      The manuscript mentions a U-shaped trajectory of JSP activity during tumor transition. A more detailed biological explanation of why the pathway activity initially drops and then rises would add depth to the discussion.

      We greatly appreciate this constructive comment. The JAK–STAT pathway (JSP) is essential for maintaining normal epithelial growth; its expression is higher in normal epithelium than in tumor tissues and increases during normal epithelial differentiation. In datasets containing both normal and tumor cell populations, JSP activity therefore declines during the transition from normal epithelium to early tumor lesions. In the subsequent tumor differentiation stage, JSP activity gradually rises, which may be driven by intrinsic tumor heterogeneity and differences in pathway dependency among subtypes. This dynamic trend is consistent with the JSP activity score, which is independent of pseudotime cell trajectory analysis. We have added this explanation in the first paragraph of the Discussion.

      Reviewer #2 (Public review):

      Summary:

      The JAK-STAT pathway (JSP) exhibits cell-type-specific functional heterogeneity in breast cancer. This study investigates the JSP in breast cancer and its response to anti-PD‑1 immunotherapy. JSP displays distinct cell‑type heterogeneity: it promotes malignant phenotypes and immunosuppression in tumor cells, while enhancing cytotoxicity and reducing exhaustion in T cells. Elevated JSP expression correlates with improved immunotherapy responses, especially in triple‑negative breast cancer. These findings highlight the paradoxical roles of JSP, indicating that broad inhibition may compromise anti‑tumor immunity.

      Strengths:

      The major strengths of this study include the comprehensive characterization of JSP heterogeneity across epithelial, tumor, and T cells in breast cancer. The identification of JSP and STAT4 as predictive biomarkers for immunotherapy response, particularly in triple-negative breast cancer, provides clinically relevant insights for patient stratification.

      Weaknesses:

      The findings rely heavily on public dataset analyses.

      We sincerely appreciate the reviewer’s insightful recognition and comprehensive summary of our study, as well as the positive comments on our strengths.

      We fully agree that the current findings are mainly based on multi‑omics analyses of public datasets. In response to this comment, we have supplemented additional validation using independent cohorts (e.g., FUSCC‑TNBC and METABRIC) to reinforce the reproducibility of the cell‑type-specific heterogeneity of the JAK–STAT pathway and the predictive value of JSP/STAT4 for immunotherapy response in TNBC.

      Moreover, we have clearly discussed this limitation in the Discussion section and explicitly proposed further prospective experimental validation and clinical sample verification in our future work.

      We have carefully revised the manuscript in full accordance with all of your valuable suggestions to further improve the quality and rigor of our work.

      Reviewer #3 (Public review):

      Summary:

      This multi-omics study by Zhou et al elucidates the context-dependent roles of the Janus kinase-signal transducer and activator of transcription (JAK-STAT) pathway (JSP) across different cellular compartments in the breast cancer tumor microenvironment. While bulk JSP activity is associated with a favorable prognosis, single-cell analysis reveals a paradoxical landscape: high JSP in T cells drives anti-tumor cytotoxicity and reduces exhaustion, whereas high activity in tumor epithelial cells promotes malignancy and immunosuppression via the MIF-CD74 signaling axis. The JSP score (immune-related) serves as a robust predictive biomarker for response to anti-PD-1 immunotherapy, particularly in triple-negative breast cancer (TNBC). Furthermore, the study identifies the STAT4/SLC47A1 axis as a critical mechanism through which tumor cells resist ferroptosis, facilitating disease progression. These findings suggest that broad JAK-STAT inhibition may be counterproductive in cancer therapeutics; instead, therapeutic success depends on precise modulation and carefully timed interventions to preserve its T-cell-associated functions. This study may inspire future studies to explore specific factors that selectively modulate JAK-STAT activity in immune cells to achieve favorable therapeutic outcomes.

      Strengths:

      Significant therapeutic implications.

      Weaknesses:

      Limited molecular mechanisms.

      We sincerely appreciate the reviewer’s highly positive recognition and insightful summary of our work. To address your comment regarding limited molecular mechanisms, we have expanded the mechanistic discussion in the revised manuscript, including detailed explanations of the dual cell-type-specific roles of the JSP, the downstream MIF-CD74 axis, and the STAT4/SLC47A1-mediated ferroptosis resistance mechanism. All related revisions have been incorporated into the text to strengthen the molecular depth and robustness of our findings.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) The Graphic Abstract in the current version fails to provide brief information about the submission.

      We appreciate your comment on the Graphic Abstract. We have drawn a new, concise Graphic Abstract that clearly summarizes the key findings, workflow, and core message of our submission. The updated version now provides brief but complete information about the study.

      (2) Information regarding the epidemiology of breast cancer and TNBC is recommended to be included in the Introduction section.

      In response to your comment, we have supplemented up-to-date epidemiological data for both breast cancer and triple-negative breast cancer (TNBC) in the revised Introduction section.

      (3) Attention should be paid to the superscript, particularly for CD8+.

      We have revised the plus sign in CD4/8+ to the standard superscript format (CD8⁺) throughout the entire manuscript.

      (4) Typos are present, such as the error in "2.1" (please verify and correct accordingly).

      We have carefully checked and revised the entire manuscript, especially the section 2.1 Bioinformatical profiling. All typos, grammatical errors, and formatting inconsistencies pointed out in your comment have been fully corrected throughout the text.

      (5) Relevant information about MCF-10A cells in the cell culture protocol is missing.

      We sincerely apologize for the omission of MCF-10A cell culture details. We have supplemented the complete cell culture protocol for MCF-10A cells in 2.2.1 Cell culture.

      (6) For the Western blot experiments, information about the dilution ratios (of primary/secondary antibodies) is required.

      We have supplemented the detailed dilution ratios for all primary and secondary antibodies used in the Western blot experiments.

      (7) The Ethics Approval Number must be provided.

      We have supplemented the official ethics approval number for animal experiments in Section 2.2.6.

      (8) For the IHC staining experiments, information about the dilution ratios (of antibodies) is required.

      We have supplemented the detailed antibody dilution ratios for all primary antibodies used in the IHC staining experiments in Section 2.2.7 Immunohistochemistry (IHC).

      (9) Up-to-date citations are necessary, especially those published in 2026.

      Following your suggestion, we have thoroughly updated the reference list, including the citations on breast cancer epidemiology.

      (10) Proofreading the language is recommended in order to enhance the fluency and readability of the manuscript.

      We have carefully polished the full manuscript with the help of a native English speaker to improve linguistic fluency, readability, and academic expression. All revisions have been completed strictly following your suggestions, and we deeply appreciate your efforts to help optimize this work.

      Reviewer #3 (Recommendations for the authors):

      Major points for the authors:

      (1) Please provide an overview figure of the datasets and approaches used in this study, as Figure 1.

      We sincerely appreciate your valuable suggestion. We have supplemented an overview figure (designated as Figure 1A) that systematically summarizes all datasets and experimental approaches used in this study, including the detailed workflow of bioinformatic profiling, pseudotime analysis, and functional validation.

      (2) The authors need to improve the organization of figure panels, as they appear cluttered in some regions, which impedes understanding of the figures.

      We sincerely appreciate your constructive comment. To address the cluttered figure panels that impeded understanding, we have redrawn Figures 2, 3, 5, and 6, and fine-tuned the image size, layout, and spacing of the panels.

      (3) The experimental section utilizes female mice for the MDA-MB-231 xenograft models. Given that a central finding of the paper is the pathway's role in T-cell-mediated anti-tumor immunity, the authors should discuss how the absence of a functional T-cell compartment in nude mice affects the interpretation of tumor growth data, or, ideally, provide data from immunocompetent syngeneic models.

      We thank the reviewer for this valuable comment. Given the deficient T-cell compartment in nude mice, the MDA-MB-231 xenograft model supports only our conclusion that STAT4 promotes tumor growth.

      We are currently constructing an orthotopic breast cancer model with stable STAT4 overexpression in 4T1 cells using immunocompetent mice, which possesses a complete immune microenvironment to further validate our immune-related findings. In addition, we plan to establish conditional STAT4 overexpression via the Cre/LoxP system in the MMTV-PyMT transgenic breast cancer mouse model. However, these elaborate in vivo validations cannot be completed within a short time frame due to experimental duration and technical limitations.

      This manuscript is critically important for the first author to complete their doctoral degree at Sichuan University. We sincerely appreciate the reviewer’s understanding and generous support for accepting our current data and future follow-up validation plans.

      (4) While the study links STAT4 to SLC47A1 upregulation, adding direct mechanistic evidence - such as ChIP-seq or luciferase reporter assays - would confirm that STAT4 directly binds the SLC47A1 promoter rather than acting through intermediary signaling.

      We highly appreciate this insightful and important comment. Due to time constraints, the first author has left the laboratory for clinical practice, and this manuscript is critical for fulfilling his degree requirements at Sichuan University. We are making every effort to supplement additional mechanistic experiments where feasible. In the meantime, we have performed protein–nucleic acid docking analysis between STAT4 protein and the SLC47A1 promoter region, and the corresponding results have been added to the supplementary figures.

      (5) Are there any potential upstream selective regulators of STAT4 in immune cells?

      IL‑12 acts as the upstream activator of STAT4 in immune cells. This cytokine binds to IL12R‑β1/β2, triggering Tyk2/Jak2 signaling to induce STAT4 phosphorylation, dimerization and nuclear translocation, thereby upregulating IFN‑γ transcription and enhancing T cell‑ and NK cell‑mediated antitumor immunity. We have added these details in the Discussion.

      (6) Recent studies have identified CD74+ lipid-associated macrophages (LA-MAMs) as a conserved niche in multi-organ metastasis of breast cancer. Linking the tumor-derived MIF-CD74 axis results to this broader metastatic framework could emphasize the clinical relevance of the findings.

      A recent study defines a conserved MIF-CD74 LA-MAM axis that drives T-cell exhaustion and multi-organ metastasis in breast cancer and predicts poor patient survival. Our work further reveals that tumor-intrinsic JAK-STAT signaling reinforces this immunosuppressive cascade, while T-cell STAT4 activation reverses immune suppression. Combining MIF-CD74 blockade with precise STAT4-targeted strategies may synergize to remodel the metastatic niche and enhance immunotherapy efficacy in TNBC. We have supplemented the relevant mechanistic details and literature discussion in the revised Discussion section.

      Minor points for the authors:

      (1) The use of "spokesperson" to describe STAT4's role as a representative of the JAK-STAT pathway is somewhat informal for a scientific manuscript. Adopting more standard academic phrasing, such as "primary mediator" or "key transcriptional orchestrator," would enhance the professional tone.

      Thank you for your valuable comment. We have revised the manuscript accordingly by replacing the informal term "spokesperson" with the standard academic phrase "key transcriptional orchestrator".

      (2) The JSP score achieved a predictive AUC of 0.70-0.76. The authors could improve the work by testing whether combining the JSP score with existing clinical biomarkers, such as PD-L1 IHC or Tumor Mutational Burden (TMB), significantly enhances predictive accuracy.

      We have made every effort to collect publicly available breast cancer immunotherapy datasets for further validation. Unfortunately, none of these datasets provided immunohistochemistry (IHC) data for PD-L1/PD-1 expression. To address your valuable suggestion, we instead integrated mRNA expression levels of PD-L1/PD-1 with the JSP score to predict immunotherapy response.

      In cohorts GSE194040 and GSE173839 (Fig. S5A), this combined model exhibited improved predictive performance with an AUC exceeding 0.8, which is superior to using the JSP score alone. The corresponding results have been added and presented in the supplementary figures.

      (3) There is a potential contradiction in which bulk JSP scores correlate with better survival, whereas tumor-intrinsic JSP scores correlate with poor survival. A clearer discussion or a specific figure reconciling how the dominant immune signal overrides the pro-tumor signal in bulk analysis would be beneficial.

      In survival profiling, higher T-cell-specific and normal-epithelium-specific JSP scores correlate with favorable patient survival, whereas elevated tumor-intrinsic JSP scores are associated with poor prognosis. This can be attributed to the predominant expression of JSP in T cells, which enhances T-cell-mediated anti-tumor immunity and counterbalances its pro-tumor effects within cancer cells. We have added detailed clarification of this dual regulatory mechanism in the Discussion section.

      (4) The authors cite recent publications regarding the benefits of late-stage or intermittent JAK inhibition. Providing a more detailed proposed dosing schedule or "therapeutic window" based on their differentiation data could offer more actionable insights for clinical trial design.

      Based on the above clinical evidence and our findings, administering JAK–STAT inhibitors before or concurrently with immunotherapy may impair T‑cell cytotoxicity and disrupt normal epithelial differentiation in breast cancer patients. Instead, sequential delivery of JAK inhibition following immunotherapy represents a promising immune‑sensitizing strategy, particularly for the TNBC subtype. We have added corresponding descriptions in the third paragraph of the discussion section.

      (5) The authors note that they are unable to refine the analysis for TNBC subtypes, such as mesenchymal-like (MES), due to data limitations. If possible, using the METABRIC cohort (which was already accessed) to perform a secondary validation of JSP activity across these specific molecular subtypes would add significant depth.

      We appreciate this constructive suggestion. To address the subtype heterogeneity of JSP activity in TNBC, we have collected two TNBC datasets (FUSCC-TNBC and 2024_Nat.Comm.) and conducted further validation and analysis across different TNBC molecular subtypes in Fig. S5B-C.

      (6) The discussion evaluates both broad JAK inhibitors (Ruxolitinib) and STAT3-selective inhibitors (TTI-101). Explicitly comparing the potential biological impact of selective STAT3 inhibition versus selective STAT4 activation could clarify the most promising therapeutic direction.

      We greatly appreciate this valuable suggestion. We have supplemented the Discussion (in the penultimate paragraph) by proposing a translational strategy utilizing the specific cytokine IL‑12 to activate STAT4 for immune sensitization, while explicitly comparing the distinct biological effects and therapeutic directions between selective STAT3 inhibition and targeted STAT4 activation.

      In summary, we sincerely thank the editors and reviewers for their constructive comments and valuable suggestions. We have carefully addressed all the comments and revised the manuscript accordingly.

    1. eLife Assessment

      This study presents a valuable framework for the rational design of bacterial probiotics to protect against respiratory infections. The evidence supporting the central claim - that metabolic niche overlap predicts probiotic efficacy - is solid, combining innovative in vitro modeling with in vivo validation, though the model appears less effective for probiotics that rely on antimicrobial metabolite production.

    2. Reviewer #2 (Public review):

      Summary:

      This study aims to establish a rational framework for designing bacterial probiotics against respiratory infections. The central hypothesis is that in vitro antagonism, particularly through metabolic niche overlap with a pathogen, predicts in vivo efficacy.

      Strengths:

      (1) Systematic pipeline: The study integrates bacterial isolation, in vitro characterization, model development, and in vivo validation into a cohesive workflow.

      (2) Quantitative model: The introduction of the Niche Index (NI) and Niche Index Fraction (NIF) provides a novel, quantitative tool for predicting probiotic efficacy based on ecological principles.

      (3) Mechanistic insight: The work dissects different modes of action, clearly demonstrating that inhibition can be driven by specialized metabolite production (CP8) or carbon resource competition (e.g., CP7), with lactate utilization identified as a key factor.

      Weaknesses:

      (1) Limited model generalizability: The predictive power of the NI model is not universal. It fails to account for the in vivo inefficacy of CP8 (a metabolite-dependent inhibitor) and cannot explain the short-term protection conferred by some non-inhibitory CPs in vivo, suggesting unmodeled mechanisms like immune priming are at play.

      (2) Preliminary nature of key findings: The emphasis on lactate consumption as a critical predictor, while interesting, is not sufficiently explored to establish its general importance beyond the specific strains and conditions tested.

      Appraisal:

      The authors successfully achieve their aim of establishing a rational probiotic-design pipeline. The data robustly support the conclusion that metabolic niche overlap predicts efficacy for many strains, while also clearly delineating the model's limitations, as acknowledged by the authors.

      Impact:

      This work provides a valuable methodological framework for hypothesis-driven probiotic discovery. The quantitative Niche Index offers immediate utility to the field and, with further refinement, has the potential to become a fundamental tool for developing respiratory therapeutics.

      Comments on revised version.

      I thank the authors for their meticulous revisions.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      A summary of what the authors were trying to achieve:

      (1) Identify probiotic candidates based on their phylogenetic proximity and their presence in the lower respiratory tract, using phylogenetic analysis and a meta-analysis of 16S rRNA sequencing of mouse lung samples.

      (2) Predefine probiotic candidates with overlapping and competing metabolic profiles based on a simple, easy-to-apply score that takes carbon source use into consideration.

      (3) Confirm the functionality of these candidate probiotics in vitro and define their mechanism of action (niche exclusion by either metabolic competition or active antibacterial strategies).

      (4) Confirm the probiotic action in vivo.

      Strengths:

      The authors attempt to go the whole nine yards, from the rational choice of phylogenetically close lower respiratory tract probiotics, through in silico modelling of a niche index based on the use of similar carbon sources with in vitro confirmation, to in vivo competition experiments in mice.

      Weaknesses:

      (1) The use of a carbon source is defined as growth to OD600 two SD above the blank level. While allowing a clear cutoff, this procedure does not take into account larger differences in the preferences of carbon sources between the pathogen and the probiotic candidate. If the pathogen is much better at taking up and processing a carbon source, the competition by the probiotic might be biologically irrelevant.

      While the definition of carbon utilization in this work is a commonly used definition, we agree that there are numerous ways that one could define carbon utilization. We also agree that it is possible that inclusion of additional features of carbon consumption such as the order of prioritization of carbon sources by CP could improve the model. Our data in Figure 3H and 3I do suggest that certain carbon sources may be disproportionately important for predicting antagonistic phenotypes. However, given that the objective of this work was to develop a simple model to aid in the design of probiotic communities, we feel that the current definition of carbon utilization allows maximum accessibility and is suitable for our needs. Work is currently underway to identify additional features, such as carbon source processing efficiency, that may improve the model’s utility.
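
      The binary utilization call discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: it assumes the threshold is the blank mean plus two standard deviations of the blank wells and that replicate sample wells are averaged, neither of which is specified in the response. The function name and all OD600 readings are hypothetical.

```python
from statistics import mean, stdev


def uses_carbon_source(sample_od600, blank_od600, n_sd=2.0):
    """Binary call: True if the mean sample OD600 exceeds
    the blank mean plus n_sd blank standard deviations."""
    threshold = mean(blank_od600) + n_sd * stdev(blank_od600)
    return mean(sample_od600) > threshold


# Hypothetical plate-reader values, for illustration only.
blank = [0.050, 0.052, 0.048, 0.051]
lactate_wells = [0.310, 0.295, 0.322]   # clear growth
citrate_wells = [0.051, 0.053, 0.050]   # indistinguishable from blank

print(uses_carbon_source(lactate_wells, blank))
print(uses_carbon_source(citrate_wells, blank))
```

      A cutoff of this kind records only whether growth occurred. A strain that barely clears the threshold and one that grows to high density receive the same call, which is the limitation the reviewer raises.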

      (2) The authors do not take into account the growth of candidate probiotics in the presence of Bt. In monoculture, three of the four most potent candidate probiotics grow to comparable levels as Bt in LSM.

      Yes, our model only accounts for a one-way interaction (the effect of the CP on the pathogen). This is for two reasons: (1) we are only interested in characterizing and modeling the antagonistic potential of the CP on the pathogen, as this antagonism, we propose, is what gives a CP therapeutic potential; and (2) the degree to which co-culture with Bt impacts CP activity will be captured by the competition experiments performed, and therefore any inhibition of the CP by Bt will be accounted for.

      While further investigation of the effect of Bt on the growth of the CP may not be necessary to achieve our objective, we agree that ecologically it would be interesting to understand this dynamic better. To explore this, we conducted co-culturing studies between each CP and Bt or media-only control and measured the amount of CP after 24 hours of co-culture. From the data it appears that only a small number of organisms (CP4, CP7 and CP19) are significantly inhibited by the pathogen at the 1:1 ratio tested. This result is perhaps unsurprising as these CPs have the highest niche index and therefore have a greater metabolic overlap with the pathogen.

      These data have been incorporated into Figure S1B and additional text has been added to line 157 and the methods at line 712.

      (3) Niche exclusion in vivo is not shown. Mortality of hosts after infection with Bt is not a measure for competition of CP with the pathogen. Only Bt titers would prove a competitive effect. For CP17, less than half of the mice were actually colonized, but still, there is 100% protection. Activation of the host immune system would explain this and has to be excluded as an alternative reason for improved host survival.

      We have revised the manuscript to address these issues as follows:

      (1) We include Bt titer data as suggested, displayed in a new figure (Figure 5F). The results indicate that CP8 fails to reduce Bt titers compared to the no-CP control, whereas the other CPs tested (CP13, CP17, CP19, CP20, and CP26) reduce Bt titers to a statistically significant degree (p-value < 0.05 by ANOVA/Tukey). These results support the idea that the CPs competitively exclude Bt in vivo (as they do in vitro), with the notable exception of CP8 (which competitively excludes in vitro but not in vivo, consistent with the mortality results). Further, an additional Spearman correlation analysis was performed to examine the relationship between the Niche Index value of a given CP and pathogen load following pre-treatment with that CP. We found a strong relationship between NI<sub>CP</sub> and pathogen load (r = -0.84, P<0.0001, 95% CI [-0.90 to -0.76], N = 77), such that prophylactic treatment with a CP with a high Niche Index value strongly correlated with lower pathogen load following Bt challenge. Text describing these findings has been added at line 471.

      (2) We include survival studies of mice prophylactically treated with non-viable CPs, displayed in a new figure (Figure S7). Viability is required for niche exclusion, so protection conferred by non-viable CPs must be due to other effects such as elicited immune responses. We found that non-viable CPs provide some protection when administered at 3 days prior to Bt challenge, though not to the same degree as viable CPs. Together, our data suggest that with the day 3 dosing schedule there are alternative mechanisms of protection (potentially including immune priming) that our current model does not capture. These results are described in further detail at line 460.
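
      For readers unfamiliar with the rank correlation reported in point (1), the following is a minimal sketch of how a Spearman coefficient between Niche Index and pathogen load could be computed. It is not the authors' code: the helper names are hypothetical, the data are invented for illustration, and the confidence interval and P value reported above are not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-mouse values, for illustration only.
niche_index = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
log10_bt_cfu = [7.9, 7.5, 6.8, 6.9, 5.2, 4.1]

rho = spearman_rho(niche_index, log10_bt_cfu)
print(f"Spearman rho = {rho:.3f}")
```

      A negative coefficient, as in the response, means that mice pre-treated with higher-NI candidates tend to carry lower pathogen loads. Because the statistic uses ranks only, it captures a monotonic trend without assuming a linear one.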

      Appraisal:

      (1) Based on phylogenetic comparison and published resources on lower respiratory tract colonizing bacteria, the authors find a reasonably good number of candidate probiotics that grow in LSM and successfully compete with the pathogenic target bacterium Bt in vitro.

      (2) In vivo, only host survival was tested, and a direct competition of CP with Bt by testing for Bt titers was not shown.

      Impact:

      Niche exclusion based on competition for environmentally provided metabolites is not a new concept and was experimentally tested, e.g. in the intestine. The authors show here that this concept could be translated into the resource-poor environment of the respiratory tract. It remains to be tested if the LSM growth-based competition data in vitro can be translated into niche exclusion in vivo.

      Reviewer #2 (Public review):

      Summary:

      This study aims to establish a rational framework for designing bacterial probiotics against respiratory infections. The central hypothesis is that in vitro antagonism, particularly through metabolic niche overlap with a pathogen, predicts in vivo efficacy.

      Strengths:

      (1) Systematic pipeline: The study integrates bacterial isolation, in vitro characterization, model development, and in vivo validation into a cohesive workflow.

      (2) Quantitative model: The introduction of the Niche Index (NI) and Niche Index Fraction (NIF) provides a novel, quantitative tool for predicting probiotic efficacy based on ecological principles.

      (3) Mechanistic insight: The work dissects different modes of action, clearly demonstrating that inhibition can be driven by specialized metabolite production (CP8) or carbon resource competition (e.g., CP7), with lactate utilization identified as a key factor.

      Weaknesses:

      (1) Limited model generalizability: The predictive power of the NI model is not universal. It fails to account for the in vivo inefficacy of CP8 (a metabolite-dependent inhibitor) and cannot explain the short-term protection conferred by some non-inhibitory CPs in vivo, suggesting unmodeled mechanisms like immune priming are at play.

      The NI model is not able to identify antagonism by metabolite-dependent inhibitors, as their inhibitory activity is unrelated to the variables for which the model accounts. Based on the NI model, CP8 is predicted to have the least metabolic overlap with the pathogen, which may explain its in vivo inefficacy. We do agree that short-term protection is only moderately related to NI (r = 0.48, P<0.0001, 95% CI [0.33 to 0.62], N = 115) and may represent an unmodeled alternative mechanism of protection, as discussed at lines 445, 466, and 523. We have added data in Figure S6 and corresponding text at line 444 that give further information about CP8 colonization in the context of infection.

      (2) Preliminary nature of key findings: The emphasis on lactate consumption as a critical predictor, while interesting, is not sufficiently explored to establish its general importance beyond the specific strains and conditions tested.

      Indeed, our model and assertions about critical predictors of antagonism only extend to the specific strains and conditions tested. While we cannot assert that lactate consumption is a critical predictor of antagonism universally, several other studies have indicated the importance of lactate in infection at other body sites [53-57].

      To further characterize the role of lactate utilization in the respiratory context, we performed an ex vivo experiment to measure lactate concentrations in respiratory tissue with or without treatment with a key isolate - CP19. After 24 hours of incubation, we found that lactate levels were significantly reduced in the CP19-containing homogenate compared to the PBS-only control (Figure S8A). Additionally, the pathogen was unable to grow in the CP19 conditioned homogenate but was able to grow in the untreated homogenate (Figure S8B). This indicates that CP19 can deplete the total lactate in lung tissue, and that this conditioning can inhibit pathogen growth in the lung tissue. These results are reported in a new supplementary figure (Figure S8) and summarized in corresponding text (line 485), with a description of the experimental procedure in the Methods section (line 924). While this does not prove our theory about the importance of lactate utilization universally, we believe that our work contributes to the growing body of evidence around lactate and its role in infection. Work is ongoing to expand the number of strains screened and determine the generalizability of particular carbon sources and their role in interbacterial antagonism.

      Appraisal:

      The authors successfully achieve their aim of establishing a rational probiotic-design pipeline. The data robustly support the conclusion that metabolic niche overlap predicts efficacy for many strains, while also clearly delineating the model's limitations, as acknowledged by the authors.

      Impact:

      This work provides a valuable methodological framework for hypothesis-driven probiotic discovery. The quantitative Niche Index offers immediate utility to the field and, with further refinement, has the potential to become a fundamental tool for developing respiratory therapeutics.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data or analyses.

      (1) CP titers at the end of the coculture experiment are missing in LSM.

      To quantify pathogen abundance after co-culture, cultures were plated on carbenicillin-100 to select for only colonies of the pathogen. As a result, no data about CP abundances were collected in the original experiments. However, we agree that ecologically it would be interesting to understand this dynamic better. We have added additional data about the impact of the pathogen on CP in co-culture to Figure S1B.

      (2) Bt titers in mice are essential to claim niche exclusion happens in vivo, and immune-mediated effects have to be excluded.

      Please see response to question 3 of the public review.

      (3) The definition of the use of carbon sources should be refined. Qualitative differences between the pathogen and the CP with regard to the usage of a given carbon source might have a substantial impact on the actual competitive effect.

      The definition of carbon utilization is stated at line 811. While we agree that there may be other carbon-consumption-related variables (rate of growth on a particular carbon source, amount of biomass generated on that carbon source, etc.) that could be used in the model, for the purposes of this study a binary definition (can versus cannot grow on the carbon source) was sufficient. Work is ongoing to determine whether metrics of growth on carbon sources such as those listed would improve the predictive capability of the model.

      Reviewer #2 (Recommendations for the authors):

      (1) Experimental & Analytical Suggestions:

      (a) To further validate the role of lactate, consider measuring lactate concentration in the airways of mice colonized by key CPs (e.g., CP7, CP19) versus controls. This would directly test if in vivo protection correlates with local lactate depletion.

      Unfortunately, because funding for this project has ended, we were unable to perform additional animal experiments. However, we were still able to test lactate utilization by CP19 in the respiratory context via an ex vivo experiment. We inoculated mouse lung homogenates with 10<sup>6</sup> CFU of CP19, or PBS as a negative control, and co-incubated for 24 hours. After 24 hours, we measured lactate levels and found that they were significantly reduced in the CP19-containing homogenate compared to the PBS-only control (Figure S8A). Additionally, we measured the growth of the pathogen in CP19-conditioned (+CP19) and untreated (-CP19) homogenates and found that the pathogen was unable to grow in the CP19-conditioned tissues (Figure S8B). This indicates that CP19 can deplete the total lactate in lung tissue, and that this conditioning can inhibit pathogen growth in the lung tissue. These results are reported in a new supplementary figure (Figure S8) and summarized in corresponding text (line 485), with a description of the experimental procedure in the Methods section (line 924).

      (b) The finding that CP8 provides no in vivo protection despite in vitro efficacy warrants further investigation. We suggest quantifying CP8 and Bt loads in co-colonized mice to determine if the probiotic fails to persist during infection or if the pathogen evades inhibition.

      Please see updated Figure S6 and accompanying text at line 444.

      (2) Quantitative Analysis:

      Please consider adding a brief justification in the manuscript explaining why the specific Niche Index formula (based on electron equivalents of shared carbon sources) was selected over alternative ecological metrics for quantifying niche overlap.

      Text was added starting at line 264 explaining our reasoning for choosing this model.

    1. eLife Assessment

      This study identifies apoptotic retinal ganglion cells as a potential source of ATP-mediated activation of PANX1 channels that initiate developmental retinal Ca²⁺ waves and coordinate microglial activation and vascular outgrowth with postnatal maturation. The work is important because it proposes an integrative framework linking programmed cell death, spontaneous neural activity, immune responses, and angiogenesis into a self-regulating developmental loop. The multimodal data are solid, but the mechanistic conclusions would be strengthened by complementary genetic approaches, such as PANX1 or BAX knockout models, to establish direct causality.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents a potentially important integrative model linking spontaneous retinal waves, apoptosis, microglial activity, and vascular development during postnatal retinal maturation. Its significance lies in proposing a mechanistic framework that could reshape understanding of how neural activity and tissue remodeling are coordinated in the developing central nervous system. The evidence is strengthened by the use of multiple complementary techniques, including Ca++ imaging, high-throughput electrophysiology, transcriptomics, histology, and pharmacology.

      Strengths:

      (1) Multimodal Validation: The authors correlate large-scale functional imaging (calcium imaging and MEA) with high-resolution structural and molecular data (scRNA-seq and IHC), providing strong topographical evidence for the "centrifugal expansion" pattern.

      (2) The primary significance lies in identifying apoptotic Retinal Ganglion Cells (RGCs) as the physiological "pacemakers" for stage II retinal waves. By linking programmed cell death directly to neural activity and subsequent angiogenesis, the authors propose a self-regulating developmental loop.

      Weaknesses:

      (1) While the PANX1 pharmacological data provide compelling functional support, extending these conclusions to the broader CNS may be premature. Additional direct mechanistic validation would further strengthen the claim of causality.

      (2) While the manuscript beautifully illustrates the co-occurrence of events during retinal development, strengthening the distinction between correlation and direct causation would enhance the impact of the findings.

    3. Reviewer #2 (Public review):

      Summary:

      Savage et al. investigate the synchronization of retinal Ca2+ waves with developmental cell death, microglia activation, and vascular outgrowth. These developmental processes occur through a mechanism where apoptotic cells release ATP through Panx-1 channels to stimulate both Ca2+ retinal waves and microglia activation. Using scRNAseq, the authors classify autofluorescence cell clusters (ACCs) at the leading edge of vasculature outgrowth as Hmox-1+ microglia. From here, they show microglia engulfment of apoptotic RGCs, and the potential release of ATP may contribute to Ca2+ wave generation. The authors demonstrate these mechanisms through the use of two pharmacological agents to either block the ATP release from Panx-1 or block receptor binding to ATP. Furthermore, while previous studies have described the site of initiation of retinal Ca2+ waves as random, this study shows that the initiation of Ca2+ waves is biased to the leading edge of vascular growth in the developing retina. To do this, the authors use a combination of wide-field Ca2+ imaging and multi-electrode arrays to pinpoint the sites of Ca2+ wave initiation in the developing retina.

      Strengths:

      The authors use several techniques to interrogate these mechanisms, including single-cell RNAseq, wide-field Ca2+ imaging, and multi-electrode arrays. With these experiments, this manuscript proposes several novel ideas, such as ATP as the Ca2+ wave-initiating cue, and the localization of the Ca2+ wave initiation to the leading edge of vascular growth.

      Weaknesses:

      The main weakness of the manuscript is the overreliance on only two pharmacological agents to test the central hypotheses. These conclusions would be strengthened if, in addition to their pharmacological manipulations, they used genetic knockout models to perturb programmed cell death or ATP release (i.e., BAX-KO, Panx-1 KO).

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study presents a potentially important integrative model linking spontaneous retinal waves, apoptosis, microglial activity, and vascular development during postnatal retinal maturation. Its significance lies in proposing a mechanistic framework that could reshape understanding of how neural activity and tissue remodeling are coordinated in the developing central nervous system. The evidence is strengthened by the use of multiple complementary techniques, including Ca++ imaging, high-throughput electrophysiology, transcriptomics, histology, and pharmacology.

      Strengths:

      (1) Multimodal Validation: The authors correlate large-scale functional imaging (calcium imaging and MEA) with high-resolution structural and molecular data (scRNA-seq and IHC), providing strong topographical evidence for the "centrifugal expansion" pattern.

      (2) The primary significance lies in identifying apoptotic Retinal Ganglion Cells (RGCs) as the physiological "pacemakers" for stage II retinal waves. By linking programmed cell death directly to neural activity and subsequent angiogenesis, the authors propose a self-regulating developmental loop.

      We thank the reviewer for their nice summary and for highlighting the strengths of this work.

      Weaknesses:

      (1) While the PANX1 pharmacological data provide compelling functional support, extending these conclusions to the broader CNS may be premature. Additional direct mechanistic validation would further strengthen the claim of causality.

      We agree with the reviewer that the conclusions would be greatly solidified with more direct mechanistic validation. However, we are unable to conduct more experimentation as the grant is finished and the Sernagor lab is in the process of being shut down, after the unexpected passing of the PI.

      To make clearer that this mechanism was identified in retinal tissue rather than in the wider CNS, we have moved all mention of the implications of our work for a broader CNS mechanism to the discussion section. We will add text to the discussion highlighting the need for more mechanistic investigation to uncover the full extent of the developmental processes described herein.

      (2) While the manuscript beautifully illustrates the co-occurrence of events during retinal development, strengthening the distinction between correlation and direct causation would enhance the impact of the findings.

      We have been careful to present our findings only as correlational, as we were unable to fully explore causation within the mechanisms presented. In the discussion, we have used published evidence and experimental papers to bolster our understanding of the causal aspects of this research. We will also include text addressing what experimentation is required to examine the causal interactions more directly.

      Reviewer #2 (Public review):

      Summary:

      Savage et al. investigate the synchronization of retinal Ca2+ waves with developmental cell death, microglia activation, and vascular outgrowth. These developmental processes occur through a mechanism where apoptotic cells release ATP through Panx-1 channels to stimulate both Ca2+ retinal waves and microglia activation. Using scRNAseq, the authors classify autofluorescence cell clusters (ACCs) at the leading edge of vasculature outgrowth as Hmox-1+ microglia. From here, they show microglia engulfment of apoptotic RGCs, and the potential release of ATP may contribute to Ca2+ wave generation. The authors demonstrate these mechanisms through the use of two pharmacological agents to either block the ATP release from Panx-1 or block receptor binding to ATP. Furthermore, while previous studies have described the site of initiation of retinal Ca2+ waves as random, this study shows that the initiation of Ca2+ waves is biased to the leading edge of vascular growth in the developing retina. To do this, the authors use a combination of wide-field Ca2+ imaging and multi-electrode arrays to pinpoint the sites of Ca2+ wave initiation in the developing retina.

      Strengths:

      The authors use several techniques to interrogate these mechanisms, including single-cell RNAseq, wide-field Ca2+ imaging, and multi-electrode arrays. With these experiments, this manuscript proposes several novel ideas, such as ATP as the Ca2+ wave-initiating cue, and the localization of the Ca2+ wave initiation to the leading edge of vascular growth.

      We thank the reviewer for their nice summary and for highlighting the strengths of this work.

      Weaknesses:

      The main weakness of the manuscript is the overreliance on only two pharmacological agents to test the central hypotheses. These conclusions would be strengthened if, in addition to their pharmacological manipulations, they used genetic knockout models to perturb programmed cell death or ATP release (i.e., BAX-KO, Panx-1 KO).

      We thank the reviewer for their insightful suggestions for further experimentation to bolster the research. Initially, we utilised pharmacological interventions because they allowed us to address the research question acutely and quickly. At the outset of the research, we were not certain that purinergic release through PANX-1 channels was the mediator of the developmental mechanisms described. We tested a wide variety of specific agonists and blockers before seeing any profound effects on wave generation. These agonists and antagonists have been used before and are proven to deliver reliable results. In addition, since the ACCs had never been reported before, we were unsure whether a knockout animal would display the same anatomical phenotype. Furthermore, it is known that knockout mouse lines, especially those targeting connexins and hemichannel pores, may not lose function, because other isoforms or compensatory mechanisms can substitute for the original function. For the retina, for example, it was shown that Cx36 can functionally replace Cx45 after Cx45 KO (Frank et al, 2010).

      We agree that direct mechanistic validation would significantly reinforce the arguments. However, we are limited in conducting further experiments since the grant has been completed and the Sernagor lab is in the process of shutting down following the passing of the PI.

      To address the omission of mechanistic validation in the paper, we will add text to the discussion highlighting the need for deeper investigation into the causality of the developmental processes described herein.

      M. Frank et al., Neuronal connexin-36 can functionally replace connexin-45 in mouse retina but not in the developing heart, J. Cell Sci. 123, 3605 (2010).

    1. eLife Assessment

      This important study deepens our understanding of how populations of a given species may diverge in their molecular and physiological patterns as a result of adaptation to different thermal regimes. By approaching this question from multiple directions, the authors provide convincing evidence for adaptive changes in three strains of the diamondback moth after only three years of experimental evolution. This work will be of interest to anyone working on the response of pest species to environmental change and to workers on adaptive evolution in general.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Lei and co-workers aim to uncover the genetic underpinnings of thermal adaptation across three strains of the diamondback moth (Plutella xylostella) through experimental evolution over three years under three different thermal regimes. They identify systematic differences in trait responses (e.g., survival, fecundity), metabolic profiles, gene expression, and in the amino acid sequence of the PxSODC gene, among others. These results suggest that the diamondback moth has a strong potential for rapid physiological adaptation to different thermal regimes. Overall, this is a comprehensive and generally well-executed study that addresses an important question in the face of ongoing climate change.

      Strengths:

      The authors employ multiple approaches to identify signatures of thermal adaptation across the three strains, such as trait performance comparisons, metabolomics, transcriptomics, and amino acid sequence comparisons. All these different angles form a convincing picture of the underlying factors that underpin thermal adaptation in this experimental system. The manuscript is also generally well written and easy to understand.

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, the authors set out to better understand the genetic mechanisms underlying thermal adaptation in insects. They experimentally evolved diamondback moth (Plutella xylostella) populations - a pest species with a wide distribution - under both hot (12h:12h 32°C/27°C) and cold (15°C/10°C) thermal conditions, and conducted phenotypic assays and metabolic and transcriptomic profiling to analyze how populations changed to deal with this thermal stress compared to the nonevolved ancestral population (constant 26°C). Phenotypic assays showed that evolved hot populations had increased survival at high temperatures (42-43°C) while evolved cold populations had lower freezing points compared to the ancestral population. When measured at the constant 26°C conditions, metabolic and transcriptomic profiles of 3rd instar larvae from the evolved population were distinctive from the ancestral population, with a set of overlapping metabolic and transcriptomic pathways that were significantly differentially expressed in both hot and cold evolved populations compared to the ancestral. The authors narrowed down this set of candidate genes further by focusing on genes with high expression levels overall, whose expression profile was correlated with differentially expressed metabolites, and that contained mutants in both hot and cold strains. From this set, they chose the PxSODC gene for further functional validation, as it has previously been shown to be involved in the response of insects to abiotic stress with its antioxidative role in cellular defense. At the constant 26°C, this gene showed lower expression across development in evolved strains compared to the ancestral population, while it showed similar expression patterns under thermal stress. Knockdown of PxSODC resulted in decreased survival rates at high temperatures and higher freezing points compared to the ancestral population. Based on this validation, the authors hypothesize that the non-synonymous mutation in the PxSODC gene that they found in the cold and hot evolved populations might alter the conformation of the PxSODC protein, increasing enzyme capacity. Their experimental evolution experiment furthermore indicates the capacity of the pest species, the diamondback moth, to adapt to a wide range of temperatures, providing insights into its capacity for global dispersal.

      Strengths:

      (1) The authors did a tremendous amount of work to characterize the mechanisms underlying thermal adaptation in the diamondback moth, artificially selecting populations for three years in the lab and characterizing how they evolved as a result at different biological levels: from phenotypes in different life stages, to larval metabolites and gene transcription, to functionally validating how one of the resulting gene candidates influences the capacity to deal with thermal stress.

      (2) The paper identifies and provides further evidence for candidate genetic mechanisms that might be particularly important for thermal adaptation in insects, including lipid metabolism, oxidoreductase activity, and DNA methylation. It is furthermore interesting that the authors found similar mechanisms to be involved in both the adaptation to cold and hot environments. Their functional validation of some of the genes involved in these mechanisms is very useful to understand how these genes might be causally involved in insect thermal adaptation.

      (3) The paper also has applied value: the diamondback moth is a pest species with a wide distribution, so understanding its adaptive capacity to different thermal environments is important for predicting the prevalence and potential further range expansion of this species under future climate change.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This important study deepens our understanding of how populations of a given species may diverge in their molecular and physiological patterns as a result of adaptation to different thermal regimes. By approaching this question from multiple directions, the authors provide solid evidence for adaptive changes in three strains of the diamondback moth after only three years of experimental evolution, and support the causal involvement of the PxSODC gene in thermal adaptation to both cold and hot temperatures. This work would benefit from more sophisticated phylogenetic analyses, better statistical support, and a more detailed discussion of the differences in the three strains at the pathway level.

      We sincerely thank the editors for this positive and constructive assessment. In the revised manuscript, we have addressed the highlighted points by: (1) re-inferring the phylogenetic tree of the PxSODC gene using a model-based Maximum Likelihood method (IQ-TREE) to ensure a robust evolutionary analysis; (2) substantially expanding the description of our statistical methods across all data types to ensure reproducibility and clarify multiple-testing corrections; and (3) adding a more detailed discussion of the pathway-level differences between the hot and cold strains, particularly integrating how their distinct transcriptomic responses align with their shared metabolic adjustments and phenotypic traits.

      Reviewer #1 (Public review):

      (1) The authors identify pathways that are enriched in different strain comparisons (Figure 3E), but do not provide a detailed interpretation of these results. It would be great if the authors could explain in more detail how the physiological processes of a cold-adapted strain of this species may differ from those of a warmer-adapted strain.

      We agree. We have addressed this by directly integrating our pathway enrichment results (Figure 3E) with the observed life-history phenotypes (concurrently addressing Reviewer 2's Comment 36a). We expanded the Discussion to explain that while both strains share convergent adjustments in core pathways (e.g., lipid metabolism for energy reallocation), their specific physiological strategies differ. The cold-adapted strain relies on broader transcriptional reprogramming to maintain homeostasis and support extended longevity/cold hardiness, whereas the hot-adapted strain utilizes broader metabolic rewiring to actively fuel its accelerated development and higher fecundity.

      (2) The authors reconstruct a phylogenetic tree of the PxSODC gene using the neighbor-joining algorithm. The limitations of this algorithm have been known for many years now, especially for sequences separated by long evolutionary distances. According to Wang et al. (2016), the last common ancestor of the species shown in Figure S4C occurred 392-350 million years ago. Given this, I would strongly recommend that the authors infer a phylogenetic tree using model-based methods, such as those implemented in RAxML-NG or IQ-TREE. Also, in the absence of a valid outgroup sequence, I would show the gene tree as unrooted or rooted based on the corresponding species tree.

      Agreed. We have re-inferred the phylogenetic tree of the PxSODC gene using the model-based Maximum Likelihood (ML) method implemented in IQ-TREE. As recommended, in the absence of a valid outgroup sequence, the revised tree is now presented as unrooted. Supplemental Figure S4C (Figure 5-figure supplement 1C) and the corresponding text in the manuscript have been updated.

      (3) There is a key piece of the puzzle that is currently missing: the structural mechanism behind the mutational effects described in this study (e.g., Figure 5). The authors could leverage AlphaFold to generate structural models of different mutants and conduct molecular dynamics simulations to examine their conformational dynamics.

      We thank the reviewer for this excellent suggestion. We generated AlphaFold structural models of the wild-type (WT) and mutant (MU) PxSODC proteins and conducted 100 ns molecular dynamics (MD) simulations using GROMACS 2022.3 at three physiologically relevant temperatures: 15°C (cold stress), 26°C (favorable baseline), and 32°C (heat stress). Using 26°C as the physiological baseline, three key structural parameters support enhanced thermostability of the mutant protein (Figure 5–figure supplement 3). First, RMSD analysis revealed that under heat stress (32°C), the WT underwent severe conformational drift (RMSD increased from the 26°C baseline of 1.62 to 2.49, an increase of 0.87), while MU remained remarkably stable (from 1.59 to 1.66, an increase of only 0.07). Second, MU possessed a significantly more compact structure, with lower SASA values at 15°C (118.39 vs. 127.29 nm²) and 26°C (113.82 vs. 125.61 nm²), indicating optimized hydrophobic core packing. Third, the intramolecular hydrogen bond network of MU demonstrated dual stress resistance: under cold stress, MU actively increased hydrogen bonds from its baseline (113→119), whereas WT lost bonds (117→112); under heat stress, MU fully maintained its bond count (113→113). These results provide a direct structural mechanism for the enhanced catalytic efficiency of the mutant SOD at lower expression levels.
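      As an illustrative aside, the differences quoted in the response above can be re-derived with a short Python sketch. It uses only the numbers stated in that paragraph; the dictionary layout and the helper name `delta` are assumptions made for this example and are not the authors' analysis code. RMSD units are not given in the response, so none are assumed here.

```python
# Values copied from the author response; nothing here is new data.
rmsd = {
    "WT": {"26C": 1.62, "32C": 2.49},
    "MU": {"26C": 1.59, "32C": 1.66},
}
sasa_nm2 = {
    "15C": {"WT": 127.29, "MU": 118.39},
    "26C": {"WT": 125.61, "MU": 113.82},
}
hbonds = {
    "WT": {"26C": 117, "15C": 112},
    "MU": {"26C": 113, "15C": 119, "32C": 113},
}


def delta(values, baseline, stressed):
    """Change from the baseline condition to the stressed condition."""
    return round(values[stressed] - values[baseline], 2)


# RMSD drift from the 26C baseline to 32C, per protein variant.
rmsd_drift = {name: delta(v, "26C", "32C") for name, v in rmsd.items()}

# How much smaller the mutant SASA is than the wild type at each temperature.
sasa_gap = {t: round(v["WT"] - v["MU"], 2) for t, v in sasa_nm2.items()}

# Change in intramolecular hydrogen bonds from 26C to 15C.
hbond_cold = {name: delta(v, "26C", "15C") for name, v in hbonds.items()}

print(rmsd_drift, sasa_gap, hbond_cold)
```

      The recomputed RMSD increases match the 0.87 and 0.07 stated in the response, and the hydrogen-bond changes match the quoted 117→112 and 113→119 transitions.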

      Reviewer #1 (Recommendations for the authors):

      (4) The experimental evolution component of this study is described in the text as lasting for three years. It would help if the number of generations per strain were also reported.

      We have added the number of generations per strain. Over the three-year period, the hot strain completed ~75 generations and the cold strain ~15 generations. The ancestral strain was continuously maintained at 26°C throughout this period. The revised text has been updated in both the Introduction and Materials and Methods.

      (5) In Figure 3B: There is a typo in the word “Statistics”.

      Corrected. The typo in “Statistics” in Figure 3B has been fixed.

      (6) In Figure 3D: “CS” appears twice.

      Corrected. The duplicated “CS” label in Figure 3D has been replaced with the correct label.

      (7) Figure 4: This is not accessible to colorblind readers, who will clearly not be able to tell each color apart. As a non-colorblind person, I, too, have trouble figuring out which color label in panel B corresponds to which color in panel A. For example, I do not know off the top of my head how 'blue' differs from 'midnightblue', 'royalblue', or 'skyblue'. I recommend that the authors replace colors with identifiers, such as 'g1' for group 1 and so on.

      We appreciate this suggestion. We have replaced all color-based module labels with alphanumeric identifiers (M1, M2, M3, etc.) and added a corresponding legend. The main text and supplementary materials have been updated accordingly.

      (8) Lines 246-247: "Its secondary structure mainly consisted of strands, helices and coils." This sentence is redundant. These three are the only possible secondary structural elements, according to most bioinformatics tools such as PSIPRED, which the authors used. This sentence would be more useful if the authors could report the percentage breakdown of each secondary structural element.

      We have removed the redundant sentence and updated the text to report the specific percentage breakdown of the secondary structural elements based on our PSIPRED predictions (approximately 55.24% random coils, 16.19% alpha helices, and 28.57% extended strands). The revised text has been updated in the Results section.
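      For readers unfamiliar with how such a breakdown is obtained, the following minimal Python sketch computes percentages from a per-residue secondary structure string using the three-state H/E/C alphabet (helix, strand, coil). The function name `ss_breakdown` and the 20-residue example string are hypothetical and are not the PxSODC prediction.

```python
from collections import Counter


def ss_breakdown(ss):
    """Percentage of helix (H), strand (E) and coil (C) in a 3-state string."""
    counts = Counter(ss)
    total = len(ss)
    return {state: round(100 * counts.get(state, 0) / total, 2) for state in "HEC"}


# Hypothetical 20-residue prediction, for illustration only.
example = "CCCEEEEECCHHHHHHCCCC"
print(ss_breakdown(example))

# Sanity check on the percentages quoted in the response: they should sum to 100.
reported_total = round(55.24 + 16.19 + 28.57, 2)
print(reported_total)
```

      The three percentages reported in the response sum to 100, as expected for a complete three-state assignment.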

      (9) Lines 260-261: "This suggests that the PxSODC gene can alter its expression pattern and function in response to environmental change...". I find this sentence a bit imprecise. Would it not be more precise to mention that the expression of this gene is regulated by temperature triggers?

      We agree that the original phrasing was imprecise. We have revised the sentence in the manuscript to state: “This suggests that the expression of the PxSODC gene is regulated by temperature triggers, and its altered function contributes to temperature-adaptive evolution in P. xylostella.”

      (10) The data points in Figures S1 and S7 are very small and hard to tell apart without zooming in a lot. Perhaps the authors could change the orientation of those pages to landscape and increase the size of the figures.

      Done. We have changed the orientation of Supplemental Figures S1 (Figure 1-figure supplement 1) and S7 (Figure 5-figure supplement 4) to landscape and increased the size of the figures and individual data points to improve visibility.

      (11) In Figure S2, the panel labeled as 'C' should be 'B' (based on the caption) and vice versa.

      Corrected. The panel labels ‘B’ and ‘C’ in Supplemental Figure S2 (Figure 2-figure supplement 1) have been swapped. The Supplementary Materials have been updated accordingly.

      Reviewer #2 (Public review):

      (1) The paper in its current form is hard to digest and would benefit from improved clarification of the storyline, as well as a tighter integration between the phenotypic, omics, and functional validation data. Currently, it is not always clear what the relevance is of all the reported results, nor why certain decisions were made, or how all the different methods the authors used fit together. For example, the authors functionally validated a second gene, PxDnmt1, but it is unclear why this particular gene was chosen, nor how it relates to their selection regimes when looking at the results obtained with the phenotyping and omics data collection. Seeing how much work the authors did, this makes the paper overwhelming and difficult to read.

      We sincerely appreciate this constructive feedback. In the revised manuscript, we have made significant structural revisions to improve the storyline and logical flow. We have streamlined the Results section (moving extensive descriptive data like life table curves and detailed metabolomics of mutant strains to Appendices 1-3) to focus on the key findings. Furthermore, we have clarified the logical transitions between experiments. For instance, regarding the choice to validate PxDnmt1, we now explicitly explain in the Results that our untargeted metabolomic analysis of the PxSODC mutant strains revealed consistent alterations in 5-hydroxymethyluracil (involved in DNA demethylation) and 5'-deoxyadenosine (a precursor to the primary methyl donor S-adenosylmethionine) across all developmental stages. This specific metabolic signature provided a strong, data-driven hypothesis linking PxSODC function to epigenetic regulation via DNA methylation, prompting us to functionally validate PxDnmt1. By explicitly stating these rationales, the narrative is now much clearer and more cohesive.

      (2) The authors at times stretch their results too far, as the ecological relevance of their study design and results is not clear, limiting the generalizability and value of the results for understanding species' adaptive potential under climate change. For example, the selection regimes used present the minimum and maximum known temperatures at which the species can survive and develop, but it is unclear how the temperatures relate to the natural environment of the source population, to what extent wild populations might experience these temperatures, and whether they would experience them at the extended duration used (12h at max/min temperature). Moreover, I wonder whether the comparisons made would identify the genes that matter under natural conditions, as unevolved populations were kept under constant conditions compared to 12h:12h temperature regimes for the evolved populations, and the metabolic and transcriptomic profiling was done under a constant favorable 26°C rather than under thermal stress in a, as far as I can tell, randomly chosen life stage (larval stage).

      We appreciate the reviewer raising these important points regarding ecological relevance and experimental design. In the revised manuscript, we have added context and acknowledged these limitations in the Methods and Discussion sections. First, regarding ecological relevance: The source population is from Fuzhou, a subtropical region where summer high temperatures frequently exceed 32°C and winter lows can drop below 10°C, making our selection temperatures ecologically relevant extremes for this population. The 12h:12h cycling temperatures were designed to simulate severe but natural diurnal fluctuations.

      Second, regarding constant control vs. cycling regimes: The constant 26°C represents the established optimal developmental temperature and standard laboratory condition for P. xylostella. We acknowledge that comparing cycling selection regimes against a constant control might conflate adaptation to absolute temperature extremes with adaptation to thermal fluctuation itself. We have added this as a caveat in the Discussion. Third, regarding omics profiling conditions: The transcriptomic and metabolomic profiling was conducted under common garden conditions (26°C) specifically to identify constitutive, genetically fixed adaptations resulting from evolutionary selection, rather than immediate physiological plasticity under stress. We have clarified these rationales in the text.

      (3) The paper in its current form does not adequately describe the statistical analyses underlying the results, nor do the authors share their code, making it very hard to judge whether the analyses used are appropriate and the results trustworthy. I have concerns about the inappropriate use of t-tests, the lack of correcting for confounding variables, and the need for multiple testing corrections.

      We sincerely appreciate this concern. In the revised manuscript, we have made substantial improvements to the description of statistical analyses throughout the Methods section:

      (1) Statistical methods for each data type are now described separately and in detail, specifying the tests used, the number and type of comparisons, and sample sizes.

      (2) For metabolomic data, we have clarified that FDR correction was applied alongside multi-criteria thresholds (|log₂ fold change| ≥ 1, VIP ≥ 1, FDR < 0.05). For transcriptomic data, FDR correction (Benjamini and Hochberg, 1995) was applied via DESeq2.

      (3) For WGCNA, we have specified the total number of correlation tests (29 modules × 30 metabolites = 870) and the stringent dual threshold (|r| > 0.8, P < 0.05) used to control for false positives, following standard practice.

      (4) For life table parameters, the paired bootstrap method with 100,000 replications was used for all pairwise comparisons among strains.

      (5) For all other experimental data (qRT-PCR, SOD activity, O₂⁻ levels, survival rates, supercooling/freezing points, etc.), we have specified that t-tests were used only for two-group comparisons, while one-way ANOVA with Tukey's or Tamhane's T2 test was used for three or more groups, with non-parametric alternatives applied when normality assumptions were not met.

      (6) The raw data have been deposited in public repositories (see Data availability), and all statistical procedures are now described in sufficient detail to enable independent reproduction of the results.
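      To make point (2) of the list above concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment followed by the three-part metabolite filter. It is a generic textbook implementation, not the authors' pipeline; the function names and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_differential(log2_fc, vip, fdr):
    """Apply the thresholds quoted above: |log2 FC| >= 1, VIP >= 1, FDR < 0.05."""
    return abs(log2_fc) >= 1 and vip >= 1 and fdr < 0.05


# Hypothetical raw p-values for four metabolites.
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
print(fdr)

# A metabolite passes only if all three criteria hold at once.
print(is_differential(log2_fc=-1.4, vip=1.2, fdr=fdr[0]))
print(is_differential(log2_fc=0.6, vip=1.5, fdr=fdr[3]))
```

      Requiring all three criteria together is stricter than FDR alone: a metabolite with a small adjusted p-value is still excluded if its fold change or VIP score falls below the cutoff.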

      Reviewer #2 (Recommendations for the authors):

      Title

      (4) I don't feel the title adequately captures the work, I would instead of 'adaptive evolution' use 'experimental evolution' and I would not use the word 'underpins' but instead 'indicates', as it is not clear from your work whether the adaptations to the lab conditions you used would be ecologically relevant nor whether they are involved in thermal adaptation in wild populations.

      Accepted. The title has been revised to: “Experimental evolution to thermal stress indicates climate resilience in a cosmopolitan arthropod.”

      Abstract

      (5a) Please add the phenotype results to the abstract.

      We have added key phenotype results to the abstract. The revised text now reads: “The hot strain showed accelerated development, higher fecundity, and increased survival under extreme heat, while the cold strain exhibited lower supercooling and freezing points, indicating enhanced cold hardiness.”

      (6b) The Abstract doesn't really detail the answer to your research question yet: so what insights into the genetic mechanisms underlying thermal adaptation did you gain that are novel?

      We agree. We have revised the Abstract to explicitly highlight the novel genetic and molecular mechanisms we discovered. Specifically, we now detail that thermal adaptation is driven by a coordinated mutational, metabolic, and epigenetic response: (1) an energy-efficient genetic mechanism where non-synonymous mutations in PxSODC enhance superoxide scavenging efficiency, enabling effective oxidative stress management at lower gene expression levels; (2) convergent metabolic adjustments, notably a reduction in lipid metabolism to conserve energy; and (3) epigenetic regulation of thermal tolerance via DNA methylation. The Abstract has been updated accordingly.

      (7c) Line 3: replace 'ectotherms' with 'arthropods' to match the title?

      Done. “Terrestrial ectotherms” has been replaced with “terrestrial arthropods” in the abstract.

      (8d) Line 9: replace 'demographic' with 'life history'?

      Done. “Demographic” has been replaced with “life history” in the abstract.

      Introduction

      (9a) The storyline is a bit unclear. Do you want to focus on the increased threat from insect pests under climate change or on the threat of climate change on insect persistence? Please pick one and adapt your storyline accordingly. I would suggest focusing on the first and talking more about the range extension of pest species under climate change (which would also require adaptation to cold extremes).

      We agree and have refocused the Introduction on the increased threat from insect pests under climate change, emphasizing that range expansion into new regions requires adaptation to both heat and cold extremes. Both the first and second paragraphs have been revised accordingly.

      (10b) Line 31-33: What do you mean by 'shows a positive relationship between the thermal tolerance range and the level of climatic variability'? Are they able to tolerate a larger range of temperatures?

      This sentence has been revised as part of the restructured Introduction, which now focuses on the range expansion of pest species under climate change. The revised text reads: “Such range expansion requires adaptation not only to warmer conditions in existing habitats but also to cold extremes encountered during colonization of higher latitudes or elevations (Harvey et al., 2020).”

      (11c) Line 33-35: Is this information relevant here?

      Agreed. This sentence has been removed as part of the restructured Introduction, which now focuses on the threat of pest range expansion under climate change.

      (12d) Line 55-56: What exactly do we not know yet about the mechanisms that enable thermal adaptation that you aim to fill in this paper? Please rephrase your knowledge gap to be more concrete (e.g., "but we do not yet know how...").

      We have rephrased the knowledge gap to be more concrete and aligned with the revised storyline. The revised text now reads: “...we do not yet know how long-term thermal selection drives coordinated changes across gene function, metabolic networks, and life history traits to enable thermal adaptation and range expansion in pest species.”

      (13e) Line 57: Also, here, the storyline is unclear. Why did you use the diamondback moth as your model species? You provide many different reasons, but it would help if you emphasized one reason that is in line with whichever storyline you want to focus on: is it because it is an insect pest that can tolerate a wide range of temperatures?

      We have streamlined this paragraph to focus on the primary rationale: P. xylostella is a globally distributed pest that thrives across a wide range of thermal environments, making it an ideal model for studying the genetic mechanisms of thermal adaptation. Supporting details on genomic resources are retained briefly as they enable the multi-omics approach used in this study.

      (14f) Line 65: Demonstrated how? Please give a short summary of the evidence for their genetic capacity to tolerate future climates.

      We have added a brief summary of the evidence. Specifically, genome-wide SNP analysis of field populations from 114 locations across diverse biogeographical zones revealed climate-adaptive genetic variability, indicating that P. xylostella can tolerate projected future climates in most regions (Chen et al., 2021).

      (15g) Line 72: What does 'Age-stage' mean? Should it read 'Aged-staged'?

      “Age-stage, two-sex life table” is an established demographic method developed by Chi (1988) that simultaneously accounts for both age and developmental stage in both sexes. This is a standard term in the field (Chi et al., 2020), so we have retained the original wording but added a brief clarification upon first use.

      (16h) Line 78-80: This needs a bit more explanation. Why does an increased ability to scavenge superoxide anions affect adaptability under extreme temperature environments?

      We have added a brief explanation. Extreme temperatures induce oxidative stress by elevating intracellular reactive oxygen species (ROS), including superoxide anions, which can damage cellular structures. Enhanced scavenging capacity thus helps maintain cellular homeostasis under thermal stress.

      (i) Line 82-86: Please be more precise. What novel insights did you gain about the genetic mechanisms underlying thermal adaptation?

      We have revised this sentence to more precisely summarize the novel insights, encompassing both the multi-omics findings and the functional validation of PxSODC.

      Results

      (18a) The results section is very long and presents an overload of information at the moment, overwhelming the reader. Consider moving some sections to the Supplements (for example, a large part of the phenotypic data that cannot be linked to the omics data and the metabolic profiling of the mutant strains) or leave them out of the paper altogether.

      We agree that the Results section was too dense. We have streamlined it by moving the following content to the Supplementary Materials:

      (1) Detailed age-stage survival and fecundity curve data for the ancestral, hot and cold strains (Supplementary Text S1).

      (2) Detailed life table analysis of the PxSODC mutant strains (Supplementary Text S2).

      (3) Detailed untargeted metabolomic profiling of the SODC-MU mutant strains across developmental stages (Supplementary Text S3).

      The main text now retains only the key life history comparisons, extreme temperature tolerance results, omics-based evidence linking transcriptomics and metabolomics, functional validation of PxSODC, and the DNA methylation findings, with brief summaries and cross-references to the Supplements for supporting details.

      (19b) Please also provide the effect sizes for the different effects you report, for example, how many degrees difference was there between ancestral and cold strains in the supercooling/freezing points, and what was the variation?

      We have added specific effect sizes (mean ± SEM and between-group differences) for all key comparisons throughout the Results section, including preadult duration, stage-specific survival rates under extreme heat, supercooling/freezing points, and SODC-MU mutant strain comparisons. For example, the supercooling points of CS pupae (-23.99 ± 0.18°C) were 0.90°C lower than AS (-23.09 ± 0.26°C), and the freezing points were 2.66°C lower (-14.24 ± 0.61°C vs. -11.58 ± 0.52°C). Please refer to the revised manuscript for all updated values.
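      As a minimal sketch of how such effect sizes can be computed, the snippet below recomputes the between-group differences from the group means quoted above and shows a mean ± SEM calculation. The replicate values are hypothetical and are not the study's data; the function name `mean_sem` is likewise an assumption made for illustration.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, standard error of the mean) for a list of replicate values."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Group means (degrees C) quoted in the response above.
reported = {
    "supercooling": {"CS": -23.99, "AS": -23.09},
    "freezing": {"CS": -14.24, "AS": -11.58},
}

# Between-group difference: how much lower the cold strain (CS) is
# than the ancestral strain (AS), rounded to two decimals.
differences = {
    trait: round(means["AS"] - means["CS"], 2)
    for trait, means in reported.items()
}

# Hypothetical replicate measurements, used only to illustrate mean and SEM.
hypothetical_replicates = [-24.1, -23.8, -24.3, -23.9, -24.0, -23.7]
example_mean, example_sem = mean_sem(hypothetical_replicates)
```

      Reporting the difference alongside each group's mean ± SEM lets a reader judge the size of an effect relative to the variation within groups.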

      (20c) Line 93-94: "Intrinsic and finite rate of increase" of what?

      Clarified. These are population growth parameters. The revised text now specifies “intrinsic rate of increase (r) and finite rate of increase (λ) of the population.”
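      For readers unfamiliar with these parameters, the two are linked by λ = e^r. The sketch below estimates r by bisection, assuming the Euler-Lotka equation with age indexed from 0, a form commonly used in life table analysis. The survival (`lx`) and fecundity (`mx`) schedules are hypothetical, and the function names are illustrative rather than taken from the manuscript or from any software it used.

```python
import math


def euler_lotka(r, lx, mx):
    """Sum of exp(-r * (x + 1)) * lx * mx over ages x = 0, 1, ..., minus 1."""
    total = sum(
        math.exp(-r * (x + 1)) * l * m
        for x, (l, m) in enumerate(zip(lx, mx))
    )
    return total - 1.0


def intrinsic_rate(lx, mx, lo=-1.0, hi=1.0, tol=1e-10):
    """Solve euler_lotka(r) = 0 by bisection; the function decreases as r grows."""
    if euler_lotka(lo, lx, mx) < 0 or euler_lotka(hi, lx, mx) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if euler_lotka(mid, lx, mx) > 0:
            lo = mid  # sum still above 1, so r must be larger
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical age-specific survival and fecundity schedules.
lx = [1.0, 0.9, 0.8, 0.7, 0.5]
mx = [0.0, 0.0, 5.0, 10.0, 5.0]

r = intrinsic_rate(lx, mx)   # intrinsic rate of increase
lam = math.exp(r)            # finite rate of increase
```

      A population grows when r > 0, which is equivalent to λ > 1.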

      (21d) Line 98-99: Please start the paragraph with this summary of the results and then further detail them.

      We have restructured this paragraph by moving the summary sentence to the beginning, followed by the supporting details.

      (22e) Line 100-109: Why did you look at daily survival and fecundity rates? Please add why this is relevant.

      As part of the overall streamlining of the Results section, this paragraph on detailed age-stage survival and fecundity curves has been moved to Supplementary Text S1. A brief justification for their relevance has been added there, noting that these curves capture stage-specific variation in survival and fecundity that summary life table parameters alone may obscure.

      (23f) Line 106: What do HS, AS, and CS stand for? And please provide the statistics for comparison of daily survival rates between the strains.

      We have defined the abbreviations (HS = hot strain, AS = ancestral strain, CS = cold strain) at their first appearance in the Results section. This paragraph on daily survival and fecundity has been moved to Supplementary Text S1, where the abbreviations are also defined. The survival rates reported are the maximum daily survival rates derived from the age-stage specific survival rate curves (s<sub>xj</sub>), and the statistical comparisons among strains are presented in Supplemental Table S1.

      (24g) Line 144-146: Why are these differential metabolites likely to play a crucial role?

      We agree this statement was speculative. It has been removed from the revised manuscript.

      (25h) Line 159-161: Why is a reduction of lipid metabolites evidence for adaptive evolution?

      We have revised this sentence to clarify the reasoning. The reduction in lipid metabolites in both independently evolved hot and cold strains suggests a convergent metabolic response, indicating that lipid metabolism adjustment is a shared adaptive strategy rather than a random change.

      (26i) Line 184-185: It is difficult to judge from Figure 3E the extent of overlap in KEGG pathways between the hot and cold strains. Can you adjust the figure to emphasize that overlap more?

      Agreed. To make the extent of overlap in KEGG pathways between the hot and cold strains easier to see, we have completely redesigned Figure 3E. Instead of presenting two separate panels with unaligned vertical axes, we have consolidated the data into a single back-to-back (mirrored) bar chart with a shared central y-axis.

      (27j) Line 211: Not only the red module, but also the blue and green module correlates with many of the shared differential metabolites.

      We agree. We have revised the text to acknowledge that the blue and green modules also showed strong correlations with shared differential metabolites, while noting that the red module had the highest number of significantly correlated metabolites and was therefore selected for further analysis.

      (28k) Line 215: I would rephrase this as genes being interesting candidates for being involved in thermal adaptation or 'seem to be important for the adaptation of...', as you don't know from these results whether these genes play a critical regulatory role.

      Agreed. We have toned down the language to reflect the correlative nature of these results.

      (29l) Line 233: Do you mean that you further analyzed 15 genes of the 79 identified candidate genes in the previous paragraph?

      Yes, exactly. From the 79 candidate genes, we selected 15 that were both annotated in the genome and had high expression levels (FPKM > 10) for further analysis. We have clarified this in the revised manuscript.

      (30m) Line 238: What does SOD stand for?

      We have spelled out the abbreviation upon first use in this section.

      (31n) Line 254-255: Please provide the stats for this result.

      We have added the specific allele frequencies for each strain. The Leu194-Met194 mutation frequency was determined by direct sequencing of 10 individuals per strain, and the frequencies are now reported in the revised text.

      (32o) Line 303-304: How did you test for enhanced stability to temperature fluctuations? And enhanced compared to what?

      This observation was based on the survival rate data in Figure 5C, where mutant pupae at 43°C showed no significant difference from the ancestral strain, whereas other life stages (eggs, larvae, adults) at 42°C showed significantly reduced survival in the mutant strains. We have revised the text to clarify the comparison.

      (33p) Line 324-326: Why do decreased expression levels demonstrate increased O₂⁻ scavenging capacity? And why is that beneficial for adaptation to thermal stress? Please explain.

      We have revised this sentence to clarify the logic. The non-synonymous mutations in the hot and cold strains likely alter the protein conformation of SOD enzymes, increasing their catalytic efficiency per molecule. This allows effective O<sub>2</sub><sup>-</sup> scavenging at lower expression levels, which is energetically favorable under thermal stress where energy conservation is critical for survival.

      (34q) Line 404-406: I'm confused. Is there a direct link between the gene you knocked out here and the results you presented up until now? How do the reduced levels of 5-methylcytosine relate to the metabolite results you present at the beginning of the paragraph, other than that both could be involved in DNA methylation?

      We have revised this paragraph to clarify the logical chain. Among the three metabolites consistently altered across all developmental stages in the SODC-MU strains, 5-hydroxymethyluracil is involved in dynamic DNA demethylation and 5'-deoxyadenosine is a precursor to S-adenosylmethionine (the methyl donor for DNA methylation). This suggested a link between PxSODC deletion and DNA methylation. To test this, we examined PxDnmt1 expression and activity in the thermally adapted strains and found both were significantly reduced. We then used RNAi to silence PxDnmt1 and confirmed that reduced DNA methylation (lower 5-mC levels) directly impaired thermal tolerance. Thus the connection is: PxSODC deletion → altered methylation-related metabolites → reduced DNA methyltransferase activity → decreased thermal tolerance.

      (35r) Line 410: Saying that your knockdown of a gene that did not directly pop up in any of your other analyses confirms that DNA methyltransferase is associated with the response to thermal selection is a stretch. Please rephrase.

      We agree this was overstated. We have toned down the language to reflect that the RNAi results provide preliminary evidence for a potential role of DNA methylation in thermal tolerance, rather than confirmation.

      Discussion

      (36a) The phenotype data are currently not discussed at all. Please add it to the discussion and try to integrate it more with the omics data you collected.

      We agree. To provide a cohesive narrative and avoid redundancy, we have addressed this comment in conjunction with our pathway interpretation (please see our response to Reviewer 1, Comment 1). In the revised Discussion, we explicitly integrated our specific phenotypic findings (e.g., accelerated development, increased fecundity, and heat survival in the hot strain; prolonged lifespan and lowered supercooling points in the cold strain) with the distinct transcriptomic and metabolomic profiles. This integration demonstrates how molecular and metabolic rewiring directly underpins the divergent life-history traits without engaging in unwarranted speculation.

      (37b) Line 433-434: I don't think this adequately represents the relevance of your particular study. I would suggest changing it to be more in line with the storyline of understanding the capacity for global dispersal in insect pests under climate change.

      We agree. We have revised this sentence to align with the storyline of pest range expansion under climate change.

      (38c) Line 476: This is a very odd statement; don't all species' genomes have genes encoding proteins involved in thermal adaptation? The reference also doesn't seem to be appropriate. I would suggest deleting this sentence.

      Agreed. This sentence has been removed.

      (39d) Line 483: Please write out SOD the first time you use it in a new section.

      Done. SOD has been spelled out at its first use in the Discussion.

      (40e) Line 544-548: This is a bit too specific to be the last sentence of the discussion. Try to formulate it more broadly in terms of what future research should focus on in general, not just your specific research.

      We agree. We have broadened the final sentence to address future research directions more generally.

      Figures

      (41a) Figure 1A: I don't think t-tests are appropriate here since you are not simply comparing two treatments, but testing for the effects of 5-6 different temperatures. And how did you correct for replicate populations in your analysis?

      Clarified. In Figure 1A, our comparisons are independent pairwise tests between exactly two strains (HS vs. AS) at each specific temperature and time point, making t-tests statistically appropriate. We were not testing for a continuous effect across temperatures. Regarding replicate populations, the individuals used in these assays were drawn from across the six replicate populations per treatment, with each biological replicate (n = 6, with 20 individuals per replicate) comprising individuals pooled from across the replicate populations to account for inter-population variation. We have clarified this in the revised figure legend.

      (42b) Figure 1B, Figure 5D, Figure 7: bar graphs are used for count data, so do the data represent the number of individuals with a certain trait value? If they are instead showing the mean of the population/treatment group, please use mean points ± standard errors instead.

      Accepted. The data in these figures represent continuous physiological traits (e.g., supercooling/freezing points) showing the mean of the populations, rather than count data. To align with current data visualization standards for continuous variables and to provide full transparency of the underlying data distribution, we have replaced the bar graphs in Figures 1B, 5D, and 7 with scatter plots. These revised figures now display the mean ± SEM overlaid with all individual biological replicate data points.

      (43c) Figure 3B: There is a typo in the graph, it reads 'Stattistics' instead of 'Statistics'.

      Corrected. The typo ‘Stattistics’ in Figure 3B has been fixed.

      (44d) Figure 3C: I don't understand what the colors of the graph mean here. Is it the average differential expression of each replicate compared to the ancestral?

      Clarified. We have updated the figure legend to explain that the colors represent the Pearson correlation coefficient (r) between pairs of biological replicates, indicating the degree of transcriptomic similarity among samples.

      Methods

      (45a) Please start each new methods paragraph with the purpose of the method/analysis, for example, "To investigate XX, we used method X to measure X". It is at the moment hard to understand why certain things were done.

      We agree. We have revised each Methods paragraph to begin with a clear statement of purpose, so that the rationale for each analysis is immediately apparent. All changes are shown in the revised manuscript.

      (46b) Line 575-578: Why were the selection regimes with cycling temperatures and the control with constant?

      The cycling temperatures in the hot (32°C/27°C) and cold (15°C/10°C) regimes were designed to simulate diurnal temperature fluctuations (12h light/12h dark) that more closely reflect natural thermal environments. The control was maintained at a constant 26°C, which is the established optimal developmental temperature for P. xylostella (Liu et al., 2002) and represents the standard laboratory rearing condition. We acknowledge this asymmetry and have added a justification in the revised manuscript.

      (47c) Line 581: How many generations was the ancestral population kept in the lab before the start of the selection experiment? And for how many generations were the populations selected?

      The ancestral population was maintained in the laboratory for approximately 170 generations (from July 2012 to the start of the selection experiment) before the thermal selection began. The hot strain was selected for ~75 generations and the cold strain for ~15 generations over the three-year experiment. We have added this information to the revised manuscript.

      (48d) Line 585-586: I don't understand what you mean by randomly selecting six replicate populations per treatment for downstream experiments when you only had six replicate populations per treatment to begin with (as detailed in Line 574)?

      We apologize for the confusion. All six replicate populations per treatment were used for downstream experiments. We have corrected this sentence to remove the misleading “randomly selected” wording.

      (49e) Line 590: Were these 90 eggs also randomly selected, like for the individual life tables? And were these kept at the baseline temperature conditions?

      Yes, the 90 eggs were randomly selected and maintained under the baseline favorable temperature (26°C). We have clarified this in the revised manuscript.

      (50f) Line 606: Which life history and population fitness parameters were calculated?

      We have specified all parameters calculated in the revised manuscript.

      (51g) Line 609: Link to software doesn't work.

      We have updated the software link to the current working URL.

      (52h) Line 611: Please spell out what 'BT' stands for.

      Done. “BT” has been spelled out as “bootstrap” upon first use.

      (53i) Line 612-613: How many tests did you do? Did you correct for multiple testing? Using what method?

      The paired bootstrap method implemented in TWOSEX-MSChart inherently accounts for multiple pairwise comparisons through 100,000 bootstrap replications. We have clarified the scope of comparisons in the revised manuscript.
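      For orientation, a generic paired bootstrap comparison of two groups can be sketched as follows. This is an illustrative assumption about the general technique, not a reproduction of the TWOSEX-MSChart implementation. The data, the default of 2,000 resamples, and the function names are hypothetical. The sketch pairs bootstrap estimates from the two groups, takes their differences, and reports a percentile confidence interval, with an interval that excludes zero read as a significant difference.

```python
import random
import statistics


def bootstrap_means(sample, n_boot, rng):
    """Resample with replacement n_boot times and return the mean of each resample."""
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]


def paired_bootstrap_ci(sample_a, sample_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval of the paired bootstrap differences (a - b)."""
    rng = random.Random(seed)
    boots_a = bootstrap_means(sample_a, n_boot, rng)
    boots_b = bootstrap_means(sample_b, n_boot, rng)
    # Pair the i-th bootstrap estimate of each group and take the difference.
    diffs = sorted(a - b for a, b in zip(boots_a, boots_b))
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical trait values for two groups.
group_a = [10.1, 10.4, 9.8, 10.2, 10.0, 10.3]
group_b = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]

ci_lower, ci_upper = paired_bootstrap_ci(group_a, group_b)
differs = ci_upper < 0 or ci_lower > 0  # interval excludes zero
```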

      (54j) Line 620-621: What does biological replicate mean here? Individual eggs / larvae / pupae / adults, or were all or some life stages pooled? Also, you now only detailed which samples were collected for metabolomic profiling, were the same samples used for transcriptomic profiling, or a subset?

      Each biological replicate consisted of pooled individuals at the same developmental stage. The same sample collection strategy was used for both metabolomic and transcriptomic profiling, but from independent biological replicates (six for metabolomics, three for transcriptomics). We have clarified this in the revised manuscript.

      (55k) Line 637: Also here, how many tests did you do? Were p-values corrected for multiple testing? Using what method?

      Differential metabolites were identified through pairwise comparisons using Student's t-test with FDR correction for multiple testing. A multi-criteria threshold of |log<sub>2</sub>Fold Change| ≥ 1, VIP ≥ 1, and FDR < 0.05 was applied. This approach was used for all metabolomic comparisons, including HS vs. AS, CS vs. AS, and SODC-MU vs. AS. We have clarified this in the revised manuscript.
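      The multi-criteria filter described above can be sketched as follows. The response does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption. The metabolite records, field names, and function names are also hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(metabolites):
    """Keep metabolites with |log2 FC| >= 1, VIP >= 1 and FDR < 0.05."""
    fdr = benjamini_hochberg([m["p"] for m in metabolites])
    return [
        m["name"]
        for m, q in zip(metabolites, fdr)
        if abs(math.log2(m["fold_change"])) >= 1 and m["vip"] >= 1 and q < 0.05
    ]


# Hypothetical records: fold change versus the ancestral strain, VIP score, raw p-value.
metabolites = [
    {"name": "M1", "fold_change": 4.0, "vip": 1.5, "p": 0.001},
    {"name": "M2", "fold_change": 0.25, "vip": 1.2, "p": 0.004},
    {"name": "M3", "fold_change": 1.5, "vip": 2.0, "p": 0.002},  # fails |log2 FC|
    {"name": "M4", "fold_change": 3.0, "vip": 0.8, "p": 0.003},  # fails VIP
    {"name": "M5", "fold_change": 2.5, "vip": 1.1, "p": 0.20},   # fails FDR
]

selected = select_differential(metabolites)
```

      Requiring all three criteria at once means that a metabolite with a large fold change but a weak contribution to group separation (low VIP) is still excluded.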

      (56l) Line 662: And here: how many tests did you do? Did you correct for multiple testing? Using what method?

      In the WGCNA analysis, Pearson correlations were calculated between each of the 29 module eigengenes and each of the 30 common differential metabolites, giving a total of 29 × 30 = 870 correlation tests. Following standard WGCNA practice, rather than applying FDR correction, we used a stringent dual threshold of |correlation coefficient| > 0.8 and P < 0.05 to identify significant module-metabolite associations, which effectively controls for false positives (Langfelder and Horvath, 2008). We have clarified this in the revised manuscript.
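      The test count and the dual threshold can be made concrete with a short sketch. The (module, metabolite, r, P) tuples below are hypothetical placeholders for precomputed correlation results, and `significant_pairs` is an illustrative name, not a WGCNA function.

```python
N_MODULES = 29       # module eigengenes
N_METABOLITES = 30   # common differential metabolites

# One Pearson correlation per module-metabolite pair.
n_tests = N_MODULES * N_METABOLITES


def significant_pairs(results, r_cut=0.8, p_cut=0.05):
    """Keep pairs that pass both thresholds: |r| > r_cut and P < p_cut."""
    return [
        (module, metabolite)
        for module, metabolite, r, p in results
        if abs(r) > r_cut and p < p_cut
    ]


# Hypothetical precomputed results: (module, metabolite, r, P).
results = [
    ("red", "M1", 0.91, 0.0004),
    ("blue", "M2", -0.85, 0.002),
    ("green", "M3", 0.62, 0.03),   # fails the |r| threshold
    ("red", "M4", 0.82, 0.06),     # fails the P threshold
]

kept = significant_pairs(results)
```

      Negative correlations are retained because the filter is applied to the absolute value of r.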

      (57m) Line 663: How did you select these modules? The ones that significantly correlated with differential metabolites? Why did you not use the phenotype data here?

      Modules were selected based on significant correlations (|correlation coefficient| > 0.8, P < 0.05) with differential metabolites shared between the hot and cold strains. We chose metabolites rather than phenotype data as the trait input for WGCNA because metabolites serve as intermediate molecular phenotypes that bridge gene expression and organismal phenotypes, providing a more direct link to the underlying regulatory mechanisms. This approach allowed us to identify gene modules most closely associated with the metabolic changes driven by thermal adaptation, which could then be connected to the observed life history and fitness divergence.

      (58n) Line 666: move RNA extraction details to before RNAseq methods description.

      Done. The “RNA extraction and cDNA synthesis” section has been relocated to before the “Transcriptomic profiling” section for better logical flow.

      (59o) Line 836: This paragraph describing the statistics is very short, and it is unclear to what data the described analyses apply. As the different types of data are very different, I expect the analyses to differ as well. Please describe the statistical analyses for each data type in more detail, specifying what tests you used, which, and how many comparisons were performed.

      We agree. The statistical methods for life table analysis, metabolomics, and transcriptomics have been detailed in their respective method sections. We have expanded the Data analysis section to specify the statistical tests for the remaining experimental data.

      (60p) Line 837: Please include your SPSS scripts to ensure the reproducibility of your results.

      The statistical analyses in SPSS were performed using the graphical user interface. As all statistical tests, parameters, and comparison groups have been described in detail in the revised Methods section, and the raw data have been deposited in public repositories (see Data availability), we believe the analyses are fully reproducible. We are happy to provide additional details if needed.

    1. eLife Assessment

      This fundamental work demonstrates that ABHD6 regulates AMPAR gating kinetics in a TARP γ-2-dependent manner. The evidence in this study is compelling. This study will be of interest to readers in the field of synaptic transmission.

    2. Reviewer #1 (Public review):

      Summary:

      This research sheds light on the nuanced role of ABHD6 in regulating AMPARs, highlighting its interaction with TARP γ-2 as a critical factor in modulating receptor gating kinetics. It is crucial to understand that although ABHD6 alone does not alter AMPAR kinetics, its presence alongside TARP γ-2 accelerates AMPAR deactivation and desensitization, thereby affecting synaptic transmission dynamics.

      Strengths:

      Important findings in the research include:

      - ABHD6 does not affect the gating kinetics of GluA1 and GluA2(Q) homomeric receptors independently.
      - In the presence of TARP γ-2, ABHD6 accelerates deactivation and desensitization of these receptors, regardless of their splicing or editing isoforms.
      - The effect is consistent for both homomeric GluA1 and GluA2(Q) receptors and heteromeric GluA1i/GluA2(R)i-G receptors.
      - The recovery from desensitization of GluA1 with the flip splicing isoform is slowed by ABHD6 in the presence of TARP γ-2.

    3. Reviewer #2 (Public review):

      Summary:

      Cong et al. investigated the regulatory effects of ABHD6 on AMPARs. The authors performed adequate electrophysiology recordings to show the exact pattern of this regulation and covered major critical points.

      Strengths:

      The authors have performed high-quality ephys recordings and examined all potential regulatory aspects of ABHD6 on AMPARs. This is important to understand the AMPAR functions.

      Weaknesses:

      (1) The authors discussed CNIH-2 extensively in lines 92-110 of the Introduction but did not perform related experiments. I suggest they move this part to the Discussion, where they also discuss the roles of CNIH.

      (2) The authors need to report the "n" for all the experiments they have presented in this manuscript. How many cells were recorded in each condition? How many batches? This information has to be in all of the figure legends, but it is missing except Fig. 4.

      (3) One question is what the physiological meanings of this regulatory effect are. The authors may consider adding some discussions.

      (4) About statistics. The authors need to add more details and make sure their statistics are sound. For example, they also need to check the equality of variances. In their Table EVs, where the P values are reported, the authors need to report which statistical test they used (one-way ANOVA, K-W test, or others) and the exact post-hoc test type for each comparison. For one-way ANOVA, report the F values together with the P values in all figure legends.

      (5) Fig. 3J, the authors need to correct the label of the Y axis. It is shifted.

      Comments on revised version.

      In the revised manuscript, the authors have addressed all my concerns. The manuscript has been substantially strengthened by additional data and discussion.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This research sheds light on the nuanced role of ABHD6 in the regulation of AMPARs, highlighting its interaction with TARP γ-2 as a critical factor in modulating receptor-gating kinetics. It is crucial to understand that while ABHD6 alone does not alter AMPAR kinetics, its presence alongside TARP γ-2 leads to accelerated deactivation and desensitization of AMPARs, impacting synaptic transmission dynamics.

      Strengths:

      Important findings in the research include:

      ABHD6 does not affect the gating kinetics of GluA1 and GluA2(Q) homomeric receptors independently.

      In the presence of TARP γ-2, ABHD6 accelerates deactivation and desensitization of these receptors, regardless of their splicing or editing isoforms.

      The effect is consistent for both homomeric GluA1 and GluA2(Q) receptors and heteromeric GluA1i/GluA2(R)i-G receptors.

      The recovery from desensitization of GluA1 with the flip splicing isoform is slowed by ABHD6 in the presence of TARP γ-2.

      We are grateful for the reviewer's positive comments. It is encouraging to receive remarks such as “This research sheds light on the nuanced role of ABHD6 in the regulation of AMPARs”.

      Weaknesses:

      However, the study focuses on specific receptor subunits and isoforms, which may not fully represent the diversity of AMPAR compositions found in vivo (e.g. though the authors have claimed that TARP γ-2 failed to increase GluA3-induced currents significantly, the effect on GluA4 or the explanation was missing). Further research is needed to explore the implications of these findings in more complex neuronal environments.

      We thank the reviewer for raising this point. To investigate whether ABHD6 regulates AMPAR gating kinetics in neurons, we recorded glutamate-induced currents at –70 mV in ABHD6 knockout neurons. These neurons exhibited significantly slower deactivation and desensitization kinetics (Fig. 6, Table EV7.1, EV7.2). Regarding the diversity of AMPAR subunit compositions, we obtained consistent results for GluA4, which is expressed at higher levels in the cerebellum and brainstem (Fig. 7, EV7, Table EV8.1, EV8.2). Specifically, we observed that ABHD6 accelerates the deactivation and desensitization of homomeric GluA4–TARP γ-2 complexes.

      Reviewer #2 (Public Review):

      Summary:

      Cong et al. investigated the regulatory effects of ABHD6 on AMPARs. The authors performed adequate electrophysiology recordings to show the exact pattern of this regulation and covered major critical points.

      Strengths:

      The authors have performed high-quality ephys recordings and examined all potential regulatory aspects of ABHD6 on AMPARs. This is important to understand the AMPAR functions.

      We greatly appreciate the reviewer’s positive comment on our manuscript and recognition of our quality ephys recordings.

      Weaknesses:

      (1) The authors discussed CNIH-2 extensively in lines 92-110 of the Introduction but did not perform related experiments. I suggest they move this part to the Discussion, where they also discuss the roles of CNIH.

      We thank the reviewer for the suggestion. Accordingly, we have moved the discussion of CNIH-2 to the Discussion section (lines 355–372) of the revised manuscript: “Other key modulators include cornichon family AMPA receptor auxiliary proteins (CNIH-2/3) and GSG1L, which generally slow receptor kinetics in heterologous expression systems (Kato et al., 2010; Schwenk et al., 2012), although their effects in neurons can be context-dependent (Gu et al., 2016; Mao et al., 2017). Additional diversity arises from synapse-enriched proteins such as SynDIG4 and CKAMP44, which exert complex and sometimes opposing effects on different kinetic parameters (Matt et al., 2018; Khodosevich et al., 2014). This diversity comes from the known co-assembly of AMPA receptor subunits (the pore-forming GluA subunit) with three classes of auxiliary proteins, collectively comprising 21 components, most of which are secretory or transmembrane proteins. Importantly, multiple auxiliary subunits (e.g., TARP γ-8 and CNIH-2) can co-assemble within a single AMPAR complex, and their combined presence modulates functional outcomes in ways not predicted by individual subunits alone, underscoring a combinatorial regulatory logic (Shi et al., 2010; Yu et al., 2021; Herring et al., 2013). Given that native synaptic AMPARs predominantly exist as GluA2-containing hetero-oligomers (e.g., GluA1/2, GluA2/3), although homo-oligomers have also been structurally validated, understanding how novel auxiliary proteins such as ABHD6 integrate into this complex framework becomes paramount (Lu et al., 2009; Wenthold et al., 1996; Zhao et al., 2016; Malinow and Malenka, 2002).”

      (2) The authors need to report the "n" for all the experiments they have presented in this manuscript. How many cells were recorded in each condition? How many batches? This information has to be in all of the figure legends, but it is missing except Fig. 4.

      We thank the reviewer for pointing out this weakness. We have added the number of cells and the corresponding batches to every figure and table in the revised manuscript.

      (3) One question is what the physiological meanings of this regulatory effect are. The authors may consider adding some discussions.

      We thank the reviewer for the suggestions. In the revised manuscript, we have included a discussion on the physiological implications of this regulatory effect in lines 386–412, as follows: “Although there is no direct evidence indicating that ABHD6 and TARP γ-2 bind to each other, both are known to associate with AMPA receptors, suggesting the possibility of indirect or regulatory interactions. For example, their relationship could be transient, condition-dependent, or mediated through mechanisms such as conformational changes or steric hindrance (Gill et al., 2011b; Sumioka, 2013; Wei et al., 2017). Studies have reported that scaffold proteins participate in the binding, anchoring, maintenance, and removal of AMPA receptors, either through direct interaction with receptors or through indirect binding via auxiliary subunits (Danielson et al., 2014). Additionally, we extended the same experimental approach to AMPA receptors containing the GluA1 flip subtype together with TARP γ-8. Our results demonstrate that this ABHD6-dependent regulatory mechanism also applies to other TARP family members, including TARP γ-8 (Figure 7, EV7, Table. EV9.1, EV9.2). Our findings indicate that ABHD6 plays a critical negative regulatory role on AMPA receptor function. It suppresses synaptic current amplitude and accelerates the deactivation and desensitization kinetics in a TARP γ-2-dependent manner. By shortening synaptic response duration and reducing total charge transfer, ABHD6 may thereby restrain neuronal excitability and narrow the temporal window for synaptic integration. Loss of ABHD6 function—as observed in our knockout neurons, which exhibit slowed kinetics—could promote excitatory hyperactivity. Thus, as a key “molecular brake” on synaptic excitability, dysregulation of ABHD6 may directly contribute to the pathogenesis of neurological disorders. 
Insufficient braking function may lead to excessive synaptic transmission, strongly correlating with hyperexcitability conditions such as epilepsy. Conversely, overly potent braking might result in synaptic dysfunction, potentially contributing to early synaptic impairment in cognitive disorders like Alzheimer’s disease. Overall, our research highlights ABHD6 as a promising target for novel therapeutic strategies in neurological disorders and provides a solid theoretical foundation for further investigation in this field.”

      (4) About statistics. The authors need to add more details and make sure their statistics are sound. For example, they also need to check the equality of variances. In their Table EVs, where the P values are reported, the authors need to report which statistics they have used, one-way ANOVA, K-W test, or others, and the exact post-hoc test type for each comparison. For one-way ANOVA, report the F values simultaneously with the P values in all figure legends.

      We appreciate your thoughtful advice. Accordingly, we have added a description of the statistical strategy to the revised manuscript (lines 530–536): “Data were first assessed for normality using the D’Agostino–Pearson test (n<50) or the Kolmogorov-Smirnov test (n>50), and for equality of variances using the Brown-Forsythe ANOVA test. Depending on the outcome of these tests, data were analyzed by parametric (one-way ANOVA) or non-parametric methods (Kruskal-Wallis test) followed by Tukey's Honest Significant Difference (HSD) test as a post hoc analysis to determine specific differences among groups. Correlation was evaluated with Pearson correlation analysis. Values of P < 0.05 were considered statistically significant.”
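
      The decision flow quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the names `choose_plan` and `Plan` are hypothetical, and the rule that one-way ANOVA is chosen only when both assumption checks pass is an assumption, since the response says only "depending on the outcome of these tests". The response also does not say how n = 50 is handled, so the sketch treats it as the large-sample case. The p-values of the assumption checks are taken as inputs, so no statistics library is needed.

```python
from dataclasses import dataclass

ALPHA = 0.05  # significance threshold quoted in the response (P < 0.05)


@dataclass
class Plan:
    normality_test: str
    omnibus_test: str
    post_hoc: str


def choose_plan(n, normality_p, variance_p, alpha=ALPHA):
    """Pick the tests named in the response from n and the assumption-check p-values.

    normality_p: p-value from the normality test (hypothetical input).
    variance_p:  p-value from the Brown-Forsythe equality-of-variances test.
    """
    # The response states n<50 and n>50; sending n == 50 to the
    # Kolmogorov-Smirnov branch is an assumption of this sketch.
    normality_test = "D'Agostino-Pearson" if n < 50 else "Kolmogorov-Smirnov"

    # Assumption: parametric analysis requires BOTH checks to be non-significant.
    assumptions_hold = normality_p >= alpha and variance_p >= alpha
    omnibus_test = "one-way ANOVA" if assumptions_hold else "Kruskal-Wallis"

    # The response names Tukey's HSD as the post hoc test in either branch.
    return Plan(normality_test, omnibus_test, "Tukey HSD")


if __name__ == "__main__":
    print(choose_plan(n=20, normality_p=0.40, variance_p=0.30))
    print(choose_plan(n=80, normality_p=0.01, variance_p=0.30))
```

      Keeping the choice of test in one function makes it straightforward to record, for each comparison, which test was applied, which is the information the reviewer asks to see in the Table EVs.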

      (5) Fig. 3J, the authors need to correct the label of the Y axis. It is shifted.

      We thank the reviewer for raising this point. We have corrected the Y-axis label of Fig. 3J in the revised manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The manuscript is well-structured and the findings are presented clearly. While the study addresses multiple isoforms, a more detailed explanation of the isoform-specific effects observed, e.g. the unique behavior of the GluA2(Q)i-G isoform in terms of deactivation, would be beneficial.

      We thank the reviewer for pointing out this weakness. In response, we have added a discussion to the revised manuscript (lines 330–345) that addresses RNA editing as a key regulatory mechanism of AMPAR function beyond subunit composition and splicing variants: “Beyond subunit composition and splicing variants, the function of AMPARs is also finely regulated by RNA editing. Q/R editing enables the conversion of neutral to positively charged residues in the ion-selective filter of the channel, causing impermeability to divalent cations such as Ca<sup>2+</sup>. This not only alters channel conductance and current but also contributes to neuronal dysfunction and excitotoxicity (Kawahara et al., 2004; Kwak and Kawahara, 2004). R/G editing markedly influences receptor desensitization and recovery kinetics, and may modulate interactions with auxiliary proteins, thereby playing a critical role in synaptic plasticity and development (Stern-Bach et al., 1998; Coombs et al., 2012; Wright and Vissel, 2012). The conversion from R to G weakens inter-dimer interactions within the binding domains, leading to structurally more flexible receptors (Lomeli et al., 1994). Furthermore, R/G editing exhibits strong developmental regulation and varies across brain regions and cell types (Geiger et al., 1995). Therefore, in this study, we systematically examined the effect of ABHD6 on different flip/flop splice variants and R/G editing subtypes. Our results demonstrate that ABHD6 also suppresses currents in HEK 293T cells expressing flop splice variants and R/G-edited receptors.”

      The authors should consider discussing potential mechanisms underlying the interaction between ABHD6 and TARP γ-2 in greater depth. This could include hypotheses on how ABHD6 might be influencing TARP γ-2's modulation of AMPARs if applicable (though the authors have mentioned either the potential binding domain of ABHD6 to AMPARs or TARP γ-2 to AMPARs, the proposed direct interaction between ABHD6 and TARP γ-2 is unknown). It's also unclear whether the effect of ABHD6 is specific to TARP γ-2 or is general to other TARP family members.

      We appreciate your suggestion and used affinity chromatography to examine the interaction between ABHD6 and TARP γ-2. Our investigation revealed no direct evidence of physical binding between the two proteins. Accordingly, we have supplemented the discussion in the revised manuscript (lines 386–393) as follows: “Although there is no direct evidence indicating that ABHD6 and TARP γ-2 bind to each other, both are known to associate with AMPA receptors, suggesting the possibility of indirect or regulatory interactions. For example, their relationship could be transient, condition-dependent, or mediated through mechanisms such as conformational changes or steric hindrance (Gill et al., 2011b; Sumioka, 2013; Wei et al., 2017). Studies have reported that scaffold proteins participate in the binding, anchoring, maintenance, and removal of AMPA receptors, either through direct interaction with receptors or through indirect binding via auxiliary subunits (Danielson et al., 2014).”

      Expanding the discussion to include the potential physiological and pathophysiological implications of ABHD6's modulatory effects on AMPAR kinetics would provide a broader context for the findings.

      We thank the reviewer for the suggestion. In the revised manuscript, we discuss the physiological implications of this regulatory effect in lines 386–412: “Although there is no direct evidence indicating that ABHD6 and TARP γ-2 bind to each other, both are known to associate with AMPA receptors, suggesting the possibility of indirect or regulatory interactions. For example, their relationship could be transient, condition-dependent, or mediated through mechanisms such as conformational changes or steric hindrance (Gill et al., 2011b; Sumioka, 2013; Wei et al., 2017). Studies have reported that scaffold proteins participate in the binding, anchoring, maintenance, and removal of AMPA receptors, either through direct interaction with receptors or through indirect binding via auxiliary subunits (Danielson et al., 2014). Additionally, we extended the same experimental approach to AMPA receptors containing the GluA1 flip subtype together with TARP γ-8. Our results demonstrate that this ABHD6-dependent regulatory mechanism also applies to other TARP family members, including TARP γ-8 (Figure 7, EV7, Table. EV9.1, EV9.2). Our findings indicate that ABHD6 plays a critical negative regulatory role on AMPA receptor function. It suppresses synaptic current amplitude and accelerates the deactivation and desensitization kinetics in a TARP γ-2-dependent manner. By shortening synaptic response duration and reducing total charge transfer, ABHD6 may thereby restrain neuronal excitability and narrow the temporal window for synaptic integration. Loss of ABHD6 function—as observed in our knockout neurons, which exhibit slowed kinetics—could promote excitatory hyperactivity. Thus, as a key “molecular brake” on synaptic excitability, dysregulation of ABHD6 may directly contribute to the pathogenesis of neurological disorders. Insufficient braking function may lead to excessive synaptic transmission, strongly correlating with hyperexcitability conditions such as epilepsy. 
Conversely, overly potent braking might result in synaptic dysfunction, potentially contributing to early synaptic impairment in cognitive disorders like Alzheimer’s disease. Overall, our research highlights ABHD6 as a promising target for novel therapeutic strategies in neurological disorders and provides a solid theoretical foundation for further investigation in this field.”

      Some typos:

      p7L144, might miss a word 'of' after 'properties';

      Thank you for the careful reading. We have corrected “the channel properties TARP γ-2-containing AMPA receptors” to “the channel properties of TARP γ-2-containing AMPA receptors” in the revised manuscript.

      p9L178, remove '.';

      Thank you for the careful reading. We have corrected the subheading “ABHD6 accelerated the deactivation of homomeric AMPAR-TARP γ-2 complexes.” to “ABHD6 accelerated the deactivation of homomeric AMPAR-TARP γ-2 complexes” in the revised manuscript.

      p9L195, might be 'deact' instead of 'deac';

      Thank you for the careful reading. We have corrected “τ<sub>w, deac</sub>” to “τ<sub>w, deact</sub>” in the revised manuscript.

      p12L276, might be a missing 'ABHD6' after 'whether'.

      Thank you for the advice. We have added “ABHD6” after “whether” in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 366, grammar mistake. The authors used the expression "In this study, we systematically studies", which should be "study" instead of "studies".

      Thank you for the advice. We have corrected “studies” to “study” in the revised manuscript.

      (2) Line 370, the author used the expression "However, previous studies also found poorly expressed but significant population of GluA1 homomeric receptors in the hippocampus". It looks like "poorly expressed" is somewhat contradictory to "significant". I suggest the authors revise this sentence.

      Thank you for the advice. We have deleted this statement from the revised manuscript.

      (3) Line 407-409. The authors stated, "The flip and flop isoforms were cloned into an IRES-GFP expression vector using polymerase chain reaction (PCR). ...editing variants were generated using PCR". It is impossible to use PCR only to finish all cloning, especially with IRES-GFP. This must be done via restriction enzyme, or Gibson assembly, or another method. The author probably PCRed the isoforms and then put them into the vectors using other methods. The authors need to revise their statement and make it complete and clear.

      We thank the reviewer for their suggestion. In response, we have added a description of the expression vector construction to the revised manuscript (lines 431–437): “The flip and flop isoforms were cloned into an IRES-GFP expression vector using polymerase chain reaction (PCR). Q/R and R/G editing variants were generated by PCR-based cloning and FastCloning. GluA1 and TARP γ-2 were subcloned using EcoRI and SalI sites (Milstein et al., 2007), GluA2 and GluA3 were inserted with XhoI and SalI, and GluA4 was inserted with EcoRI and BamHI. All constructs were verified by restriction mapping and sequencing of PCR-amplified regions.”

      (4) It would help if the authors could show some WB blots or PCR results or other evidence that their transfection was successful, in particular with these many plasmid combinations.

      We thank the reviewer for raising this point. In response, we have included additional experiments in the revised manuscript (lines 138–142): “Immunofluorescence assays and Western blot analysis were performed on cells co-transfected with GluA1, TARP γ-2, and ABHD6. These experiments were conducted to verify co-transfection efficiency and corresponding protein expression. Immunofluorescence results confirmed a high degree of co‑localization among GluA1, TARP γ-2, and ABHD6 (Fig. EV1).”

    1. eLife Assessment

      This is an important study that establishes a zebrafish model of PIK3CA-related overgrowth syndrome. The imaging characterization of the mesodermal, particularly vascular, lesions of the model is compelling. The scRNA-Seq analysis is convincing, revealing key perturbations in the PIK3CA-mutation model, although deeper investigation of the exact mechanism leading to the lesions, as well as validation at different time points, could further strengthen the findings. This work will be of interest to medical biologists working on PROS, and potentially to a broader audience interested in non-cell-autonomous signaling of PIK3CA and its implications in other diseases.

    2. Reviewer #1 (Public review):

      Summary:

      Brunsdon et al. present a zebrafish model of mosaic PIK3CA activation to investigate mechanisms underlying PIK3CA-related overgrowth spectrum (PROS), with a particular focus on non-cell-autonomous mechanisms of tissue overgrowth. The study is timely and addresses an important gap in the understanding of how mosaic activation of PI3K signaling leads to tissue-specific developmental abnormalities.

      Using a Tol2-based mosaic expression system combined with single-cell transcriptomics, the authors provide evidence suggesting that mutant PIK3CA-expressing cells influence surrounding wild-type tissues through indirect signaling mechanisms, contributing to vascular malformations and tissue overgrowth.

      Overall, the work presents an interesting and potentially impactful model for studying mosaic PIK3CA-driven overgrowth and non-cell-autonomous signaling mechanisms. However, several aspects require clarification, additional controls, and improved presentation to strengthen the mechanistic conclusions and overall impact of the study.

      Strengths:

      This study addresses an important and timely question by investigating the mechanisms underlying mosaic PIK3CA activation in the context of PROS, a condition for which developmental mechanisms remain poorly understood. The use of a mosaic zebrafish model is particularly appropriate, as it closely reflects the mosaic nature of PIK3CA mutations observed in patients and allows the investigation of non-cell-autonomous effects.

      Another major strength of the study is the integration of single-cell transcriptomics, which provides valuable insight into potential signaling pathways involved in indirect tissue overgrowth and offers a rich dataset for hypothesis generation. The authors also propose an interesting conceptual framework in which PI3K-activated cells influence surrounding tissues through paracrine signaling, which could have broader implications beyond PROS and contribute to understanding mosaic developmental disorders more generally.

      Finally, the work has potential translational relevance, as identifying mechanisms driving mosaic PI3K activation and non-cell-autonomous signaling could inform future therapeutic strategies for PROS and related conditions.

      Weaknesses:

      Despite these strengths, several aspects of the study require clarification and additional experimentation.

      Major comments:

      (1) The Tol2-based system results in mosaic overexpression of mutant PIK3CA in the presence of endogenous wild-type PIK3CA, making it difficult to determine how co-expression of WT and mutant proteins influences the observed phenotypes. While mosaic expression is relevant to PROS, a complementary approach in which endogenous PIK3CA is knocked out prior to introducing mutant variants would allow clearer interpretation of mutant-specific effects.

      (2) The authors do not clearly describe the validation of editing or integration efficiency. It would be important for the authors to clarify whether sequencing was performed to confirm integration, to quantify the proportion of mosaic expression, and to measure transgene expression levels. These controls would strengthen confidence in the model and interpretation of the results.

      (3) The manuscript would benefit from rescue experiments to strengthen causal conclusions. It remains unclear whether the phenotypes induced by PIK3CA PROS variants can be rescued, either through expression of wild-type PIK3CA, pharmacological inhibition of PI3K signaling, or assessment of developmental reversibility. Such experiments would strengthen the link between PI3K activation and the observed phenotypes.

      (4) The authors propose candidate signaling molecules mediating non-cell-autonomous effects downstream of PI3K hyperactivation; however, these conclusions remain speculative, as no functional validation is provided. Testing selected candidate mediators identified in the RNA-seq dataset would significantly strengthen the mechanistic conclusions.

    3. Reviewer #2 (Public review):

      In this manuscript, Brunsdon et al. aim to study PIK3CA-related overgrowth spectrum (PROS) by establishing a mosaic zebrafish model with overexpression of pik3ca carrying hotspot mutations, coupled with an mScarlet+ reporter. Using fluorescence microscopy, the authors demonstrated that overexpression of pik3ca with a number of hotspot mutations led to mesodermal and particularly vascular malformations in the zebrafish model. Interestingly, they found a paucity of mScarlet+ mutant cells in the vascular lesions, consistent with the finding of low PIK3CA mutation burden in PROS tissue. Such data suggest a non-cell-autonomous effect of PIK3CA mutation. Following this logic, the authors performed single-cell RNA-Sequencing on zebrafish overexpressing WT pik3ca and mutant pik3ca at 19 hpf, and demonstrated widespread transcriptomic perturbations across multiple lineages, including lineage frequencies, key cell pathways, and cell-cell interactions. Importantly, they demonstrate that mScarlet+ cells carrying mutant pik3ca cluster separately from other cell types, do not demonstrate clear lineage identity, and have a general downregulation in signaling components.

      Overall, the conclusions in the manuscript are well-supported by the presented data. The imaging studies are particularly convincing. The transcriptomic analysis generated a list of potential pathways to further investigate and potentially target with future therapeutic interventions. Importantly, this study provides a valuable in vivo model of PROS that: 1) recapitulates key features of PROS (e.g., multiple mesodermal defects, paucity of mutation burden in lesions suggesting non-cell-autonomous interactions); 2) is scalable; and 3) offers direct visualization of lesion development, compatible with time-course live imaging. This model will be valuable to further understand PROS and potentially study other diseases where the PIK3CA pathway is altered (e.g., certain cancers).

      The following are not necessarily weaknesses of the data, but rather suggestions where the manuscript could be further strengthened:

      (1) The model recapitulates the variability of mesodermal lesions in PROS. It would be valuable to utilize this model to further study factors that are associated with the development of more severe lesions (e.g., by comparing samples with more severe lesions to those unaffected despite carrying the mutations, Figure 1F).

      (2) ScRNA-seq analysis could be enriched with a comparison between cells overexpressing mutant pik3ca vs. those overexpressing WT pik3ca.

      (3) In the scRNA-Seq analysis, it is curious that the C0 cluster, enriched with mScarlet+ cells, is found to have downregulated signaling interactions (Fig. 5C), yet exerts a widespread non-cell-autonomous effect. Meanwhile, there is also a noticeable loss of certain lineages (e.g., notochord, Figure 4E) and related cell-cell interactions (e.g., notochord-related interaction, Figure 5A). A deeper exploration of the basis of the non-cell-autonomous effect would be valuable.

      (4) The scRNA-Seq analysis was performed at one time point (19 hpf). Additional analysis (not necessarily by scRNA-Seq) at other time points to study whether findings at 19 hpf are persistent throughout development or undergo dynamic changes (e.g., cell fate/state of mSc+ mutant cells) would be helpful.

      (5) The scRNA-Seq analysis provides a valuable list of perturbed interactions that could be targeted by future therapeutic approaches. Validation of the scRNA-Seq findings with protein-level analysis, and studying the effect of targeting some of the pathways on the disease phenotype, would offer valuable data for the community.

    4. Reviewer #3 (Public review):

      Summary:

      The study "PIK3CA-related overgrowth spectrum (PROS) zebrafish models reveal pan-lineage developmental dysregulation" presents important findings that extend significantly beyond a single subfield, bridging developmental biology, vascular medicine, and cancer-related PI3K signalling. By developing mosaic zebrafish models of PROS and combining live imaging with single-cell transcriptomics, the authors provide compelling evidence for a non-cell-autonomous mechanism of tissue overgrowth, a conceptual shift with meaningful therapeutic implications.

      Strengths:

      The evidence is overall convincing, with methodology appropriate and well-validated relative to the current state of the art; the integration of multiple approaches (in vivo modelling, scRNA-seq, ligand-receptor inference) strengthens the central claims. However, some aspects of the proposed non-cell-autonomous signalling mechanisms remain partly correlative, and direct functional validation of the rewired ligand-receptor interactions would further consolidate the conclusions.

      Weaknesses:

      The transgenic overexpression approach chosen by the authors represents a well-established and effective strategy for generating mosaic models in zebrafish. However, this approach introduces notable limitations: the lack of control over transgene dosage and unknown integration sites may generate non-physiological effects, potentially confounding the interpretation of key findings.

      The authors are certainly aware that alternative approaches (though technically more demanding) could be considered in future studies to further strengthen the model. For instance, a CRISPR/Cas9-mediated knock-in of the pik3ca-PROS allele at the endogenous locus (retaining upstream native regulatory elements with only a minimal promoter in the construct, co-expressed with a fluorescent reporter via P2A) could allow even more physiological, lineage-restricted expression while enabling direct visualisation of mutant cells. Mesodermal specificity could potentially be further refined by driving mosaic Cas9 expression under a pan-mesodermal tbx promoter, restricting editing to the relevant lineage while simultaneously marking mutant cells fluorescently, thus even more closely mimicking the post-zygotic mutational events characteristic of PROS. As a complementary strategy, blastula transplantation experiments using pik3ca-PROS donor cells (ideally co-expressing a distinct fluorescent marker such as mCherry) into fli1:GFP transgenic hosts could provide a powerful and technically consolidated approach to directly visualise and quantify non-cell-autonomous effects on host vasculature, with precise control over mutant cell burden. This combinatorial framework, separating donor mutant cells from host tissue in a two-colour imaging setup, could be particularly compelling for validating the ligand-receptor rewiring predicted by single-cell transcriptomics in future investigations.

      These reflections are offered in the spirit of prospective methodological development and do not diminish the value of the current work, which opens a valuable new avenue for therapeutic investigation, suggesting that targeting indirect overgrowth-propagating signals, alongside PI3K inhibition, deserves serious consideration.

    5. Author response:

      eLife Assessment

      This is an important study that establishes a zebrafish model of PIK3CA-related overgrowth syndrome. The imaging characterization of the mesodermal, particularly vascular, lesions of the model is compelling. The scRNA-Seq analysis is convincing, revealing key perturbations in the PIK3CA-mutation model, although deeper investigation of the exact mechanism leading to the lesions, as well as validation at different time points, could further strengthen the findings. This work will be of interest to medical biologists working on PROS, and potentially to a broader audience interested in non-cell-autonomous signaling of PIK3CA and its implications in other diseases.

      We are delighted that the Editors and Reviewers consider the work of value and that it is interesting to a broad audience. We also appreciate and take on board the areas that the reviewers identify for improvement, and their suggestions on how this could be achieved.

      There are two major pieces of work suggested by the reviewers which we plan to carry out for this manuscript. The first of these is an additional scRNA-seq experiment at a later developmental stage when vascular malformations are established. Through comparison between pik3caPROS, pik3caWT and no-pik3ca injected controls, this would help determine whether the global lineage and transcriptional dysregulation observed at 19 hpf persists over time, and whether the largely inert ‘C0’ cluster of PROS mScarlet<sup>+</sup> cells changes during development (Reviewer 2 comment 3).

      Secondly, we are already optimising rescue experiments with the specific Pik3ca inhibitor alpelisib, which is currently used as a therapy for PROS. Some troubleshooting has been required for the best delivery method and concentration for this to rescue vascular malformations in embryos, and to cause measurable decreases in PI3K signalling at the protein level through Akt and S6 pathways.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Brunsdon et al. present a zebrafish model of mosaic PIK3CA activation to investigate mechanisms underlying PIK3CA-related overgrowth spectrum (PROS), with a particular focus on non-cell-autonomous mechanisms of tissue overgrowth. The study is timely and addresses an important gap in the understanding of how mosaic activation of PI3K signaling leads to tissue-specific developmental abnormalities.

      Using a Tol2-based mosaic expression system combined with single-cell transcriptomics, the authors provide evidence suggesting that mutant PIK3CA-expressing cells influence surrounding wild-type tissues through indirect signaling mechanisms, contributing to vascular malformations and tissue overgrowth.

      Overall, the work presents an interesting and potentially impactful model for studying mosaic PIK3CA-driven overgrowth and non-cell-autonomous signaling mechanisms. However, several aspects require clarification, additional controls, and improved presentation to strengthen the mechanistic conclusions and overall impact of the study.

      We thank Reviewer 1 for their support of our work, and constructive and helpful comments. 

      Strengths:

      This study addresses an important and timely question by investigating the mechanisms underlying mosaic PIK3CA activation in the context of PROS, a condition for which developmental mechanisms remain poorly understood. The use of a mosaic zebrafish model is particularly appropriate, as it closely reflects the mosaic nature of PIK3CA mutations observed in patients and allows the investigation of non-cell-autonomous effects.

      Another major strength of the study is the integration of single-cell transcriptomics, which provides valuable insight into potential signaling pathways involved in indirect tissue overgrowth and offers a rich dataset for hypothesis generation. The authors also propose an interesting conceptual framework in which PI3K-activated cells influence surrounding tissues through paracrine signaling, which could have broader implications beyond PROS and contribute to understanding mosaic developmental disorders more generally.

      Finally, the work has potential translational relevance, as identifying mechanisms driving mosaic PI3K activation and non-cell-autonomous signaling could inform future therapeutic strategies for PROS and related conditions.

      Weaknesses:

      Despite these strengths, several aspects of the study require clarification and additional experimentation.

      Major comments:

      (1) The Tol2-based system results in mosaic overexpression of mutant PIK3CA in the presence of endogenous wild-type PIK3CA, making it difficult to determine how co-expression of WT and mutant proteins influences the observed phenotypes. While mosaic expression is relevant to PROS, a complementary approach in which endogenous PIK3CA is knocked out prior to introducing mutant variants would allow clearer interpretation of mutant-specific effects.

      PROS/CLOVES patients co-express endogenous wild-type and mutant PIK3CA in affected cells, which in turn constitute only a small proportion of cells in affected tissues (Madsen et al. 2018). As our intent was strictly to model human PROS/CLOVES (an aim informed by support from and close collaboration with the CLOVES Syndrome Community, a key patient advocacy group), we designed our model to reflect this as closely as possible. It is not clear to us what translational end would be served by expressing mutants in a null background, interesting though this may be. Given our transgenic strategy, we did experiment with overexpressing wildtype pik3ca as a control for some experiments to test whether overexpression of pik3ca itself drives overgrowth phenotypes, without the presence of hotspot PROS mutations (Figure 3D, Supplementary Figure 1A). We found that ubiquitous or mesodermal overexpression of pik3caWT did not cause vascular malformations or cause the ectopic fli1:eGFP endothelial cell phenotype observed when overexpressing pik3caPROS variants. While not precisely addressing the reviewer’s comment, this adds to evidence that increased expression of wildtype pik3ca does not confound the observed gain of function phenotype in the PROS model. 

      (2) The authors do not clearly describe the validation of editing or integration efficiency. It would be important for the authors to clarify whether sequencing was performed to confirm integration, to quantify the proportion of mosaic expression, and to measure transgene expression levels. These controls would strengthen confidence in the model and interpretation of the results.

      We used secondary transgenesis markers, such as the cardiac reporter cmlc2:GFP, as a visual readout of integration efficiency and confirmation of integration. For example, embryos in which >50% of heart cells were GFP+ were taken to indicate efficient Tol2 transgenesis and were included in an experiment, whereas embryos with only 1 or 2 green cardiac cells were assumed to carry negligible levels of transgene and were excluded. Independently of this reporter, we showed an upregulation of pik3ca transcript in PROS mosaics compared to control by scRNA-seq (Figure 4D, Supplementary Figure 4A), confirming that the transgene produces a measurable upregulation of pik3ca.

      We agree that it would be optimal to quantify the transgene expression and copy number for each individual embryo. However, for experiments where phenotypes are scored, hundreds of embryos are injected each time. Therefore, although it would be valuable to quantify the transgene expression and transgene copy number in terms of finding its correlation to phenotype severity, it is not feasible to do this at this scale. In the future, we would like to refine our model to include more sophisticated inducible transgenic models, with stable integration sites to control for integration site/copy number variation. However, for this manuscript, the priority as set out by our charity funders was to generate and characterise a pik3caPROS model that could rapidly test different patient hotspot alleles as well as tissue-specific promoter drivers. Thus, we chose this simpler model for now, but we would be very interested in continuing this work with a more refined model for one or two mutations (See Reviewer comment 1). 

      This heterogeneity in transgene dosage and expression levels will inevitably have introduced ‘noise’ into our data. We can account for this somewhat by large numbers of embryos injected per experiment and reproducibility across populations of zebrafish between experiments. We also note that this strategy reflects the heterogeneity in human PROS, with disease mosaicism, presentation, and severity being highly variable from person to person. Therefore, we don’t necessarily see this as a drawback for our current approach. 

      (3) The manuscript would benefit from rescue experiments to strengthen causal conclusions. It remains unclear whether the phenotypes induced by PIK3CA PROS variants can be rescued, either through expression of wild-type PIK3CA, pharmacological inhibition of PI3K signaling, or assessment of developmental reversibility. Such experiments would strengthen the link between PI3K activation and the observed phenotypes.

      We agree this is an exciting direction and a great next step for this research to take. This work is currently ongoing, using the specific Pik3ca inhibitor alpelisib, and optimizing treatment conditions to ensure our experimental readouts are meaningful. Through phenotype scoring we do see a significant rescue in the severity of vascular malformations in PROS mosaic embryos. However, we didn’t feel this work was ready for the initial submission because (1) the concentrations we must add to the zebrafish medium by immersion are far higher than the doses needed for inhibition of PI3K signalling in human cell lines and (2) we do not see an obvious decrease in pAkt or pS6 levels by western blot analyses of embryos at alpelisib doses of up to 100 μM, for either short or long term exposure. This drug is poorly soluble in water, and so we are also experimenting with introducing it to embryos intravenously. 

      (4) The authors propose candidate signaling molecules mediating non-cell-autonomous effects downstream of PI3K hyperactivation; however, these conclusions remain speculative, as no functional validation is provided. Testing selected candidate mediators identified in the RNA-seq dataset would significantly strengthen the mechanistic conclusions.

      We thank the reviewer for this suggestion, and it is indeed a long-term aim of our work to find better treatments for PROS by combining inhibition of PI3K signalling with other candidate mediators to treat overgrowth. Our scRNA-seq experiments suggest that Notch, Wnt and Ephrin signalling pathway components may contribute to disease, and so offer considerable potential for treatment strategies. Once we have optimised treatment with alpelisib to rescue our disease phenotype in line with current mammalian models (see response to Comment 3 above), we will start to look at other candidate mediators alone or in conjunction with alpelisib. However, given the challenges we are facing with the alpelisib treatment, we may need to develop this work in a subsequent study.

      Reviewer #2 (Public review):

      In this manuscript, Brunsdon et al. aim to study PIK3CA-related overgrowth spectrum (PROS) by establishing a mosaic zebrafish model with overexpression of pik3ca carrying hotspot mutations, coupled with an mScarlet+ reporter. Using fluorescence microscopy, the authors demonstrated that overexpression of pik3ca with a number of hotspot mutations led to mesodermal and particularly vascular malformations in the zebrafish model. Interestingly, they found a paucity of mScarlet+ mutant cells in the vascular lesions, consistent with the finding of low PIK3CA mutation burden in PROS tissue. Such data suggest a non-cell-autonomous effect of PIK3CA mutation. Following this logic, the authors performed single-cell RNA sequencing on zebrafish overexpressing WT pik3ca and mutant pik3ca at 19 hpf, and demonstrated widespread transcriptomic perturbations across multiple lineages, including lineage frequencies, key cell pathways, and cell-cell interactions. Importantly, they demonstrate that mScarlet+ cells carrying mutant pik3ca cluster separately from other cell types, do not demonstrate clear lineage identity, and have a general downregulation in signaling components.

      Overall, the conclusions in the manuscript are well-supported by the presented data. The imaging studies are particularly convincing. The transcriptomic analysis generated a list of potential pathways to further investigate and potentially target with future therapeutic interventions. Importantly, this study provides a valuable in vivo model of PROS that: 1) recapitulates key features of PROS (e.g., multiple mesodermal defects, paucity of mutation burden in lesions suggesting non-cell-autonomous interactions); 2) is scalable; and 3) offers direct visualization of lesion development, compatible with time-course live imaging. This model will be valuable to further understand PROS and potentially study other diseases where the PIK3CA pathway is altered (e.g., certain cancers).

      We thank Reviewer 2 for their careful reading and support of our manuscript, and their helpful suggestions. 

      The following are not necessarily weaknesses of the data, but rather suggestions where the manuscript could be further strengthened:

      (1) The model recapitulates the variability of mesodermal lesions in PROS. It would be valuable to utilize this model to further study factors that are associated with the development of more severe lesions (e.g., by comparing samples with more severe lesions to those unaffected despite carrying the mutations, Figure 1F).

      This is a very interesting question, and something that we have wondered ourselves. The clinical observation that PROS mutations cause pathology in mesodermal-derived tissues suggests that there is a lineage permissivity of PROS mutations. We plan to perform additional scRNA-seq experiments on later stage embryos (aligned with Figure 1) and hope to incorporate comparison of embryos with more severe lesions to those unaffected despite carrying pik3caPROS mutations. 

      (2) ScRNA-seq analysis could be enriched with a comparison between cells overexpressing mutant pik3ca vs. those overexpressing WT pik3ca.

      The scRNA-seq experiment presented in this paper was limited by funding constraints at the time, and so we focussed on choosing samples that were likely to yield the most meaningful data. Ideally, we would have included a WT overexpression control in addition to an injected no-pik3ca control, however as we did not observe any phenotypes associated with mosaic pik3caWT transgenic embryos (Supplementary Figure 1A, Figure 3D), we chose to not include this condition. We are grateful for subsequent funding that will allow us to perform a scRNAseq experiment at a later timepoint, detailed below, where we plan to include this control.

      (3) In the scRNA-Seq analysis, it is curious that the C0 cluster, enriched with mScarlet+ cells, is found to have downregulated signaling interactions (Fig. 5C), yet exerts a widespread non-cell-autonomous effect. Meanwhile, there is also a noticeable loss of certain lineages (e.g., notochord, Figure 4E) and related cell-cell interactions (e.g., notochord-related interaction, Figure 5A). A deeper exploration of the basis of the non-cell-autonomous effect would be valuable.

      Thank you for this important comment. We agree that this finding is very interesting and warrants further investigation, although a definitive answer may be too difficult for this current revision. Using conventional differential expression analyses on our scRNA-seq data (such as those used in Figure 4), we could not find significant upregulation of many genes and pathways, and CellChat and NICHES analyses did suggest that signalling between C0 and other clusters was weak. Nevertheless, using the Decoupler package, we did find significant upregulation of some footprint signatures enriched in mScarlet+ vs mScarlet- cells in PROS mosaics (Supplementary Figure 4B), including PI3K and EGFR (as one would expect), but also apoptosis and UV response, suggesting that overexpression of pik3caPROS may cause cellular stress. Using NICHES, we also found Myc, Notch, Wnt and Ephrin ligand-receptor pairs to be upregulated in PROS mosaic C0 sending and receiving interactions compared to controls, which would be candidates for validation in subsequent studies (Supplementary Figure 4C). We will be interested to determine whether C0-like cells are present in older embryos in our scRNA-seq analysis, and whether they have similar signalling activity.

      (4) The scRNA-Seq analysis was performed at one time point (19 hpf). Additional analysis (not necessarily by scRNA-Seq) at other time points to study whether findings at 19 hpf are persistent throughout development or undergo dynamic changes (e.g., cell fate/state of mSc+ mutant cells) would be helpful.

      We agree that the inclusion of a later timepoint in our scRNA-seq experiment would be valuable in answering a lot of our questions about the fate of C0 cells and the persistence of the transcriptional dysregulation, including non-cell autonomous interactions that we see at 19 hpf. As mentioned above, we were constrained by time and funding for the original experiment but are now in a position to add to this work and address this point.

      (5) The scRNA-Seq analysis provides a valuable list of perturbed interactions that could be targeted by future therapeutic approaches. Validation of the scRNA-Seq findings with protein-level analysis, and studying the effect of targeting some of the pathways on the disease phenotype, would offer valuable data for the community.

      Thank you for this comment. We agree that this is an essential next step to take and is also a priority for our patient advocates. As mentioned above (Reviewer 1, point 4), we would like to be confident that alpelisib is on-target in our system first, and then we very much want to identify new therapeutic avenues to explore in this pre-clinical space.

      Reviewer #3 (Public review):

      Summary:

      The study "PIK3CA-related overgrowth spectrum (PROS) zebrafish models reveal pan-lineage developmental dysregulation" presents important findings that extend significantly beyond a single subfield, bridging developmental biology, vascular medicine, and cancer-related PI3K signalling. By developing mosaic zebrafish models of PROS and combining live imaging with single-cell transcriptomics, the authors provide compelling evidence for a non-cell-autonomous mechanism of tissue overgrowth, a conceptual shift with meaningful therapeutic implications.

      We thank Reviewer 3 for their time and thoughtful comments considering our work.

      Strengths:

      The evidence is overall convincing, with methodology appropriate and well-validated relative to the current state of the art; the integration of multiple approaches (in vivo modelling, scRNAseq, ligand-receptor inference) strengthens the central claims. However, some aspects of the proposed non-cell-autonomous signalling mechanisms remain partly correlative, and direct functional validation of the rewired ligand-receptor interactions would further consolidate the conclusions.

      Weaknesses:

      The transgenic overexpression approach chosen by the authors represents a well-established and effective strategy for generating mosaic models in zebrafish. However, this approach introduces notable limitations: the lack of control over transgene dosage and unknown integration sites may generate non-physiological effects, potentially confounding the interpretation of key findings.

      Thank you for this important comment. We agree that there are limitations in our current model, and we are working to refine it such that we have temporal as well as spatial control over the expression of pik3caPROS. 

      Our funding for the start of this study came from the CLOVES Syndrome community charity, and in collaboration with them, we decided that for this work, our priority was to understand more about the disease mechanisms at disease onset, and also to be able to test multiple pik3ca hotspot mutations that affect patients. One question for families is whether the pik3ca hotspot mutations contribute differently to patient overgrowths. Our data here suggest that all mutations are able to promote overgrowth equally, and that differences between disease presentation in patients likely reflect the timing and cellular origins of the mutation.

      As a side note, together with CLOVES Syndrome community, we also felt that we wanted to test actual patient mutations, rather than artificial hyperactivated variants of Pik3ca such as the widely used p110a* allele (Hu et al. 1995; Venot et al. 2018), which can inform important mechanisms about pathway dysregulation, but less about actual patient-specific disease mutations.

      The authors are certainly aware that alternative approaches (though technically more demanding) could be considered in future studies to further strengthen the model. For instance, a CRISPR/Cas9-mediated knock-in of the pik3ca-PROS allele at the endogenous locus (retaining upstream native regulatory elements with only a minimal promoter in the construct, co-expressed with a fluorescent reporter via P2A) could allow even more physiological, lineage-restricted expression while enabling direct visualisation of mutant cells. Mesodermal specificity could potentially be further refined by driving mosaic Cas9 expression under a pan-mesodermal tbx promoter, restricting editing to the relevant lineage while simultaneously marking mutant cells fluorescently, thus even more closely mimicking the postzygotic mutational events characteristic of PROS. As a complementary strategy, blastula transplantation experiments using pik3ca-PROS donor cells (ideally co-expressing a distinct fluorescent marker such as mCherry) into fli1:GFP transgenic hosts could provide a powerful and technically consolidated approach to directly visualise and quantify non-cell-autonomous effects on host vasculature, with precise control over mutant cell burden. This combinatorial framework, separating donor mutant cells from host tissue in a two-colour imaging setup, could be particularly compelling for validating the ligand-receptor rewiring predicted by single-cell transcriptomics in future investigations.

      These reflections are offered in the spirit of prospective methodological development and do not diminish the value of the current work, which opens a valuable new avenue for therapeutic investigation, suggesting that targeting indirect overgrowth-propagating signals, alongside PI3K inhibition, deserves serious consideration.

      Thank you for these excellent suggestions and feedback. We are keen to try to generate fish that more closely align with what is happening in patients. Two challenges we have faced include: 

      (1) In our hands, the pik3ca promoter itself is not strong enough to drive fluorophore expression to an extent that we can observe fluorescent PROS cells in zebrafish. As a control, after we saw no fluorescence attempting to knock-in fluorophores at the 5’ end of endogenous pik3ca, we tried making a transgenic using various lengths of pik3ca promoter regions driving GFP expression. Despite having stable integration of the transgene shown by a secondary transgene reporter inherited through to F1 generation, we could not visualise GFP/mNeonGreen expression at any stage of development.

      (2) A drawback of the IRES approach we used here is that the fluorophore expression levels will be lower than using a short cleavable peptide sequence such as P2A. Unfortunately, the critical kinase region (and location of the orthologous hotspot codon 1048) is located only a few amino acids from the stop codon, and we found that the function of Pik3ca was likely impeded by the addition of several extra amino acids after the P2A cleaves itself.

      Despite these challenges, we hope to be able to generate models in future with more precise control over mutant cell burden. 

      References

      Hu Q, Klippel A, Muslin AJ, Fantl WJ, Williams LT. 1995. Ras-dependent induction of cellular responses by constitutively active phosphatidylinositol-3 kinase. Science 268: 100-102.

      Madsen RR, Vanhaesebroeck B, Semple RK. 2018. Cancer-Associated PIK3CA Mutations in Overgrowth Disorders. Trends in Molecular Medicine, pp. 856-870.

      Venot Q, Blanc T, Rabia SH, Berteloot L, Ladraa S, Duong JP, Blanc E, Johnson SC, Hoguin C, Boccara O et al. 2018. Targeted therapy in patients with PIK3CA-related overgrowth syndrome. Nature 558: 540-546.

    1. eLife Assessment

      This work presents important findings on quantifying gene coexpression from spatial omics. These quantification methods have been applied to gastruloids to describe how genes are spatialised. The description of the quantification tools might be incomplete, which also weakens the biological message. Clearer formalization and justification of the quantification will improve the study.

    2. Reviewer #1 (Public review):

      Summary:

      The authors performed seqFISH in 26 gastruloids and performed a variety of computational analyses on these novel spatial data sets. Whilst the data is valuable and the computational concepts useful (exposure index, L-metric, ... ), the article falls short on novelty and is written using a very clunky language, often with contradictory conclusions.

      Major issues:

      (1) The authors did well in explaining and detailing the provenance of data and the individual experiments performed. However, their 26 gastruloid data still constitute a very limited sampling from their total organoids: one experiment pooled 4 plates at an 80-94% success rate; 6 different aggregation experiments were done, making a total of 1843 gastruloids, of which 26 (~1-2%) were sampled. A simple IF stain of 2-3 markers in a bigger sample could have given a more accurate picture of specific domains of interest and their proximity. Regardless, more information should be given about the existing samples: variation across experimental batches, and differences between the 300-cell and 100-cell gastruloids that were used.

      (2) Language in the manuscript should be revised. Overall, the manuscript is very long and descriptive, and written "impressions and beliefs" are often not adequately justified and can indeed be contradictory. For example, in Section 1 the title states "cell types' locations ...are consistent", yet a few sentences later we find "there was substantial variation" and "within range of what would be considered a 'morphologically normal' gastruloid". Expressions such as "quite consistent", "compelling patterning" and "we don't believe" are best avoided and replaced with data, or bolstered with quantitative values such as percentages when a given cutoff is used. Another example: "location of each cell type relative to gastruloid morphology was quite consistent the posterior region ... mainly consisted in NMPs." Given T expression in the posterior, this result, phrased as such, appears quite inflated. In fact, looking at the cell types in Figures S1 and 2a/b/c, this reviewer would state that they are anything but consistent, and indeed it takes sophisticated analyses to find a pattern (of sorts) beyond the coarse domains expected!

      (3) Figure 6 is one of the most valuable parts of the work, as the authors use the battery of analyses developed to investigate the variable and not-so-robust endothelial clusters in gastruloids. However, this investigation is still very preliminary, and it should be further linked with known biology. It is still unclear what the unique organization of this cell type is (circularity isn't convincing) and whether any signalling cues of adjacent cells could explain it. Is there any evidence that more mature endodermal cell types are generated (like the suggested "liver") to give rise to endothelial cells? It would certainly be interesting to perform IF for this cell type together with mesodermal and endodermal markers to validate seqFISH predictions on a bigger sample.

      (4) Figures 1c and 6b need statistical significance assessments.

      (5) The article should include an analysis of Hox colinearity expression in these gastruloids as a validation of the system.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents an ambitious and technically challenging spatial-transcriptomic atlas of 26 gastruloids using seqFISH. The authors introduce quantitative metrics (mixing score, exposure index, L-metric / scL-metric, spatial L-metric, triplets) to characterize spatial organization at multiple scales. The dataset is valuable, and several analyses are original, particularly the rank-based L-metric family for mutual exclusivity.

      Strengths:

      The authors generate one of the most detailed spatial transcriptomic datasets of gastruloids to date. They propose creative computational metrics (L-metric/scL-metric) to quantify mutual exclusivity of gene expression without predefined thresholds, and they explore organizational principles from single-cell topology to cluster-level structure. Many observations align well with known gastruloid biology, such as posterior robustness and anterior variability. The writing is generally clear, and the figures are rich.

      Weaknesses:

      Several central claims rely on metrics whose computation and justification are insufficiently explained, making it difficult to assess how robust or interpretable the results are. Many choices in the analysis appear arbitrary or are insufficiently motivated (normalization schemes, choice of parameters such as the number of neighbors, the distance cutoffs, hierarchical clustering setup, and so on). The interpretations of spatial consistency, gene-program inference, and endothelial heterogeneity are plausible but might be stronger than the evidence currently supports.

      The manuscript would benefit from stronger benchmarking, quantification of uncertainty, and explicit controls for known artifacts in spatial transcriptomics (e.g., spillover, 2D slicing, cell type assignment entropy). The biological insights are promising, but since several depend on methodological assumptions that have not yet been demonstrated to be stable, they would benefit from clearer methodological explanation.

      The work is rich and could become a reference dataset. Clarifying and validating the quantitative methods would therefore considerably strengthen the impact and reliability of the conclusions.

    4. Reviewer #3 (Public review):

      Summary:

      Triandafillou and colleagues report a single-cell resolved spatial atlas of gene expression of 26 gastruloids. While previous work had analyzed either single-cell gene expression or spatially coarse-grained patterns of gene expression (van den Brink et al, 2020), the authors here use multiplexed sequential RNA FISH (seqFISH) to create the first gastruloid atlas, which is simultaneously spatially and cellularly resolved. This atlas adds to a growing list of resources cataloging gastruloid development (see also Suppinger et al 2023).

      To analyze this dataset, the authors also describe a novel analytical framework. Their analysis centers around the 'L-metric', which measures the degree to which pairs of genes are either coexpressed or mutually exclusive. While this metric is similar to calculating correlations in gene expressions, it has important differences (including that it can, in principle, be asymmetric; although the authors symmetrize much of their analysis). In addition to the gene-centric L-metric analysis, the authors also analyze cells in their dataset according to the cell type entropy (an information-theoretical measure of confidence in cell type assignment) and the 'exposure index' (a measure of the similarity of nearest cellular neighbors).
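
      As a rough illustration of the cell type entropy idea mentioned above, the sketch below computes a Shannon entropy over one cell's assignment probabilities. This is an assumption-laden example, not the authors' implementation: the function name `assignment_entropy`, the use of base-2 logarithms, and the input format (one probability per candidate cell type) are all hypothetical choices made here for illustration.

```python
import math


def assignment_entropy(probs):
    """Shannon entropy (in bits) of one cell's cell-type assignment probabilities.

    Hypothetical helper: `probs` holds one non-negative score per candidate
    cell type. Low entropy means a confident assignment; high entropy means
    the cell is ambiguous between several types.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalise so the scores sum to 1
        if q > 0:      # treat 0 * log(0) as 0
            entropy -= q * math.log2(q)
    return entropy


# A cell assigned almost entirely to one type versus one split evenly over four.
confident = assignment_entropy([0.97, 0.01, 0.01, 0.01])
ambiguous = assignment_entropy([0.25, 0.25, 0.25, 0.25])
```

      Under these assumptions, an even split over four candidate types gives the maximum of 2 bits, while a near-certain assignment gives a value close to zero.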

      Using this framework, the authors focus their analysis on two major features of development. The first is the differentiation of the bipotent neuromesodermal progenitor (NMP) cells in the posterior of the gastruloid into either presomitic mesoderm (PSM) or spinal cord (SC) lineages. They use L-metric analysis to compare overlap in marker genes used to separate NMP, PSM, and SC fates. They highlight that L-metric analysis can recover spatial patterns of gene expression (without explicit spatial information) and discern subtle features of marker genes beyond simple binning of cell types (e.g., that Epha5 expression in anterior NMPs may predict future SC differentiation).

      The second is the formation of endothelial (spatial) clusters within the gastruloid. The authors highlight two subtypes of endothelial clusters: (1) smaller clusters within the somitic anterior region, and (2) larger clusters associated with endoderm. While the authors discern some subtle differences in gene expression between these two clusters, their different spatial patterns suggest a potential physiological difference that would not be captured in traditional droplet microfluidic-based scRNAseq pipelines.

      Overall, this manuscript is a sophisticated and technically sound study that will provide a valuable beachhead for future studies of developmental patterning in gastruloids and organoids.

      Strengths:

      The major strengths of this study are the overall technical sophistication of the data set and analysis, as well as its potential generalizability to other developmental systems (both in vitro and in vivo). The data are extensively analyzed and reasonably interpreted, and this atlas makes good use of the variability in gastruloid development to extract the statistical structure of developmental processes. The L-metric offers a parameter-free tool to analyze transcriptomic datasets that could overcome the pitfalls of other approaches.

      Weaknesses:

      The major limitations of this study are the depth and novelty of the developmental processes studied. The authors provide very convincing proof-of-concept that their data set can recover known features of gastruloid development, including NMP differentiation and endothelial development. However, further analysis and/or investigation would be required to discover new principles of gastruloid development and patterning.

    1. eLife Assessment

      This valuable study provides the first broad cross-species evolutionary analysis of the pir multigene family in malaria parasites, showing that the family evolved through rapid duplication and loss while retaining a small number of conserved orthologs with essential functions. The authors identify pirC1 as a key determinant of parasite growth across multiple Plasmodium species. However, the work remains incomplete because the mechanistic role of PIRC1 and its precise subcellular localization are not directly resolved.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript entitled "Essential function reflected in the phylodynamics of a multigene family - the pir genes of malaria parasites" by Jackson and colleagues investigates the global phylogeny of pir genes across 14 Plasmodium species and one Hepatocystis species. The authors also focus on the functional characterization of the conserved ortholog pirC1 and claim that pirC1 is not the founder of the family and that it plays an essential role in blood-stage growth.

      Strengths:

      Overall, the manuscript is well written and interesting, as it combines comparative genomics and evolutionary analysis with functional experiments. The phylogenetic analysis is rigorous and represents a major strength of the manuscript.

      Weaknesses:

      The general conclusions regarding the potential function of this gene family are not fully supported by the data presented. The manuscript moves too quickly from growth phenotype and localization studies to a specific mechanistic model. The discussion argues that PIRC1 may be involved in nutrient acquisition, host sensing, or metabolic support, but the data provided do not directly support these functions, and the manuscript in its present form remains speculative. Although the manuscript includes some experimental results, it lacks direct mechanistic validation of the specific functions of the pir genes, including pirC1. In its current form, the study does not yet establish a definitive role for pirC1 in metabolic processes.

    3. Reviewer #2 (Public review):

      Summary:

      This is an extensive study using phylogenetic comparison across multiple Plasmodium species to gain new insights into their evolutionary pathways and the potential function of pir. In addition to establishing a framework to identify related orthologues across species as well as expanding paralogue families within a species, the work also focuses on understanding loss and gain of different PIRs and how this indicates a relative lack of functional constraints and essentiality for most members of the gene family.

      The authors provide evidence that at least pirC has a conserved function and plays an important role in parasite growth in multiple species.

      While this study represents a significant effort and does provide interesting new insights that would help our understanding of this complex gene family in the future, it has a number of limitations.

      Strengths:

      Extensive and thorough phylogenetic analysis that is supported by some biological validation. Provides an indication that the PIR gene family has limited biological constraints and evolved independently across different species, leading to rapid expansion and deletion of orthologous groups. Identified pirC as a functional and important member of the family that is conserved across the species.

      Weaknesses:

      The phylogenetic tree is based on a truncated sequence that focuses on the more conserved parts of the pir sequence. This could potentially lead to missing the key functional drivers of evolution. The biological validation of the role of pirC has some inconsistencies that need to be addressed.

    4. Reviewer #3 (Public review):

      This paper aims to classify, from an evolutionary perspective, the multigene family PIR found in malaria parasites infecting rodents and Old World monkeys, and to link this classification to functional diversification. The authors also hypothesize that PIR members conserved across species play important roles in parasite survival, and seek to clarify their functions.

      To achieve these aims, the authors comprehensively analyze the evolution of PIR genes using genomic and transcriptomic information from many malaria parasite species. They focus on PIRC1, a member conserved across species, and attempt to clarify its function in rodent and simian malaria parasites by examining the phenotypes of parasites in which the corresponding genetic locus has been disrupted. They also attempt to determine its localization using PIRC1 tagged with an epitope sequence. However, although the locus-disrupted parasites appear to show an approximately 50% reduction in growth rate, this effect seems to be overestimated. Another weakness is that the cause of the reduced growth rate has not been clarified. The localization analysis also remains insufficiently conclusive.

      Therefore, I consider that the first half of the paper, consisting of the bioinformatics analyses, achieves the objective of comprehensively summarizing PIR and may become a reference paper for discussing the evolution and function of the PIR gene family. On the other hand, regarding the function of PIRC1, no clear conclusion can be drawn from the results presented, and several additional experiments are necessary.

      My major comments are as follows.

      (1) The claim that the failure of eight disruption attempts indicates that pirC1 is essential is too strong.

      Lines 319-321: The authors argue that a total of eight failed attempts to disrupt the pirC1 locus using two different construct designs suggest that pirC1 is essential in P. berghei. However, the failure of these attempts could also reflect technical issues with the construct design itself, such as the length of the homologous regions used for recombination, which are approximately 650 bp. Therefore, it is an overstatement to conclude that "pirC1 is essential for P. berghei blood-stage growth." Given that parasites with disruption of the corresponding locus could be obtained in both P. chabaudi and P. knowlesi, a more appropriate statement would be that "pirC1 is important for P. berghei blood-stage growth."

      (2) The data on the mCherry-expressing P. berghei line shown in Supplementary Figure 11 are insufficient.

      (a) Panel C: Southern blot analysis

      To conclusively identify the lower band in panel C as chromosome 1, additional probes specific to genes located on chromosomes 1 and 2 would be required. In addition, a parental parasite control should also be included. The Southern blot image of the parental parasite should show only a single band at the higher position, with no band at the lower position. Probes specific to chromosomes 1 and 2 would help demonstrate that the lower band corresponds to chromosome 1, rather than chromosome 2.

      To this end, the authors could describe the result as follows:

      "In the parental parasite, only a single band corresponding to chromosome 7 was detected, indicating that the smaller chromosome was genetically modified. The size of the lower band detected with the dhfr probe was identical to that of the band detected with the control chromosome 1 probe, but distinct from that detected with the chromosome 2 probe, indicating that chromosome 1 was modified."

      That said, this chromosome-level Southern blot analysis is not sufficient to demonstrate that the target PBANKA_0100500 locus was specifically modified. The authors should provide more direct evidence showing that the PBANKA_0100500 locus, rather than another genomic locus, was modified. For example, Southern blot analysis after restriction enzyme digestion would provide more definitive evidence. Diagnostic PCR may also provide more specific evidence.

      (b) Panel D: Flow cytometry analysis

      To allow a more accurate interpretation of the percentage of mCherry-positive cells, flow cytometry data for the parental parasite line should also be presented.

      (3) There are unclear points in the PCR results shown in Supplementary Figure 12.

      Supplementary Figure 12: In panel B, a PCR product should also be amplified from dPCHAS_0101200 using the P1-P3 primer pair. Why is this band absent? The authors should provide the uncropped electrophoresis image so that the larger band can be seen. In addition, if labels 1 and 2 indicate independent clones, this should be stated in the figure legend.

      (4) The growth rates of P. chabaudi and P. knowlesi parasites with disruption of the PIRC1 gene locus should be quantitatively analyzed.

      The growth rates of P. chabaudi and P. knowlesi are described only qualitatively, but they should be evaluated quantitatively. In Figure 4A, the parasitemia of wild-type P. chabaudi increases from approximately 6.1% on day 6 to approximately 15.6% on day 8, corresponding to a 3.8-fold increase. However, because parasite growth may already be affected by immune-mediated suppression at this stage, this value should be regarded as a minimum estimate. In contrast, the mutant increases from approximately 3.2% on day 8 to approximately 6.8% on day 10, corresponding to a 2.1-fold increase. Based on these values, the daily growth rate of the mutant appears to be reduced to at least approximately 56% of that of the wild type. Similarly, from the growth curve of P. knowlesi in Fig. 5A, the DMSO-treated group appears to increase approximately two-fold per day, whereas the rapamycin-treated group increases only approximately one-fold per day. Thus, P. knowlesi also appears to show an approximately 50% reduction in growth rate. Taken together, both P. chabaudi and P. knowlesi appear to reproducibly show an approximately 50% reduction in growth capacity. A reduction of this magnitude is difficult to describe as a "severe growth defect"; a more appropriate wording would be simply that the parasites "showed a growth defect." In addition, the terms "a severe growth defect" and "essential" appear to be overstated throughout the manuscript, and the wording should be toned down. Finally, I recommend presenting Figure 4A and Figure 5A on a logarithmic scale so that the trend in growth rates can be more intuitively appreciated from the graphs.
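
      The fold-change arithmetic in this comment can be sketched as follows. This is a minimal illustration only, not part of the manuscript or the review: the helper name `fold_increase` is hypothetical, the mutant parasitemia values are those quoted above, and the wild-type fold increase is taken as the stated 3.8-fold rather than recomputed.

```python
def fold_increase(start_pct: float, end_pct: float) -> float:
    """Fold increase in parasitemia between two time points."""
    if start_pct <= 0:
        raise ValueError("starting parasitemia must be positive")
    return end_pct / start_pct


# Mutant: approximately 3.2% (day 8) to 6.8% (day 10), as quoted above.
mutant_fold = fold_increase(3.2, 6.8)

# Wild type: the 3.8-fold figure is taken as stated in the comment.
wild_type_fold = 3.8

# Mutant growth over the two-day window, relative to wild type.
relative_growth = mutant_fold / wild_type_fold

print(f"mutant fold increase: {mutant_fold:.1f}")
print(f"mutant relative to wild type: {relative_growth:.0%}")
```

      Plotting parasitemia on a logarithmic axis, as recommended above, would show each of these fold increases as the slope of the curve.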

      (5) The evidence that disruption of the PIRC1 gene locus in P. knowlesi does not affect erythrocyte invasion is weak.

      The authors describe that "the developmental cycle of the parasites lacking PIRC1 is slightly longer than that of parasites that produce PIRC1 (lines 383-384)," and appear to support this interpretation with data showing that "mutant parasites are significantly smaller than wild-type parasites (line 414)" and that "the DNA content in ML10-arrested parasites lacking PIRC1 is lower than that of DMSO-treated parasites (lines 417-418)" at 24 hours after invasion. However, a slightly longer developmental cycle alone does not seem sufficient to explain a 50% growth reduction.

      I think the erythrocyte invasion capacity has not been quantitatively evaluated, and therefore, the evidence supporting the conclusion that the phenotype of P. knowlesi parasites with disruption of the PIRC1 gene locus is unrelated to erythrocyte invasion is weak. The authors should assess invasion efficiency using purified merozoites. For P. chabaudi, it should also be possible to apply an in vitro or in vivo erythrocyte invasion assay similar to that used for other rodent malaria parasites, and this should be evaluated as well.

      (6) The authors should examine whether disruption of the PIRC1 gene locus results in a phenotype characterized by a reduced number of merozoites.

      Alternatively, the reduced DNA content in ML10-arrested parasites lacking PIRC1 (lines 416-417) could suggest that the number of merozoites formed per schizont is reduced. To clarify this point, the authors should assess whether the number of merozoites per schizont is altered in P. knowlesi (and P. chabaudi) parasites lacking PIRC1.

      (7) The authors propose the possibility that PIRC1 expressed in merozoites is released after invasion; however, the evidence that PIRC1 localizes to intracellular organelles is weak.

      Line 333: "a peripheral pattern around the parasite" is indicative of the parasite plasma membrane, PV, or PVM. The phrase ", indicative of a parasitophorous vacuole (PV) or parasitophorous vacuole membrane (PVM) location" should therefore be amended to ", indicative of parasite plasma membrane, a parasitophorous vacuole (PV) or parasitophorous vacuole membrane (PVM) location". In the Figure S14 image, red signals are uniformly detected from the merozoites formed in the schizont-stage parasite (not really microorganelle patterns), but not from the PVM surrounding the schizont, suggesting parasite plasma membrane localization, not PVM. I agree that the signal is detected from the compartments extending into the iRBC cytosol, which may be difficult to explain if it is located on the parasite plasma membrane, but how frequently were such images seen?

      Figure 4D. In the images of liver-stage schizonts, AMA1 does not appear to localize to the micronemes in mature merozoites, suggesting this image is an immature schizont. Although PIRC1 appears to be expressed in liver-stage schizonts, it is difficult to clearly determine whether it localizes to intracellular organelles or to the parasite plasma membrane.

      To clarify the above points, the authors should examine whether PIRC1 is detected in intracellular organelles or around the merozoites by analyzing its localization in purified merozoites.

    1. eLife Assessment

      This important study by Bi and colleagues employed a clever genetic screen to uncover the role of the GidB rRNA methylase in translation fidelity, under certain conditions, in Mycobacterium smegmatis. The evidence is solid and supports the conclusion that the loss of GidB results in mistranslation. The work contributes to a more in-depth understanding of mycobacterial translation fidelity and will be of interest to microbiologists.

    2. Reviewer #2 (Public review):

      Summary:

      Protein synthesis - translation - involves repeated recognition and incorporation of aminoacyl-tRNAs by the ribosome. This process is a trade-off between the rate and accuracy of selection (for review see (Johansson et al, 2008; Wohlgemuth et al, 2011)). The ribosome does not just maximise the rate or the accuracy, it balances the two. Therefore, it is possible to select mutants that translate faster than the wt (but are sloppy) or that are very accurate (more than the wt) but translate slower. Slow translation is detrimental as it limits the rate of protein synthesis (and, therefore, growth), and error-prone mutants accumulate mis-translated proteins, which is detrimental for the cell.

      Bi and colleagues employ genetics, MIC measurements, reporter assays and structural biology to characterise the role of GidB rRNA methylase in translational accuracy in Mycobacterium smegmatis.

      Strengths:

      The genetics and phenotypic assays are convincing and establish the biological role of the methylase. The authors use a powerful set of complementary assays that convincingly demonstrates that the loss of GidB results in mistranslation.

      Weaknesses:

      Cryo-EM analysis of vacant 70S ribosomes is not sufficient for understanding the mechanisms underlying the accuracy defects in the gidB KO. Ideally, one should assemble and solve structurally near-cognate and non-cognate complexes.

      References:

      Johansson M, Lovmar M, Ehrenberg M (2008) Rate and accuracy of bacterial protein synthesis revisited. Curr Opin Microbiol 11: 141-147

      Wohlgemuth I, Pohl C, Mittelstaet J, Konevega AL, Rodnina MV (2011) Evolutionary optimization of speed and accuracy of decoding on the ribosome. Philos Trans R Soc Lond B Biol Sci 366: 2979-2986

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Javid and colleagues worked to understand the molecular mechanisms involved in mistranslation in mycobacteria. They had previously discovered that mistranslation is an important mechanism underlying antibiotic tolerance in mycobacteria. Using a clever genetic screen they identify that deletion of gidB, a 16S ribosomal RNA methyltransferase, leads to lowered mistranslation (i.e. higher translational fidelity), but only in genetic backgrounds or environmental conditions that increase mistranslation rates.

      Strengths:

      The strengths of this manuscript are the clever genetic screen, the powerful mistranslation assays, and the clear writing and figures explaining a complex biological problem. Their identification of gidB as a factor important for mistranslation deepens our knowledge about this interesting phenomenon.

      We thank the Reviewer for their summary of our work and the strength of coupling specific mistranslation assays with the genetic screen approach.

      Weaknesses:

      The structural work at the end feels like an afterthought in terms of both the science and the writing. I would suggest re-writing that section to be clearer about what the figure says and does not say. For example, the caption of Figure 6 appears to be more informative than the text and refers to concepts not present in the main text. In general, I found this section to be the most difficult to understand.

      We have revised this section, including re-analysis of the structural data and completely new figures, as well as revised comments placing the findings in context with the other data. See revised Fig. 6.

      Reviewer #2 (Public review):

      Summary:

      Protein synthesis - translation - involves repeated recognition and incorporation of aminoacyl-tRNAs by the ribosome. This process is a trade-off between the rate and accuracy of selection (for review see (Johansson et al, 2008; Wohlgemuth et al, 2011)). The ribosome does not just maximise the rate or the accuracy, it balances the two. Therefore, it is possible to select mutants that translate faster than the wt (but are sloppy) or that are very accurate (more than the wt) but translate slower. Slow translation is detrimental as it limits the rate of protein synthesis (and, therefore, growth), and error-prone mutants accumulate mis-translated proteins, which is detrimental for the cell.

      Bi and colleagues employ genetics, MIC measurements, reporter assays, and structural biology to characterise the role of GidB rRNA methylase in translational accuracy in Mycobacterium smegmatis.

      Strengths:

      The genetics and phenotypic assays are convincing and establish the biological role of the methylase. The authors use a powerful set of complementary assays that convincingly demonstrate that the loss of GidB results in mistranslation.

      We thank the Reviewer for their recognition of the strengths of our work, including the combination of genetic screens and specific assays to demonstrate the contribution of GidB in specific translational fidelity in mycobacteria.

      Weaknesses:

      (1) It would be essential to provide information regarding the growth rate and, ideally, translation rates in the gidB KO and the isogenic WT. As translation balances accuracy and speed, only characterising the speed is not sufficient to understand the phenomenon.

      We have now performed these assays (New Fig. S6). (1) The growth rate of gidB1-KO is the same as the respective background (WT or HWS19) strain with functional GidB. (2) We have performed a measure of translational efficiency as a surrogate for speed (see PMID 32723820), New Fig. S7. As can be seen, deletion of GidB does not affect translation of Nluc luciferase in either the WT or the HWS19 background, suggesting that discrimination of mischarged tRNAs (even in a context in which that is the dominant form of translational error) is not rate-limiting, and that this form of accuracy is distinct from ribosomal mRNA decoding. This is further corroborated by a new preprint from our group (https://www.biorxiv.org/content/10.1101/2024.10.20.619312v2) showing that a novel small molecule that also increases specific translational fidelity does not affect translational efficiency, suggesting that this is a conserved phenomenon in mycobacterial translation.

      (2) Cryo-EM analysis of vacant 70S ribosomes is not sufficient for understanding the mechanisms underlying the accuracy defects in the gidB KO. One should assemble and solve structurally near-cognate and non-cognate complexes. I believe the authors are over-interpreting the scant structural data they have. Furthermore, current representation makes it impossible to assess the resolution of the structure, especially in the areas of interest.

      While we agree with the Reviewer that structures of translating ribosomes will be most informative in elucidating the molecular mechanism(s) by which methylation (or not) by GidB contributes to mistranslation, those experiments are ongoing and beyond the scope of the current study. Unlike E. coli ribosomes, for which a plethora of mutant structures are available, there are very few structures of mycobacterial ribosomes beyond wild-type apo ribosomes. Therefore, we feel that the structures of apo mycobacterial ribosomes +/- GidB-mediated methylation are still of value, and a necessary “first step” for the mechanistic work alluded to above. Secondly, the apo ribosome structures still hint at potential mechanisms by which mistranslation and 16S rRNA methylation may impact on each other – as in the comments to R#1 above, we have revised the text to increase the clarity and coherence of this section.

    1. eLife Assessment

      This valuable work addresses a longstanding question of how the extant genetic code came to be selected and conserved almost universally across life. Using a mutational approach and a small set of reporters, the authors demonstrate that the mutational impact was similar for non-standard genetic codes. The data provide solid support for the claim of having provided experimental verification of the error minimization theory.

    2. Reviewer #1 (Public review):

      [Editors' note: This version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review satisfactorily and toned down the comments as advised.]

      In this manuscript, the authors investigate the relationship between genetic codes and their robustness to single-point mutations. They construct ten alternative genetic codes by reassigning nine codons to Leu, Ser, or Ala, and assess mutational robustness using three reporter proteins subjected to error-prone PCR. This represents an interesting experimental approach to addressing the hypothesis that the standard genetic code is optimized for mutational robustness.

    3. Reviewer #2 (Public review):

      The study addresses the long-standing question in molecular biology and genetics: why has nature selected the current genetic code (SGC, or standard genetic code)? The authors have tested 'error minimization theory', one of the prevailing hypotheses to explain this. Their approach is to create a minimum genetic code (MGC) and its variants (3^9 theoretically possible codes). Using three parameters to quantify the effect of mutations (polarity, volume, and hydropathy), they computationally test the cost of these genetic codes (3^9) by simulations. Finally, they test this cost experimentally using an in vitro translation system with 10 select genetic code variants with a range of costs (low to high). They use three randomly mutated reporter genes for this purpose - beta-galactosidase, luciferase, and mSG. They find no correlation between the cost of the genetic code and the reporters' output. Based on these observations, they suggest that error-minimization theory may not explain the current genetic code.
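
      The size of the code space mentioned here (3^9) follows from assigning each of nine reassignable codons to one of three amino acids (Leu, Ser, or Ala, as described in the review above). The sketch below only illustrates that counting argument; the codon labels are hypothetical placeholders, and it does not reproduce the authors' cost calculation.

```python
from itertools import product

# The three amino acids used for codon reassignment in the study.
AMINO_ACIDS = ("Leu", "Ser", "Ala")

# Hypothetical placeholder labels: the identities of the nine
# reassigned codons are not given in this text.
VACANT_CODONS = tuple(f"codon_{i}" for i in range(1, 10))


def enumerate_codes():
    """Yield every assignment of the nine codons to Leu, Ser, or Ala."""
    for assignment in product(AMINO_ACIDS, repeat=len(VACANT_CODONS)):
        yield dict(zip(VACANT_CODONS, assignment))


n_codes = sum(1 for _ in enumerate_codes())
print(n_codes)  # equals 3 ** 9
```

      Each yielded dictionary is one candidate code; a cost function based on polarity, volume, or hydropathy could then be evaluated on each candidate to rank them.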

      The question they are asking is very exciting, and their approach is solid. The authors are very careful in their analyses and conclusions.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Miyachi and Ichihashi investigate whether the arrangement of the genetic code affects mutational robustness. Using an in vitro minimal genetic code with vacant codons, they constructed 10 non-standard genetic codes by reassigning Ala, Ser, and Leu, generating codes with replacement costs that were generally higher than those of the standard genetic code across several amino acid property measures. They then tested how random mutations affected the activity of reporter proteins translated under these altered codes. Although error minimization theory predicts that higher-cost codes should make mutations more harmful, the authors report that protein function declined to a similar extent across all codes examined, suggesting that mutational robustness remains largely unchanged within the range of genetic code alterations tested here.

      Strengths:

      This is an interesting study that investigates one of the most fundamental and intriguing questions in molecular evolution: the emergence of the genetic code, which is nearly universal across nature. The in vitro approach is a powerful aspect of the work and provides an opportunity to examine this phenomenon experimentally at a depth that has previously been inaccessible.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, the authors investigate the relationship between genetic codes and their robustness to single-point mutations. They construct ten alternative genetic codes by reassigning nine codons to Leu, Ser, or Ala, and assess mutational robustness using three reporter proteins subjected to error-prone PCR. This represents an interesting experimental approach to addressing the hypothesis that the standard genetic code is optimized for mutational robustness.

      We sincerely thank the reviewer for the positive evaluation of our experimental approach. We are encouraged that the reviewer recognizes the value of constructing multiple non-standard genetic codes in vitro and using them to experimentally examine the relationship between genetic code arrangement and mutational robustness. In the revised manuscript, we have further clarified the scope of our experimental system and the interpretation of the results, particularly emphasizing that our conclusions concern the mutational robustness of individual reporter protein activity measured in an in vitro translation system.

      Major comment:

      While I find the experimental design valuable, I am not fully convinced by the authors' conclusion that "alterations of the genetic code within the ranges explored in this study have no significant effect on mutational robustness". The current analysis is based on the functional output of three individual reporter proteins. Given that cellular systems involve far more complex interactions, it would be more appropriate to limit this conclusion to mutational robustness at the level of individual protein activity, rather than making broader generalizations.

      We thank the reviewer for this important comment. We agree that our original wording was broader than what can be directly supported by the present experiments. Because our analysis is based on the functional outputs of three individual reporter proteins translated in a reconstituted in vitro system, the results do not directly address mutational robustness at the level of the cellular system, protein interaction networks, or organismal fitness.

      Accordingly, we have revised the manuscript to limit our conclusion to the mutational robustness of individual reporter protein activity. In the revised Abstract, Results, and Discussion, we now state that within the experimentally tested range of non-standard genetic codes, we did not detect a dependence of the mutation-induced decrease in reporter protein activity on mutational cost. We have also added a statement in the Discussion noting that cellular systems involve many additional layers, including protein–protein interactions, metabolic networks, quality-control systems, and growth selection, and that whether genetic code arrangement affects robustness at these higher biological levels remains an important question for future work.

      Specifically, we have added this explanation and the new experiment to the revised manuscript as follows.

      Abstract

      “This result provides direct experimental evidence that mutational robustness does not significantly change in individual reporter protein activity when the genetic code is altered within the range of mutational cost tested in this study…”

      Introduction

      “Random mutations decreased reporter protein function at similar levels across all genetic codes examined, implying that alterations of the genetic code within the ranges explored in this study have no significant effect on mutational robustness of individual protein activity.”

      Results

      “Taken together, these results indicate that mutational robustness of individual reporter protein function did not substantially differ among the genetic codes…”

      Discussion

      “…suggesting that mutational robustness of protein activity remained largely unchanged within at least the ranges of mutational cost tested in this study. It should be noted that this conclusion is limited to the activity of individual reporter proteins translated in a reconstituted in vitro system. Therefore, whether similar trends would be observed at the level of cellular fitness or long-term evolution remains an open question.”

      Specific comments

      (1) tRNA modification and expression efficiency (Page 5, line 131)

      The authors attribute the observed inefficiency to the lack of chemical modifications in the tRNAs used. However, gene expression efficiency can also be strongly influenced by DNA sequence design. To better support this claim, it would be helpful to compare luciferase activity when expressed using native E. coli tRNAs. This comparison could clarify whether the observed effects are due to tRNA modification status or other sequence-dependent factors.

      We thank the reviewer for this important suggestion. We agree that the translation efficiency of NanoLuc templates with 21, 32, and 46 codons may be affected not only by the chemical modification of tRNAs but also by sequence-dependent factors, such as codon context and mRNA structure.

      To examine this possibility, we performed an additional comparison using native E. coli tRNAs in the tfPURE system. When the NanoLuc templates encoded with 21, 32, or 46 codons were translated using native E. coli tRNAs, the observed luminescence values were 1.2 × 10^10, 0.78 × 10^10, and 0.60 × 10^10, respectively. Thus, the 46-codon NanoLuc template showed lower activity than the 21- and 32-codon templates even with native tRNAs, indicating that sequence-dependent effects indeed contribute to translation efficiency.

      However, the difference among these templates with native E. coli tRNAs was within approximately two-fold. This effect was much smaller than the marked decrease observed when the 46-codon template was translated using the in vitro prepared 46 tRNAs SGC system. Therefore, while sequence-dependent effects cannot be excluded, the inefficient translation in the reconstructed 46 tRNAs SGC is likely to be mainly attributable to the limited functionality of unmodified tRNAs decoding NNA codons.
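
      The "within approximately two-fold" statement can be checked directly from the three luminescence values reported above. The following is a minimal sketch for illustration; the variable names are hypothetical and the units are whatever the original assay reported.

```python
# Luminescence with native E. coli tRNAs, keyed by the number of
# codons used in the NanoLuc template (values quoted in the response).
luminescence = {21: 1.2e10, 32: 0.78e10, 46: 0.60e10}

# Fold reduction of each template relative to the 21-codon template.
reference = luminescence[21]
fold_lower = {codons: reference / value for codons, value in luminescence.items()}

# Largest difference between any two templates.
max_fold = max(luminescence.values()) / min(luminescence.values())

for codons, fold in fold_lower.items():
    print(f"{codons}-codon template: {fold:.2f}-fold lower than 21-codon")
print(f"largest difference: {max_fold:.1f}-fold")
```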

      We have revised the manuscript to clarify this interpretation and have added the new comparison using native E. coli tRNAs.

      “We also examined whether the lower translation efficiency of the 46-codon NanoLuc template could be explained by sequence-dependent effects, such as codon context or mRNA structure. When the 21-, 32-, and 46-codon NanoLuc templates were translated using native E. coli tRNAs in the tfPURE system (Figure 1–figure supplement 2), the 46-codon template showed lower activity than the 21- and 32-codon templates; however, this difference was within approximately two-fold. Accordingly, we decided to use only the 32 codons used in near-SGC (i.e., excluding NNA codons) in the subsequent construction of non-standard genetic codes.”

      (2) Discrepancy between expression level and activity (Figure S7 vs Figure S8).

      Although GAL expression levels appear similar across different genetic codes (Figure S7), their activities differ substantially (Figure S8), even in the low-mutation library. This discrepancy warrants further investigation. Possible explanations include differences in protein folding efficiency or translational error rates, as mentioned by the authors in the main text.

      To address this, the authors could analyze the protein products using mass spectrometry. If this is not feasible due to low expression levels, alternative approaches such as SDS-PAGE (e.g., with radiolabeling or Western blotting) would still provide valuable information. Additionally, comparing activity after in vitro refolding could help distinguish between folding defects and sequence-level errors. While I understand that the primary aim of this study is to compare mutational robustness across genetic codes, discussing these observations would significantly enhance the mechanistic insight of the work.

      We agree that the discrepancy between similar GAL expression levels and different GAL activities across genetic codes is important for interpreting the results.

      In our experiment, GAL protein amounts were quantified using a C-terminal HiBiT tag. Because the HiBiT tag was fused to the C-terminus of GAL, this assay indicates that the amount of C-terminally completed GAL products did not differ substantially among genetic codes. However, we agree that this assay does not evaluate the sequence fidelity, amino acid misincorporation patterns, or folding state of the translated products. Therefore, the observed differences in GAL activity despite similar HiBiT signals may reflect genetic code-dependent differences in translational error rates, amino acid misincorporation, protein folding efficiency, or other effects on the fraction of catalytically active protein.

      We have revised the Discussion to explicitly describe this interpretation and to clarify that detailed mechanistic dissection of these baseline activity differences, for example by mass spectrometry, SDS-PAGE/Western blotting, or refolding analysis, is an important future direction but beyond the scope of the present study. We also clarified that the main analysis in this study uses the ratio of activity from the high-mutation library to that from the corresponding low-mutation library within each genetic code.

      We have added this explanation to the revised manuscript as follows.

      “Although protein amounts quantified by the HiBiT tag were comparable among genetic codes, GAL activities differed substantially. This indicates that the activity differences among genetic codes were not primarily attributable to differences in the amount of C-terminally completed translation products. The HiBiT assay does not provide information on the fraction of catalytically active protein, including sequence fidelity or folding state, and therefore cannot distinguish among these possibilities. Detailed characterization of translated products by mass spectrometry would provide further mechanistic insight into how individual non-SGCs affect protein quality. However, the primary objective of the present study was to compare mutation-dependent activity loss across genetic codes. Therefore, we evaluated this effect by normalizing the activity of the high-mutation library to that of the corresponding low-mutation library within each genetic code.”

      (3) Protein expression analysis for additional reporters.

      Since protein expression levels are critical for interpreting reporter activity, similar analyses should also be performed for luciferase (Luc) and mSG in both high- and low-mutation libraries. This would ensure that differences in activity are not confounded by variations in protein abundance.

      We agree that protein abundance is an important factor for interpreting reporter activity. In this study, we performed HiBiT-based protein quantification for GAL because GAL showed the largest variation in absolute activity among genetic codes, even in the low-mutation library. This analysis showed that the amount of C-terminally completed GAL products was broadly comparable among genetic codes and between low- and high-mutation libraries, indicating that the observed GAL activity differences were not primarily attributable to differences in total protein abundance.

      For all three reporters, our main analysis was based on the ratio of activity from the high-mutation library to that from the corresponding low-mutation library within each genetic code. This normalization was intended to evaluate mutation-dependent activity loss while reducing the influence of code-specific baseline differences in expression level or protein quality. We believe that the data are sufficient to evaluate the effect of mutations on protein activities. Nevertheless, we agree that protein quantification for Luc and mSG would provide useful information regarding variation in the baseline levels of reporter activity, and this is an important direction for future work.

      Reviewer #2 (Public review):

      Summary:

      The study addresses a long-standing question in molecular biology and genetics: why has nature selected the current genetic code (SGC, or standard genetic code)? The authors have tested the 'error minimization theory', one of the prevailing hypotheses to explain this. Their approach is to create a minimal genetic code (MGC) and its variants (3^9 theoretically possible codes). Using three parameters to quantify the effect of mutations (polarity, volume, and hydropathy), they computationally test the cost of these genetic codes (3^9) by simulations. Finally, they test this cost experimentally using an in vitro translation system with 10 selected genetic code variants with a range of costs (low to high). They use three randomly mutated reporter genes for this purpose: beta-galactosidase, luciferase, and mSG. They find no correlation between the cost of the genetic code and the reporters' output. Based on these observations, they suggest that error-minimization theory may not explain the current genetic code.

      The question they are asking is very exciting, and their approach is solid. The authors are very careful in their analyses and conclusions.

      We sincerely thank the reviewer for the positive assessment of our study and for the helpful suggestions. We are encouraged that the reviewer found the question exciting and the approach solid. In the revised manuscript, we have clarified the rationale for using the MGC/near-SGC framework, added further analyses and explanations of the mutational cost calculations, and revised the wording of our conclusions to more explicitly define the scope and limitations of the present experimental system.

      (1) The rationale for using MGC instead of SGC: It is unclear why the authors rely on the MGC for this analysis when the central question concerns the SGC. If the goal is to evaluate whether the SGC minimizes mutational cost, a more direct approach would be to generate alternative variants of the SGC itself and compare their mutational cost distributions. At present, it is difficult to assess whether conclusions drawn from this comparison are fully relevant to the stated biological question.

      We thank the reviewer for this important comment. We agree that directly constructing alternative variants of the SGC by changing its amino acid assignments would be the most straightforward approach to testing whether the SGC minimizes mutational cost. However, this approach is currently not feasible in our reconstituted translation system for two reasons.

      First, our attempt to construct a 46-tRNA SGC-like system revealed that translation using the 46-codon NanoLuc template was approximately 100-fold less efficient than translation using the MGC or near-SGC (Fig. 1). This low activity likely reflects inefficient decoding of NNA codons by in vitro-prepared tRNAs, which lack native post-transcriptional modifications. Because this system did not provide sufficient translational activity for systematic reporter assays, we restricted subsequent experiments to the 32-codon near-SGC framework, excluding NNA codons. We now describe this technical limitation more explicitly in the revised manuscript.

      Second, the MGC framework provides vacant codons that can be reassigned by adding anticodon-variant tRNAs. This feature is essential for constructing multiple genetic code variants in parallel under controlled in vitro conditions. We therefore constructed near-SGC-based non-SGCs by adding each tRNA variant to the MGC, as an experimentally tractable model system for testing whether differences in genetic code arrangement affect mutation-induced decreases in reporter protein activity.

      We have added this explanation to the revised manuscript as follows.

      “We first established a minimal genetic code, composed of 21 tRNAs with vacant codons, which allows multiple alternative codon assignments to be introduced under otherwise comparable translation conditions.”

      Despite this technical limitation, we believe that the central conclusion of this study—that mutational robustness in individual reporter protein activity does not change significantly when the genetic code is altered within the range of mutational costs tested here—remains well-supported by the present results.

      (2) The mutational cost analysis appears biologically oversimplified because all amino acid substitutions are treated equivalently. The analysis assumes that all mutations contribute equally to fitness consequences, which does not reflect biological reality. In natural proteins, the impact of an amino acid substitution depends strongly on its structural and functional context. For example, substitutions affecting catalytic residues, ligand-binding interfaces, phosphorylation sites, or other regulatory motifs can severely impair protein function even when associated changes in polarity, hydropathy, or volume are minimal. Conversely, substitutions in structurally permissive or functionally dispensable regions may have little or no measurable effect despite larger physicochemical differences. Therefore, changes in polarity, hydropathy, and volume alone do not necessarily predict functional consequences.

      We agree that the mutational cost used in this study is a simplified measure and does not capture the full biological complexity of amino acid substitutions. As the reviewer pointed out, the functional consequence of a substitution depends strongly on its structural and functional context, including whether the affected residue is involved in catalysis, ligand binding, protein–protein interactions, regulatory motifs, folding, or structurally permissive regions.

      In this study, we used physicochemical-property-based mutational costs because this type of definition has been widely used in classical formulations of the error minimization theory. Our aim was therefore not to construct a comprehensive predictor of protein fitness effects, but to experimentally test whether the conventional theoretical cost metrics used to discuss genetic code optimality are reflected in the average mutation-induced decrease in reporter protein activity. We have now clarified this rationale in the revised manuscript.

      “It should be noted that this conclusion is limited to the activity of individual reporter proteins translated in a reconstituted in vitro system. Therefore, whether similar trends would be observed at the level of cellular fitness or long-term evolution remains an open question.”

      (3) It is not clear why they increased the concentration of the two tRNAs in near-SGC. Have they maintained the same tRNA concentrations in experiments explained in Fig 5 for all 10 genetic codes tested?

      We apologize that the rationale for increasing the concentrations of tRNA<sup>Val</sup><sub>CAC</sub> and tRNA<sup>Arg</sup><sub>CCU</sub> was not sufficiently clear in the original manuscript. As stated in the previous manuscript, “To improve translation efficiency with near-SGC, we focused on two tRNA concentrations (tRNA<sup>Val</sup><sub>CAC</sub> and tRNA<sup>Arg</sup><sub>CCU</sub>), which were suggested to have low activities in a previous study (Iwane et al., 2016).” We therefore tested whether increasing their concentrations would improve translation efficiency. As shown in Figure 1–figure supplement 1, NanoLuc activity increased as the concentrations of these two tRNAs were raised. Both tRNAs were therefore used at 100 ng/µL in the optimized near-SGC, referred to as near-SGC (RV), and in all subsequent experiments. Additional anticodon-variant tRNAs required for each non-SGC were used at optimized concentrations determined from Figure 2–figure supplement 1. For each genetic code, the same tRNA composition and concentrations were used for the low- and high-mutation libraries (see Supplementary Table S7). To clarify this point, we added the sentence “The increased concentrations of these two tRNAs were used in all the subsequent experiments” to the corresponding part of the manuscript.

      Reviewer #3 (Public review):

      In this manuscript, Miyachi and Ichihashi investigate whether the arrangement of the genetic code affects mutational robustness. Using an in vitro minimal genetic code with vacant codons, they constructed 10 non-standard genetic codes by reassigning Ala, Ser, and Leu, generating codes with replacement costs that were generally higher than those of the standard genetic code across several amino acid property measures. They then tested how random mutations affected the activity of reporter proteins translated under these altered codes. Although error minimization theory predicts that higher-cost codes should make mutations more harmful, the authors report that protein function declined to a similar extent across all codes examined, suggesting that mutational robustness remains largely unchanged within the range of genetic code alterations tested here.

      Strengths:

      This is an interesting study that investigates one of the most fundamental and intriguing questions in molecular evolution: the emergence of the genetic code, which is nearly universal across nature. The in vitro approach is a powerful aspect of the work and provides an opportunity to examine this phenomenon experimentally at a depth that has previously been inaccessible.

      Weaknesses:

      However, the authors' use of random mutation libraries has certain limitations that prevent the study from realizing its full potential to uncover the mechanisms governing the molecular evolution of the genetic code.

      We sincerely thank the reviewer for the positive evaluation of our study and for recognizing the strength of the in vitro approach. We are encouraged that the reviewer considers this system a powerful way to experimentally address the emergence of the genetic code.

      We also appreciate the reviewer’s constructive comments regarding the limitations of random mutation libraries. We agree that pooled random libraries do not allow us to assign functional effects to individual mutations or to fully uncover the molecular mechanisms underlying mutational robustness. In the revised manuscript, we therefore clarify that our conclusions concern the library-averaged effects of random mutations on individual reporter protein activity, rather than the effects of specific mutations or cellular-level fitness. To address this limitation, we have added explanations of the scope and limitations of the present approach.

      (1) Statistical analyses are missing for several of the manuscript's main claims. This issue applies throughout the paper, including, but not limited to, Figures 1D, 2B, 4B-D, and 5B.

      We thank the reviewer for this important comment. We agree that statistical analyses are necessary to support the major claims of the manuscript. We have therefore added statistical analyses appropriate for the purpose and experimental design of each figure.

      For Fig. 1D, we performed one-way ANOVA followed by Tukey’s post hoc test on NanoLuc activity to compare translation efficiencies among the MGC, near-SGC, near-SGC (RV), and SGC conditions. This analysis showed a significant overall difference among conditions (one-way ANOVA, p < 0.0001). Tukey’s post hoc test showed that near-SGC was significantly lower than MGC, that near-SGC (RV) significantly improved near-SGC translation, and that near-SGC (RV) was not significantly different from MGC. In contrast, the 46-tRNA SGC remained significantly less efficient than near-SGC (RV). We have summarized the major comparisons in Supplementary Table S8.

      For Fig. 2B, we compared NanoLuc activity between the 21-code control and the corresponding 21+1-code condition for each codon reassignment using Welch’s t-test on luminescence. This analysis was added to statistically support whether each anticodon-variant tRNA increased NanoLuc translation from the corresponding reassigned template. The statistical results are summarized in Supplementary Table S9.

      For Fig. 4B–D, we converted mutation rates per base to estimated numbers of mutations per gene and performed Spearman’s rank correlation analysis to evaluate whether reporter activity decreased monotonically with increasing mutational load. This analysis showed strong negative monotonic trends between mutation rate (estimated mutation number) and reporter activity for all three reporters (ρ = −0.90 to −1.00), supporting that the random mutation libraries reduced protein activity in a mutation-load-dependent manner.

      For Fig. 5B, replicate-level data were available for GAL, and we therefore performed two-way ANOVA using genetic code and mutation level as factors. This analysis detected significant main effects of genetic code and mutation level, indicating that GAL activity differed among genetic codes and decreased in the high-mutation library. However, no significant interaction between genetic code and mutation level was detected, indicating that the magnitude of mutation-induced activity reduction was not strongly code-dependent under the conditions examined.

      Finally, because the central claim of Fig. 5C, 5E, and 5G is that mutational cost does not systematically predict mutation-induced activity loss, we performed Spearman’s rank correlation analysis between each mutational cost metric and the high-/low-mutation activity ratio. No significant correlations were detected for any reporter or cost metric (Spearman’s ρ = −0.23 to 0.25), supporting the conclusion that mutational cost did not show a detectable monotonic relationship with mutation-induced activity loss within the tested range.
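The ratio-versus-cost test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names and every number in the usage example are hypothetical, and only the Python standard library is used. It computes the high-/low-mutation activity ratio for each genetic code and then Spearman's ρ as the Pearson correlation of tie-averaged ranks; p-values are not computed.

```python
from statistics import mean


def activity_ratios(high, low):
    """High-/low-mutation activity ratio for each genetic code."""
    return [h / l for h, l in zip(high, low)]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values for five genetic codes (not data from the study).
    cost = [1.0, 2.0, 3.0, 4.0, 5.0]
    low = [100.0, 80.0, 120.0, 90.0, 110.0]
    high = [50.0, 41.0, 59.0, 46.0, 54.0]
    print(spearman_rho(cost, activity_ratios(high, low)))
```

A ρ near zero, as reported for Fig. 5C, 5E, and 5G, means the ranking of codes by cost carries little information about their ranking by activity ratio; it does not by itself rule out a non-monotonic relationship.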

      We have added these statistical analyses to the revised manuscript. The following sentences were added to the figure legends:

      Fig. 1

      “Statistical comparisons in (D) were performed using one-way ANOVA followed by Tukey’s post hoc test on NanoLuc activity; major comparisons are summarized in Table S8.”

      Fig. 2

      “For each template, NanoLuc activity in the 21-code and corresponding 21+1-code conditions was compared using Welch’s t-test on luminescence. Statistical results are summarized in Table S9.”

      Fig. 4

      “Spearman’s rank correlation coefficients were ρ = −0.90 for GAL, ρ = −1.00 for Luc, and ρ = −1.00 for mSG.”

      Fig. 5

      “For GAL activity in (B), two-way ANOVA was performed using genetic code and mutation level as factors. Significant main effects of genetic code and mutation level were detected (both p < 0.0001), whereas their interaction was not significant. For (C), (E), and (G), Spearman’s rank correlation analysis was performed between each mutational cost metric and the high-/low-mutation activity ratio. Statistical details are summarized in Table S10.”

      (2) In Figure 2A, the authors modify the NanoLuc gene by reassigning Ala, Leu, or Ser to new codons and elegantly show that the in vitro availability of the corresponding tRNAs is important for protein function. However, the functional importance of the specific modified positions within NanoLuc is not clear. As a result, it is difficult to determine what the expected consequences of these codon changes should be, which in turn limits the interpretation of the observed changes in protein activity. To improve the interpretability of this experiment, the authors should report exactly how many codons were modified in each variant and, ideally, examine the effect of progressively increasing the number of reassigned codons.

      We agree that the exact positions and numbers of codon replacements should be clearly reported. In the revised manuscript, we have added a list of the modified amino acid positions. In brief, two Ala codons, three Ser codons, or four Leu codons were replaced with the target vacant codon; the modified positions were Ala16 and Ala120; Ser31, Ser49, and Ser150; and Leu32, Leu67, Leu144, and Leu170, respectively.

      We also agree that progressively increasing the number of reassigned codons would provide additional mechanistic insight. However, the purpose of Fig. 2 was to test whether each vacant codon could be decoded by the corresponding anticodon-variant tRNA to produce functional NanoLuc, rather than to analyze the positional contribution of each replacement. We previously performed such progressive codon replacement analysis for one reassigned codon, ACG, in a related study (Miyachi et al., 2025), and the results supported the same qualitative interpretation. Although we did not repeat this progressive analysis for all codons in the present study, we expect that the qualitative interpretation of Fig. 2 would not be substantially changed.

      We have revised the figure text to clarify the scope of the experiment and added the detailed codon replacement information.

      “(A) Schematic illustration of reassignment experiments. Translation with the original MGC and NanoLuc template is shown at the top for comparison. An example of Ala reassignment to the UUG codon is shown at the bottom. In this example, three Ala codons in the NanoLuc sequence were replaced with one type of vacant codon (e.g., UUG), generating a 21 + 1 (UUG-Ala) codon set. Similar reassignment experiments were performed for three amino acids (Ala, Ser, and Leu) and nine vacant codons. Specifically, two Ala codons (Ala16 and Ala120), three Ser codons (Ser31, Ser49, and Ser150), or four Leu codons (Leu32, Leu67, Leu144, and Leu170) were replaced.”

      (3) The calculations presented in Figure 3 raise an interesting conceptual question: why does the near-standard genetic code not exhibit the lowest cost? One possible explanation is that the standard genetic code evolved under multiple competing constraints and is therefore not expected to be optimal for any single cost metric, while still achieving strong overall performance. In this context, it would be informative if the authors combined the three cost measures into a single integrated index and examined whether the near-SGC performs more favorably when all three dimensions are considered together. Such an analysis could add important depth to the study.

      We agree that the near-SGC is not necessarily expected to minimize each individual cost metric, because the standard genetic code may reflect multiple competing physicochemical, translational, biosynthetic, and evolutionary constraints rather than optimization of a single property.

      To address this point, we added an integrated cost analysis combining the three physicochemical cost metrics, Cost<sub>PR</sub>, Cost<sub>MV</sub>, and Cost<sub>HI</sub>. Because these three metrics have different numerical scales, we normalized each metric before integration. We used two types of integrated indices.

      First, for each metric m ∈ {PR, MV, HI}, we calculated a min–max normalized cost,

      where G denotes the set of 19,683 candidate non-SGCs generated by assigning Ala, Ser, or Leu to the nine vacant codon boxes. We then defined the integrated min–max cost as

      Second, we calculated a z-score-normalized cost for each metric,

      where µ<sub>m,G</sub> and σ<sub>m,G</sub> are the mean and standard deviation of Cost<sub>m</sub> across the candidate non-SGCs. The integrated z-score cost was then defined as

      Using both integrated indices, the near-SGC ranked first when compared with all 19,683 candidate non-SGCs; in other words, no candidate non-SGC showed a lower integrated cost than the near-SGC. The integrated min–max cost of the near-SGC was 0.01525, whereas the lowest value among candidate non-SGCs was 0.12301. Similarly, the integrated z-score cost of the near-SGC was −2.47947, whereas the lowest candidate value was −1.90838.
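The two normalizations described above follow standard forms and can be written out explicitly. The following is a sketch, not necessarily the authors' exact notation: the tilde, the z symbol, and the "mm" and "int" labels are introduced here for illustration, and taking the unweighted mean over the three metrics is an assumption (a plain sum would rank the codes identically).

```latex
% Min-max normalization of metric m over the candidate set G
\[
\widetilde{\mathrm{Cost}}_{m}(g)
  = \frac{\mathrm{Cost}_{m}(g) - \min_{g' \in G} \mathrm{Cost}_{m}(g')}
         {\max_{g' \in G} \mathrm{Cost}_{m}(g') - \min_{g' \in G} \mathrm{Cost}_{m}(g')},
\qquad m \in \{\mathrm{PR}, \mathrm{MV}, \mathrm{HI}\}
\]

% Integrated min-max cost (unweighted mean assumed)
\[
\mathrm{Cost}^{\mathrm{mm}}_{\mathrm{int}}(g)
  = \frac{1}{3} \sum_{m \in \{\mathrm{PR}, \mathrm{MV}, \mathrm{HI}\}} \widetilde{\mathrm{Cost}}_{m}(g)
\]

% z-score normalization of metric m over the candidate set G
\[
z_{m}(g) = \frac{\mathrm{Cost}_{m}(g) - \mu_{m,G}}{\sigma_{m,G}}
\]

% Integrated z-score cost (unweighted mean assumed)
\[
\mathrm{Cost}^{z}_{\mathrm{int}}(g)
  = \frac{1}{3} \sum_{m \in \{\mathrm{PR}, \mathrm{MV}, \mathrm{HI}\}} z_{m}(g)
\]
```

Because the minimum, maximum, mean, and standard deviation are all taken over G, a code that is scored against G without belonging to it is not confined to the interval [0, 1] on the min–max scale.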

      We have added this integrated cost analysis as Figure 5–figure supplement 7. We have also revised the Discussion to note that the near-SGC does not necessarily minimize every individual physicochemical cost, but performs most favorably when PR, MV, and HI are considered together. This result is consistent with the idea that the standard genetic code may represent a compromise among multiple constraints rather than optimization of a single physicochemical property.

      “We consider that the cost ranges examined in this study represent substantial fractions, especially for MV and HI. Although the near-SGC did not necessarily exhibit the lowest cost for each individual physicochemical metric, this does not mean that it is unfavorable in the multidimensional cost space. Because the SGC may reflect a balance among multiple physicochemical constraints rather than optimization of a single property, we also calculated integrated cost indices by combining Cost_PR, Cost_MV, and Cost_HI after min–max normalization or z-score normalization. In both integrated indices, the near-SGC showed the lowest overall cost when compared with all 19,683 candidate non-SGCs (Figure 5–figure supplement 7), indicating that no candidate non-SGC exhibited a lower combined cost than the near-SGC when the three physicochemical properties were considered comprehensively.”

      (4) It is difficult to assess the consequences of the random mutations presented in Figure 4 on reporter gene function based solely on the reported "error rate/base" parameter. In particular, the x-axis in Figure 4B should be converted into the estimated number of mutations per gene. This would make the results more intuitive and would allow the reader to better evaluate the expected degree of disruption to protein function.

      We agree that the mutation rate per base alone does not provide an intuitive sense of the expected mutational burden for each reporter gene. We therefore added a second x-axis to Fig. 4B–D showing the estimated number of mutations per gene. This value was calculated by multiplying the mutation rate per base by the coding sequence length of each reporter gene.

      We retained the original mutation rate per base axis to preserve the direct link to the sequencing-based mutation rate measurement, while adding the estimated mutations per gene axis to improve interpretability. We have revised Fig. 4 and its legend accordingly.

      “The lower x-axis indicates the estimated number of mutations per gene, calculated by multiplying the mutation rate per base by the coding sequence length of each reporter gene.”

      (5) A central limitation of the random mutagenesis libraries used in Figure 5, which also underlie one of the manuscript's main claims, is that the exact mutations and their distribution across the reporter genes are not reported. In addition, protein activity is measured only at the level of the entire library, without directly linking individual mutations to their functional consequences. This substantially limits mechanistic interpretation. In my view, this issue can only be addressed convincingly if the authors test a set of defined variants carrying specific mutations and directly evaluate their functional effects.

      (6) Related to the previous point, in Figures 5C, 5E, and 5G, the authors present the ratio between low-mutation-rate and high-mutation-rate libraries. However, because each library contains a different collection of mutations, it is unclear what can be inferred from these comparisons. To overcome this limitation, the authors should assess the effects of altered genetic codes on specific, defined mutations rather than on heterogeneous mutation pools alone.

      (7) Along the same lines, in Figures 5C, 5E, and 5G, it is unclear why the effects of random mutations would be expected to correlate with the three calculated cost metrics, given that the positions, identities, and functional relevance of the mutations within the genes are not known. Without this information, the biological meaning of these correlations remains difficult to evaluate.

      We agree that using pooled random mutation libraries does not allow us to directly link individual mutations to their functional consequences. We also agree that testing defined variants carrying specific mutations would provide a more direct and mechanistic understanding of how each genetic code affects the functional impact of particular amino acid substitutions. However, the purpose of the present study was different from such a defined-variant analysis. Our aim was to experimentally test whether the conventional mutational cost metrics used in error minimization theory predict the average effect of random mutational loads on protein activity. Because these theoretical costs are themselves defined as average expected physicochemical effects over many possible single-nucleotide substitutions, we reasoned that pooled random mutation libraries provide an appropriate first experimental framework to evaluate whether such average-cost metrics are reflected in the average functional output of translated proteins.

      We agree that low- and high-mutation libraries do not contain identical sets of mutations. Therefore, the high-/low-mutation activity ratio should not be interpreted as the effect of the same individual variants before and after additional mutations. Rather, it represents the relative reduction in average activity caused by increasing the mutational burden in a heterogeneous mutation pool under each genetic code. We have revised the text to clarify this interpretation.

      We also agree that the positions, identities, and functional relevance of individual mutations are not resolved in this pooled assay. This limitation prevents us from assigning mechanistic effects to specific substitutions. At the same time, using a small set of defined variants would introduce its own selection bias, because the conclusions could strongly depend on which mutations and which protein positions were chosen. Therefore, we consider the random-library approach to be a useful first step for testing library-averaged effects, whereas systematically defined variant analysis or genotype-resolved activity assays will be necessary to reveal mutation-specific mechanisms in future studies.

      In response to the reviewer’s concern, we have revised the Discussion to explicitly limit our conclusion to library-averaged effects on individual reporter protein activity. We now state that this approach does not identify the functional effects of individual mutations and that future studies using defined variants or high-throughput genotype–phenotype mapping will be required to determine how specific substitutions contribute to genetic code-dependent mutational robustness.

      Results

      “To estimate the average activity reduction associated with increased mutational burden under each genetic code, we calculated the ratio of activity obtained from the high-mutation library to that from the corresponding low-mutation library and plotted this ratio against each of the three mutational costs (Fig. 5C).”

      Discussion

      “A further limitation of this study is that the reporter activities were measured at the level of pooled random mutation libraries. Therefore, the high-/low-mutation activity ratio used in this study should be interpreted as the relative reduction in average activity caused by increasing the mutational burden in a heterogeneous mutation pool, rather than as the effect of identical variants before and after additional mutations. This library-averaged approach was chosen because the mutational costs considered here are also defined as average expected physicochemical effects over many possible single-nucleotide substitutions. In addition, because the non-SGCs constructed in this study were generated by reassigning only Ala, Ser, and Leu, the detectable effects may depend on how frequently mutations involving these amino acids occur in each reporter gene and whether the affected positions are functionally important. If genetic code dependent effects are restricted to a small subset of deleterious variants, such effects may be masked in pooled activity measurements. Future studies using defined variants or high-throughput genotype–phenotype mapping assays will be required to determine the mutation-specific and position-specific mechanisms underlying genetic code dependent effects on protein function (Rozhoňová et al., 2024).”

      (8) For each mutagenesis library, the number of variants, the average number of mutations per variant, and the distribution of mutation positions should be reported clearly and transparently. These details are important for evaluating the strength of the conclusions.

      We agree that a more transparent characterization of the random mutagenesis libraries is necessary for evaluating the strength and limitations of our conclusions.

      In the revised manuscript, we have added the estimated number of mutations per gene to the Results section. This value was calculated by multiplying the mutation rate per base by the coding sequence length of each reporter gene. For the high-mutation libraries used in Fig. 5, the estimated numbers of mutations per gene were approximately 8.0 for GAL, 4.5 for Luc, and 3.3 for mSG. We also added position-wise mutation profiles along each reporter gene (Figure 4–figure supplement 2), in addition to the heatmap shown in the original manuscript. These analyses clarify the mutational burden of each library and show that mutations were broadly distributed across the analyzed regions (approximately 300 nt in the middle of each gene) of the reporter genes.
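The conversion from a per-base mutation rate to an expected number of mutations per gene is a single multiplication. A minimal sketch follows, assuming coding-sequence lengths of 3075, 1653, and 690 nt for GAL, Luc, and mSG; these lengths are not stated in the response and are illustrative values chosen to be consistent with the reported results. The per-base rates are the ones reported for the high-mutation libraries.

```python
def mutations_per_gene(rate_per_base, cds_length_nt):
    """Expected number of mutations per gene: per-base rate times CDS length."""
    return rate_per_base * cds_length_nt


# (rate per base, assumed CDS length in nt); the lengths are hypothetical.
REPORTERS = {
    "GAL": (0.0026, 3075),
    "Luc": (0.0027, 1653),
    "mSG": (0.0048, 690),
}

if __name__ == "__main__":
    for name, (rate, length) in REPORTERS.items():
        print(f"{name}: {mutations_per_gene(rate, length):.1f} mutations per gene")
```

The same per-base rate therefore implies a heavier burden on a longer gene, which is why GAL carries the most mutations per gene despite having the lowest per-base rate of the three.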

      Regarding the number of variants, the translation reactions were performed using 5 nM DNA template in a 5 µL reaction, corresponding to approximately 1.5 × 10<sup>10</sup> DNA molecules. However, this value is the total number of DNA molecules introduced into the reaction and does not directly indicate the number of unique full-length sequence variants, because multiple molecules can share the same genotype. In addition, our sequencing analysis was designed to quantify mutation frequencies and positional distributions rather than to reconstruct the full-length genotypes of individual library members. Therefore, we do not infer the exact number of unique variants in each library. Instead, we report the average mutation burden and the position-wise distributions of non-reference rates.
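The molecule count quoted above follows from concentration × volume × Avogadro's number. A short sketch of the arithmetic, with a hypothetical function name:

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def template_molecules(concentration_nM, volume_uL):
    """Number of template molecules in a reaction of the given size."""
    moles = concentration_nM * 1e-9 * volume_uL * 1e-6  # mol/L times L
    return moles * AVOGADRO


if __name__ == "__main__":
    # 5 nM template in a 5 µL reaction, as stated in the response.
    print(f"{template_molecules(5, 5):.2e} molecules")
```

For 5 nM in 5 µL this is 2.5 × 10⁻¹⁴ mol, or about 1.5 × 10¹⁰ molecules. It is an upper bound on library diversity, not an estimate of the number of distinct genotypes.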

      We have revised the Results and added Figure 4–figure supplement 2 accordingly.

      “For this experiment, two random mutation libraries were used: a low-mutation library prepared using the high-fidelity polymerase and a high-mutation library prepared using Taq DNA polymerase at a Mn<sup>2+</sup> concentration that yields mutation rates of 0.002–0.005 per base (0.0026 for GAL, 0.0027 for Luc, and 0.0048 for mSG, corresponding to approximately 8.0, 4.5, and 3.3 mutations per gene, respectively). We also plotted position-wise non-reference rates along the analyzed regions of each reporter gene, confirming that mutations were broadly distributed across the amplicons (Figure 4–figure supplement 2).”
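
      The relationship between the per-base mutation rate and the estimated number of mutations per gene can be sketched as follows. The rates and mutation counts are the values reported above. The coding sequence lengths are not stated in the response, so the sketch back-calculates the lengths implied by the reported numbers; these are approximate because the reported values are rounded, and the function names are illustrative.

```python
# Per-base mutation rates and estimated mutations per gene, as reported above.
rate_per_base = {"GAL": 0.0026, "Luc": 0.0027, "mSG": 0.0048}
reported_mutations = {"GAL": 8.0, "Luc": 4.5, "mSG": 3.3}


def expected_mutations(rate: float, cds_length_nt: float) -> float:
    """Expected mutations per gene = per-base mutation rate x coding sequence length."""
    return rate * cds_length_nt


def implied_cds_length(rate: float, mutations: float) -> float:
    """Invert the relation to recover the coding sequence length the numbers imply."""
    return mutations / rate


# Implied lengths are derived quantities, not values given in the source.
implied_length = {
    gene: implied_cds_length(rate_per_base[gene], reported_mutations[gene])
    for gene in rate_per_base
}

for gene, length in implied_length.items():
    print(f"{gene}: ~{length:.0f} nt implied coding sequence length")
```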

      (9) Because only three amino acids were manipulated in the non-standard genetic codes, it remains unclear whether these particular amino acids occupy positions in the reporter proteins that are especially important for function and therefore likely to generate strong phenotypic effects. More broadly, it is not clear whether the assay is sufficiently sensitive to detect the effects of only a subset of deleterious variants within a pooled library. This point should be addressed more explicitly.

      We agree that this is an important limitation of the present study. Because our non-SGCs were constructed by reassigning only Ala, Ser, and Leu, the mutation-dependent effects that can differ among genetic codes are limited to mutations involving these reassigned codons or amino acid substitutions affected by these assignments. Therefore, the sensitivity of the assay depends on how frequently such substitutions occur in the reporter genes and whether the affected Ala-, Ser-, and Leu-related positions are functionally important.

      We have revised the Discussion to address this point more explicitly. In the revised manuscript, we now state that the absence of a detectable cost-dependent effect may reflect not only the limited cost range examined, but also the limited set of reassigned amino acids, the position-dependent importance of Ala/Ser/Leu residues in the reporter proteins, and the sensitivity limit of pooled activity measurements. We further note that future studies using genotype-resolved activity assays (defined variants) will be required to determine whether specific amino acid substitutions or specific protein positions exhibit stronger genetic code-dependent effects.

      “A further limitation of this study is that the reporter activities were measured at the level of pooled random mutation libraries. Therefore, the high-/low-mutation activity ratio used in this study should be interpreted as the relative reduction in average activity caused by increasing the mutational burden in a heterogeneous mutation pool, rather than as the effect of identical variants before and after additional mutations. This library-averaged approach was chosen because the mutational costs considered here are also defined as average expected physicochemical effects over many possible single-nucleotide substitutions. In addition, because the non-SGCs constructed in this study were generated by reassigning only Ala, Ser, and Leu, the detectable effects may depend on how frequently mutations involving these amino acids occur in each reporter gene and whether the affected positions are functionally important. If genetic code-dependent effects are restricted to a small subset of deleterious variants, such effects may be masked in pooled activity measurements. Future studies using defined variants or high-throughput genotype–phenotype mapping assays will be required to determine the mutation-specific and position-specific mechanisms underlying genetic code-dependent effects on protein function (Rozhoňová et al., 2024).”

      Recommendations for the authors:

      Reviewing Editor Comments:

      While we suggest that you address all the technical points raised by the reviewers, you may specifically want to limit the conclusion of the study to mutational robustness at the level of individual protein activity, rather than making broader generalizations. Also, the statistical analysis needs to be strengthened, as indicated in the reviews.

      We thank the Reviewing Editor for these important suggestions. We agree that the conclusion of the original manuscript was broader than what can be directly supported by the present experiments. In the revised manuscript, we have therefore limited our conclusion to mutational robustness at the level of individual reporter protein activity measured in a reconstituted in vitro translation system. We now explicitly state that our results do not directly address robustness at the level of cellular fitness, protein interaction networks, or long-term evolution.

      We have also strengthened the statistical analyses throughout the manuscript. Specifically, we added one-way ANOVA followed by Tukey’s post hoc test for Fig. 1D, Welch’s t-tests for Fig. 2B, Spearman’s rank correlation analyses for Fig. 4B–D and Fig. 5C/E/G, and two-way ANOVA for GAL activity in Fig. 5B. These analyses have been incorporated into the revised Results, figure legends, and supplementary information.

      Reviewer #2 (Recommendations for the authors):

      (1) Discuss other alternative hypotheses if the error minimization theory is unlikely.

      We thank the reviewer for this helpful suggestion. We think that the absence of a detectable relationship between mutational cost and reporter protein activity in our assay should not be interpreted as excluding all possible roles of error minimization in the evolution of the genetic code. Our results specifically address one aspect of the error minimization theory: whether physicochemical-property-based mutational cost predicts the average effect of random point mutations on individual reporter protein activity within the experimentally accessible range of non-SGCs tested here.

      In the revised Discussion, we have clarified that the organization of the SGC may have been shaped by multiple factors, including robustness to translational errors, historical constraints associated with genetic code expansion, biosynthetic or coevolutionary processes, stereochemical interactions, and the evolvability of proteins. Our results suggest that the contribution of mutational robustness at the level of individual protein activity may be limited within the range examined here, but they do not exclude the possibility that the SGC provides advantages under other forms of error or at the level of translation fidelity, cellular fitness, or long-term evolution.

      We have added a short discussion to clarify this point without expanding the scope of the manuscript beyond the present experimental results.

      “It should be noted that this conclusion is limited to the activity of individual reporter proteins translated in a reconstituted in vitro system. Therefore, whether similar trends would be observed at the level of cellular fitness or long-term evolution remains an open question. Moreover, our results do not exclude other possible roles of SGC organization. The SGC may have been shaped by multiple factors, including robustness to translational errors, historical constraints during genetic code expansion, biosynthetic or coevolutionary relationships among amino acids, stereochemical interactions, and effects on protein evolvability (Katoh and Suga, 2023; Koonin and Novozhilov, 2017, 2009; Novozhilov et al., 2007; Wong, 2005).”

      (2) A brief description of the PURE translation system can be provided for people from outside the field.

      We have added a brief description of the PURE system in the Introduction to make the experimental platform more accessible to readers outside the field. Specifically, we now explain that the PURE system is a reconstituted cell-free translation system composed of purified translation factors, ribosomes, aminoacyl-tRNA synthetases, tRNAs, amino acids, and energy-regeneration components. We also clarify that, in this study, we used a tRNA-free version of the PURE system, in which defined synthetic tRNA sets were supplied externally to reconstruct each genetic code.

      Introduction

      “A representative platform for such reconstitution is the PURE system (Shimizu et al., 2001), a reconstituted cell-free translation system composed of purified translation components, including ribosomes, translation factors, aaRSs, amino acids, and energy-regeneration components. In particular, a tRNA-free PURE system (Miyachi et al., 2022), in which endogenous tRNA activity is minimized and defined tRNA sets are supplied externally, enables genetic codes to be reconstructed by controlling the supplied tRNAs.”

      (3) Figure 5D and F - Technical replicates are provided only for GAL. A similar approach should be taken for LUC and mSG.

      We agree that replicate-level measurements for Luc and mSG would further improve reliability. However, repeating the full translation experiments for these reporters was not feasible in the current revision, as each experiment requires large amounts of freshly prepared tRNA-free PURE system and multiple defined tRNA mixtures for every genetic code variant tested. Given these material and technical constraints, we were unable to perform additional biological replicates within the scope of this revision. We would like to emphasize, however, that the GAL replicates shown in Fig. 5D and F are fully consistent across independent experiments, providing direct evidence for the reproducibility of the assay itself. Furthermore, the key metric in our analysis, the activity ratio between high- and low-mutation groups within each genetic code, is an internally normalized measure that is inherently less sensitive to between-experiment variability than absolute activity values. The correlation analyses further showed no significant relationship between mutational cost and this ratio across all three reporters, and this conclusion is consistent regardless of which reporter is examined. Together, we believe these results provide a robust basis for the conclusions drawn, even in the absence of full replication for Luc and mSG.

      (4) Provide statistical analysis wherever it is relevant (e.g., to support a lack of correlation).

      We have strengthened the statistical analyses throughout the revised manuscript. In particular, to support the lack of detectable correlation between mutational cost and mutation-induced activity loss, we performed Spearman’s rank correlation analyses between each mutational cost metric and the high-/low-mutation activity ratio for all three reporters. No significant correlations were detected for any reporter or cost metric. In addition, we added statistical analyses for other relevant figures, including one-way ANOVA followed by Tukey’s post hoc test for Fig. 1D, Welch’s t-tests for Fig. 2B, Spearman’s rank correlation analyses for Fig. 4B–D, and two-way ANOVA for GAL activity in Fig. 5B.
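
      For readers unfamiliar with the test, Spearman’s rank correlation is the Pearson correlation of the ranks of two variables. The following is a minimal pure-Python sketch of the coefficient only (no p-value). The toy values are hypothetical placeholders and are not data from the study, and the function names are illustrative rather than the authors’ analysis code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder values, NOT measurements from the study.
toy_cost = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_ratio = [0.40, 0.55, 0.35, 0.50, 0.45]
rho = spearman_rho(toy_cost, toy_ratio)
print(f"rho = {rho:.2f}")
```

      A coefficient near zero, as in this toy example, indicates no monotonic trend; judging significance additionally requires a p-value, which this sketch does not compute.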

      Reviewer #3 (Recommendations for the authors):

      (1) In line 122, the phrase "as evenly as possible" is ambiguous and should be explained more precisely.

      We thank the reviewer for pointing this out. We have revised the phrase “as evenly as possible” to describe the codon design more precisely. Specifically, we now state that the NanoLuc coding sequences were designed so that the codons available in each genetic code were used with minimal differences in codon counts, while preserving the amino acid sequence of NanoLuc.

      “For near-SGC and SGC, the NanoLuc coding sequences were designed so that the codons available in each genetic code were used with minimal differences in codon counts, while preserving the amino acid sequence (Fig. 1B, 32 codons and 46 codons).”

      (2) For Figure 1D, a Western blot or another protein gel-based assay would be helpful to exclude the possibility that the observed differences arise from variation in translation efficiency rather than differences in protein activity.

      We agree that a protein gel-based assay such as Western blotting would in principle allow us to distinguish differences in translated protein amount from differences in specific activity, and we understand why such data would be informative. However, we would like to clarify that the primary purpose of Fig. 1D was to evaluate the overall functional translation output of each reconstructed genetic code, rather than to determine the mechanistic basis of any observed differences. In this context, NanoLuc luminescence serves as an integrated readout of the entire translation process, encompassing both translational efficiency and protein folding/activity. Crucially, regardless of whether the observed differences in NanoLuc luminescence reflect lower protein yield, reduced specific activity, or a combination of both, the conclusion of Fig. 1D remains the same. Although we did not perform Western blotting in this study, we believe that such an analysis would not change this interpretation and that the current data are sufficient to support this conclusion.

      (3) The number 3^9 is not immediately intuitive. It would be helpful if the authors also stated that this corresponds to approximately 20,000 possible non-standard genetic codes.

      We have revised the text to state both the exact number and the approximate value: 3<sup>9</sup> = 19,683, approximately 20,000 possible non-standard genetic codes.
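
      The count can be reproduced by enumeration. This sketch assumes that each of nine vacant codon boxes independently receives one of the three reassigned amino acids (Ala, Ser, or Leu); the number of boxes is inferred from the exponent in 3<sup>9</sup> and is not stated explicitly in this response.

```python
from itertools import product

AMINO_ACIDS = ("Ala", "Ser", "Leu")  # the three reassigned amino acids
N_VACANT_BOXES = 9  # assumption: inferred from the exponent in 3^9

# Closed-form count: three choices for each of nine boxes.
n_codes = len(AMINO_ACIDS) ** N_VACANT_BOXES

# Explicit enumeration of every assignment, as a cross-check.
n_enumerated = sum(1 for _ in product(AMINO_ACIDS, repeat=N_VACANT_BOXES))

print(n_codes, n_enumerated)
```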

      (4) The rationale for using the three cost parameters (PR, MV, and HI) should be explained in greater detail. Because these parameters are central to the manuscript, a citation alone is not sufficient. A concise explanation of their biological relevance would improve the clarity and accessibility of the study.

      We agree that the biological relevance of the three cost parameters should be explained more clearly. In the revised manuscript, we have added a concise explanation of why polar requirement (PR), molecular volume (MV), and hydropathy index (HI) were used.

      These parameters were selected because they have been widely used in theoretical studies of genetic code optimality and represent distinct physicochemical aspects of amino acid substitutions. PR reflects polarity-related interactions and has been a classical metric in error minimization analyses of the genetic code. MV represents side-chain size and steric volume, which could influence packing and structural stability in proteins. HI reflects hydrophobicity, which is closely related to protein folding and hydrophobic core formation. We have also clarified that these metrics are simplified descriptors and do not capture residue-specific structural or functional context, which we now discuss as a limitation of the study.

      “PR reflects polarity-related interactions of amino acids and has been used as a classical measure of amino acid similarity in error minimization analyses. MV represents side-chain size and steric volume, which could affect protein packing and structural stability, whereas HI reflects hydrophobicity, which could be closely related to protein folding or hydrophobic core formation.”

      (5) In Figure 3, the experimental framework would be easier to follow if the authors included a schematic and data for one representative non-SGC, explicitly illustrating how it differs from the near-SGC with respect to each of the three cost measures.

      We agree that showing one representative non-SGC would make the experimental framework and cost calculation more intuitive.

      In the revised manuscript, we added a new panel to Fig. 3 comparing the near-SGC with a representative non-SGC. We selected the PR<sub>max</sub> code as the representative example because it clearly illustrates how reassignment of vacant codon boxes can increase one mutational cost metric relative to the near-SGC. In this panel, we first show the codon assignment schemes of the near-SGC and PR<sub>max</sub> code in the same genetic-code format used in Fig. 1. We then show the corresponding heatmap representations for the three physicochemical properties used in the cost calculation: polar requirement, molecular volume, and hydropathy index. The Cost<sub>PR</sub>, Cost<sub>MV</sub>, and Cost<sub>HI</sub> values are shown for each code.

      This new panel illustrates how changes in codon assignment are translated into different physicochemical cost landscapes and clarifies how the representative non-SGC differs from the near-SGC with respect to each of the three cost measures.

      “To make the design of non-SGCs more explicit, we show one representative non-SGC together with the near-SGC in Fig. 3B. This comparison illustrates how assignment of Ala, Ser, or Leu to the vacant codon boxes changes the three mutational cost metrics, Cost<sub>PR</sub>, Cost<sub>MV</sub>, and Cost<sub>HI</sub>.”

      (6) In line 329, the phrase "similar pattern" is ambiguous and should be explained more explicitly.

      We have revised the ambiguous phrase “similar pattern” to describe the observation more explicitly. Specifically, we now state that the relative differences in GAL activity among genetic codes observed in the low-mutation library were broadly retained in the high-mutation library, although overall activity decreased.

      “For the high-mutation library, GAL activity decreased overall, while the relative differences in activity among genetic codes observed in the low-mutation library were broadly retained.”

      (7) Figure S7 appears to be an important control for the experiments shown in Figure 5, and I recommend moving it to the main figures.

      We thank the reviewer for this helpful suggestion. We agree that the HiBiT-based quantification of GAL protein amount is an important control for interpreting the GAL activity measurements in Fig. 5, and we appreciate the recommendation to increase its visibility. This analysis shows that the amount of C-terminally completed GAL products was broadly comparable among genetic codes, indicating that the large differences in GAL activity were not primarily attributable to differences in total translated protein amount.

      After careful consideration, we have opted to retain this analysis in the supplementary figures because the main focus of Fig. 5 is the relationship between mutational cost and mutation-induced activity loss, quantified by the high-/low-mutation activity ratio. The HiBiT experiment addresses a related but distinct question: whether differences in absolute GAL activity among genetic codes can be explained by differences in protein abundance, and we felt that including it in the main figures might shift the emphasis away from the central message of Fig. 5. Nevertheless, we have added a clear reference to Figure 4–figure supplement 1 in the main text and the figure legend to ensure that readers are directed to this control when interpreting Fig. 5.

    1. eLife Assessment

      This study reports important advances in our understanding of how enteropathogenic E. coli (EPEC) interacts at the intestinal interface. Compelling data describe a novel model of spatially coordinated calcium signaling to modulate NF-kB activation. These findings, which integrate imaging, genetics, and computational modeling, provide a new way to consider host-pathogen interactions in EPEC infections that may lead to improved therapies.

    2. Reviewer #1 (Public review):

      Summary:

      In their article, Guo and coworkers investigate the Ca²⁺ signaling responses induced by Enteropathogenic Escherichia coli (EPEC) in epithelial cells and how these responses regulate NF-κB activation. The authors show that EPEC induces rapid, spatially coordinated Ca²⁺ transients mediated by extracellular ATP released through the type III secretion system (T3SS). Using high-speed Ca²⁺ imaging and stochastic modeling, they propose that low ATP levels trigger "Coordinated Ca²⁺ Responses from IP₃R Clusters" (CCRICs) via fast Ca²⁺ diffusion and Ca²⁺-induced Ca²⁺ release. These responses may dampen TNF-α-induced NF-κB activation through Ca²⁺-dependent modulation of O-GlcNAcylation of p65. The interdisciplinary work suggests a new perspective on calcium-mediated immune response by combining quantitative imaging, bacterial genetics, and computational modeling.

      Strengths:

      The study provides a new concept for host responses to bacterial infections and introduces the concept of Coordinated Ca²⁺ Responses from IP₃R Clusters (CCRICs) as synchronized, whole-cell-scale Ca²⁺ transients with the fast kinetics typical of local events. This is elegantly done by an interdisciplinary approach using quantitative measurements and mechanistic modelling.

      Comments on revised version.

      The revised version of the manuscript has addressed all my raised points. I'd like to thank the authors for the work they have put into the revision to make this a very compelling publication.

    3. Reviewer #2 (Public review):

      Summary:

      The authors of this study are trying to resolve how cellular infection by enteropathogenic E. coli (EPEC) subverts cellular signaling pathways to promote infection and dampen immune responses. Specifically, alteration in calcium dynamics has been evidenced in the prior literature as a potential initiator of these adaptations, and this study provides ideas and mechanistic detail as to how cellular calcium dynamics may be subverted by pathogens.

      Strengths:

      The clear strengths of this paper relate to the new ideas inherent in the proposed hypothesis and their support from the experimental approaches used. Overall, the proposed work provides new ideas in this area, which will benefit from further investigation. Certainly, this is an interesting and challenging paradigm to pick apart mechanistically, and is important for improving treatments for intestinal infections. The authors have provided additional data to clarify and expand on concerns raised during the original review, and these additions are helpful.

      Comments on revised version.

      Thorough response to original review. No further comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their article, Guo and coworkers investigate the Ca²⁺ signaling responses induced by Enteropathogenic Escherichia coli (EPEC) in epithelial cells and how these responses regulate NF-κB activation. The authors show that EPEC induces rapid, spatially coordinated Ca²⁺ transients mediated by extracellular ATP released through the type III secretion system (T3SS). Using high-speed Ca²⁺ imaging and stochastic modeling, they propose that low ATP levels trigger "Coordinated Ca²⁺ Responses from IP₃R Clusters" (CCRICs) via fast Ca²⁺ diffusion and Ca²⁺-induced Ca²⁺ release. These responses may dampen TNF-α-induced NF-κB activation through Ca²⁺-dependent modulation of O-GlcNAcylation of p65. The interdisciplinary work suggests a new perspective on calcium-mediated immune response by combining quantitative imaging, bacterial genetics, and computational modeling.

      Strengths:

      The study provides a new concept for host responses to bacterial infections and introduces the concept of Coordinated Ca²⁺ Responses from IP₃R Clusters (CCRICs) as synchronized, whole-cell-scale Ca²⁺ transients with the fast kinetics typical of local events. This is elegantly done by an interdisciplinary approach using quantitative measurements and mechanistic modelling.

      Weaknesses:

      (1) The effect of coordination by fast diffusion for small eATP concentrations is explained by the resulting low Ca2+ concentration that is not as strongly affected by calcium buffers compared to higher concentrations. While I agree with this statement on the relative level, CICR is based on the resulting absolute concentration at neighboring IP3Rs (to activate them). Thus, I do not fully agree with the explanation, or at least would expect to use the modelling approach to demonstrate this effect. Simulations for different activation and buffer concentrations could strengthen this point and exclude potential inhibition of channels at higher stimulation levels.

      We fully agree that CICR is determined by the local Ca<sup>2+</sup> concentration at each IP<sub>3</sub>R cluster, not by a global cytosolic average. In our stochastic model, IP<sub>3</sub>R clusters are represented as phenomenological entities at discrete spatial sites. Each cluster senses the local Ca<sup>2+</sup> concentration at its position, and its stochastic gating depends only on this local [Ca<sup>2+</sup>] and on [IP<sub>3</sub>]. Buffers are not included explicitly. Instead, we use an effective Ca<sup>2+</sup> diffusion coefficient, Deff, which accounts for the effect of endogenous Ca<sup>2+</sup> buffers. To reproduce the coordinated low-amplitude Ca<sup>2+</sup> responses observed experimentally, we found that we had to use Deff = 100 µm<sup>2</sup>/s. In the supplementary analysis, we show that an effective diffusion coefficient of this order is indeed plausible for a realistic mixture of mobile and immobile Ca<sup>2+</sup> buffers (Supplementary Note 2, Figure 1).

      In the revised manuscript, we now provide a supplementary analysis (Supplementary Note 2) to justify this choice. Using an equation to compute the effective diffusion coefficient considering a plausible mixture of mobile and immobile buffers and an explicit reaction–diffusion model, we show that:

      - The effective diffusion coefficient of Ca<sup>2+</sup> becomes Ca<sup>2+</sup>-dependent, and

      - There exists a regime in which low-amplitude Ca<sup>2+</sup> elevations are characterized by an effective diffusion coefficient of Deff = 100 µm<sup>2</sup>/s and a larger spatial extent than higher-amplitude transients (Supplementary Note 2, Figure 1).

      Thus, the value of Deff used in the cluster model is quantitatively consistent with classical buffering theory and with plausible cytosolic buffer mixtures. This provides a mechanistic basis for the observation that small-amplitude, short-lived events can nevertheless produce coordinated signals with large spatial extent and, occasionally, almost immediate activation of IP<sub>3</sub>R clusters at distant locations in both simulations and experiments.
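
      The Ca<sup>2+</sup> dependence of the effective diffusion coefficient can be illustrated with the textbook rapid-buffering approximation, in which Deff = (D<sub>Ca</sub> + Σ D<sub>m</sub>β<sub>m</sub>) / (1 + Σ β<sub>m</sub> + Σ β<sub>i</sub>) and β = K·B<sub>total</sub> / (K + [Ca<sup>2+</sup>])<sup>2</sup> for each mobile (m) or immobile (i) buffer. This is a sketch under stated assumptions: it is not necessarily the equation used in Supplementary Note 2, it neglects gradient-dependent correction terms, and every parameter value below is a hypothetical illustration rather than a value taken from the manuscript.

```python
def buffer_capacity(ca, k_d, b_total):
    """Differential buffering capacity d[CaB]/d[Ca] of a single-site buffer at equilibrium."""
    return k_d * b_total / (k_d + ca) ** 2


def effective_diffusion(ca, d_ca, mobile, immobile):
    """Rapid-buffering estimate of the effective Ca2+ diffusion coefficient.

    ca       : free Ca2+ concentration (µM)
    d_ca     : diffusion coefficient of free Ca2+ (µm^2/s)
    mobile   : list of (k_d in µM, total concentration in µM, buffer diffusion in µm^2/s)
    immobile : list of (k_d in µM, total concentration in µM)
    """
    numerator = d_ca
    denominator = 1.0
    for k_d, b_total, d_buffer in mobile:
        beta = buffer_capacity(ca, k_d, b_total)
        numerator += d_buffer * beta  # Ca2+ carried by the diffusing buffer
        denominator += beta
    for k_d, b_total in immobile:
        denominator += buffer_capacity(ca, k_d, b_total)  # bound Ca2+ does not move
    return numerator / denominator


# Hypothetical parameters chosen only to illustrate the qualitative regime.
D_CA = 220.0                     # µm^2/s, assumed free Ca2+ diffusion coefficient
MOBILE = [(0.2, 50.0, 100.0)]    # high-affinity mobile buffer (assumed)
IMMOBILE = [(10.0, 200.0)]       # low-affinity immobile buffer (assumed)

deff_low = effective_diffusion(0.05, D_CA, MOBILE, IMMOBILE)   # low-amplitude signal
deff_high = effective_diffusion(1.0, D_CA, MOBILE, IMMOBILE)   # higher-amplitude signal
print(f"Deff at 0.05 µM: {deff_low:.1f} µm^2/s; Deff at 1 µM: {deff_high:.1f} µm^2/s")
```

      With these illustrative values, Deff is larger at low Ca<sup>2+</sup> than at high Ca<sup>2+</sup>, because the high-affinity mobile buffer carries most of the Ca<sup>2+</sup> at low concentrations and saturates at higher ones.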

      In this respect, I would also include the details of the modelling, such as implementation environment, parameters, and benchmarking. The description in the Supplementary Methods is very similar to the description in the main text. In terms of reproducibility, it would be important to at least provide simulation parameters, and providing the code would align with the emerging standards for reproducible science.

      We apologize for the lack of detail on the modelling in the previous submission. In this revised version, we provide a full description of the model in the Supplementary Information (Note 1).

      To address the reviewer’s request for simulations at different activation levels, we now show an additional simulation in which [IP<sub>3</sub>] is higher (0.1 µM, constant in time and space) and Deff is set to 40 µm<sup>2</sup>/s (Supplementary Note 3). This lower effective diffusion coefficient is consistent with the stronger buffering and reduced Ca<sup>2+</sup> mobility expected for higher-amplitude signals. In this case, the same phenomenological cluster model generates a global Ca<sup>2+</sup> response with larger amplitude and longer duration, rather than a loss of activity due to excessive inhibition (Supplementary Note 3, Figure 1, left panel). The right panel of Supplementary Note 3, Figure 1 shows the 2D cell geometry, where dots indicate the random positions of IP<sub>3</sub>R clusters whose behavior is described by our phenomenological cluster model.

      (2) Quantitative characterization of CCRICs:

      The paper would benefit from a clearer definition of the term CCRICs and quantitative descriptors like duration, amplitude distribution, frequency, and spatial extent (also in relation to the comment on the EGTA measurements below). Furthermore, it remains unclear to me whether CCRICs represent a population of rapidly propagating micro-waves or truly simultaneous events. Maybe kymographs or wave-front propagation analyses (at least from simulations if experimental resolution is too bad) would strengthen this point.

      We agree and have completed the description of CCRICs by adding the following.

      In the Results section, p. 8, l. 27:

      “…with a duration of 2.1 ± 1.0 s (mean ± SEM) (N = 4, 128 responses)”.

      In the Results section, p. 9, l. 13:

      “In rare instances (less than 3%), typical local “Puff” responses elicited by these ATP concentrations could also be detected, often occurring at the cell periphery (Figs. 4B, red region and 4C, red arrow; Fig. S6D, blue trace) (N > 20, cells > 500). As expected from the small concentrations of Ca<sup>2+</sup> released at puff sites, no increase in cytosolic Ca<sup>2+</sup> was detected in a distal cell region (Fig. S6D, top), indicating that isotropic Ca<sup>2+</sup> diffusion from a puff release site cannot account for a Ca<sup>2+</sup> increase over a large cell area. Puffs could also be detected concomitantly with CCRICs in different ROIs of the same cell (Fig. S6D, bottom). In contrast to puffs, CCRICs often showed responses of comparable amplitude in distal regions over the whole cell (Figs. 4C and S6A, B), suggesting a contribution from IP<sub>3</sub>R cluster activation by Ca<sup>2+</sup>-Induced Ca<sup>2+</sup> Release (CICR). Within a given cell, the vast majority of CCRICs appeared quasi-synchronized at the fastest acquisition rate that we could achieve (22 ms / frame). However, in a few instances a delay could be detected in the elicitation of a peak in a distant region of the cell (Fig. S6C). These observations suggest that the quasi-synchronization of CCRICs results from the fast diffusion of Ca<sup>2+</sup> leading to the activation of IP<sub>3</sub>R clusters over a large cell area, which may be delayed in some instances. Scrutiny of CCRICs showed that while their profiles were comparable, the amplitude of these responses varied in different regions of the cell, often with a single 1 µm<sup>2</sup> region, likely corresponding to the initial firing cluster, showing a prominent amplitude and other regions showing a smaller amplitude for a given response (Figs. 4B and 4C). For example, in Fig. 4C, the highest amplitude is observed in the red region for peaks 1 and 3, whereas it is observed in the purple region for peak 2. Thus, for a given CCRIC, the respective contributions of local IP<sub>3</sub>R cluster activation and isotropic diffusion of Ca<sup>2+</sup> from other release sites to the Ca<sup>2+</sup> increase may vary in different regions of the cell”.

      In the Discussion section, 2nd sentence p. 12:

      “CCRICs showed rapid kinetics with an average duration of ca. 2.1 seconds and an amplitude corresponding to an increase in cytosolic Ca<sup>2+</sup> concentration of a few hundred nM, seemingly smaller than that of puffs (Fig. S6D), often occurring repeatedly with a frequency of up to 12 CCRICs / min over the whole cell.”

      We have tried to clarify the notion of coordination versus synchronization of CCRICs by showing the delay observed in some instances in the elicitation of CCRICs at distal regions of the cell, now shown in Fig. S6C.
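
      As an order-of-magnitude aid for relating the 22 ms frame interval to the effective diffusion coefficients discussed above, the root-mean-square displacement for free diffusion in two dimensions is √(4·D·t). This is a simplified sketch that ignores Ca<sup>2+</sup> release, CICR, and cell geometry; the 10 µm distance is a hypothetical example and is not a measurement from the study.

```python
import math


def rms_displacement_2d(d_eff_um2_per_s: float, t_s: float) -> float:
    """Root-mean-square displacement (µm) for free 2D diffusion: sqrt(4 * D * t)."""
    return math.sqrt(4.0 * d_eff_um2_per_s * t_s)


def time_to_spread_2d(distance_um: float, d_eff_um2_per_s: float) -> float:
    """Time (s) at which the 2D RMS displacement reaches the given distance."""
    return distance_um ** 2 / (4.0 * d_eff_um2_per_s)


FRAME_INTERVAL_S = 0.022  # fastest acquisition rate quoted in the response

for d_eff in (100.0, 40.0):  # µm^2/s, the two Deff values used in the simulations
    reach = rms_displacement_2d(d_eff, FRAME_INTERVAL_S)
    print(f"Deff = {d_eff:.0f} µm^2/s: ~{reach:.1f} µm per frame")

# Hypothetical 10 µm separation between two regions of a cell.
print(f"~{time_to_spread_2d(10.0, 100.0):.2f} s to reach 10 µm at Deff = 100 µm^2/s")
```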

      (3) Specificity of pharmacological tools:

      Suramin and U73122 are known to have off-target effects. Control experiments using alternative P2 receptor antagonists like PPADS or inactive U73343 analogs would strengthen the causal link.

      As suggested by the referee, we have performed complementary experiments showing the inhibitory effect of PPADS and the absence of an effect of U73343 on EPEC-induced Ca<sup>2+</sup> responses, including CCRICs. These results are now shown in the amended Fig. S2.

      Reviewer #2 (Public review):

      Summary:

      The authors of this study are trying to resolve how cellular infection by enteropathogenic E. coli (EPEC) subverts cellular signaling pathways to promote infection and dampen immune responses. Specifically, alteration in calcium dynamics has been evidenced in the prior literature as a potential initiator of these adaptations, and this study provides ideas and mechanistic detail as to how cellular calcium dynamics may be subverted by pathogens.

      Strengths:

      The clear strengths of this paper relate to the new ideas inherent in the proposed hypothesis and their support from the experimental approaches used. Overall, the proposed work provides new ideas in this area, which will benefit from further investigation. Certainly, this is an interesting and challenging paradigm to pick apart mechanistically, and is important for improving treatments for intestinal infections.

      Weaknesses:

      Additional insight is needed in three specific areas to convincingly support the conclusions drawn by the authors. These three areas are: first, a better description of the infection-associated calcium signals. Second, a mechanistic definition of the relevant purinoceptors versus other pathways to increase cellular calcium. Third, an effort to show that the proposed pathways have relevance in a polarized epithelial cell.

      (1) first, a better description of the infection-associated calcium signals.

      We agree and have added a more detailed description of the CCRICs in the results and discussion section, as detailed in response to referee 1, Weakness 2 by adding:

      In the Results section, p. 8, l. 27:

      “…with a duration of 2.1 ± 1.0 sec (mean ± SEM) (N = 4, 128 responses)”.

      p. 9, l. 13:

      “In rare instances (less than 3%), typical local “Puff” responses elicited by these ATP concentrations could also be detected, often occurring at the cell periphery (Figs. 4B, red region and 4C, red arrow; Fig. S6D, blue trace) (N > 20, cells > 500). As expected from the small concentrations of Ca<sup>2+</sup> released at puff sites, no increase in cytosolic Ca<sup>2+</sup> was detected in a distal cell region (Fig. S6D, top), indicating that isotropic Ca<sup>2+</sup> diffusion from a puff release site cannot account for the Ca<sup>2+</sup> increase over a large cell area. Puffs could also be detected concomitantly with CCRICs in different ROIs of the same cell (Fig. S6D, bottom). In contrast to puffs, CCRICs often showed responses of comparable amplitude in distal regions over the whole cell (Figs. 4C and S6A, B), suggesting the contribution from IP<sub>3</sub>R cluster activation by Ca<sup>2+</sup>-Induced Ca<sup>2+</sup> Release (CICR). Within a given cell, the vast majority of CCRICs appeared quasi-synchronized at the fastest acquisition rate of 22 ms / frame that we could achieve. However, in a few instances a delay could be detected in the elicitation of a peak in a distant region of a cell (Fig. S6C). These observations suggest that the quasi-synchronization of CCRICs results from the fast diffusion of Ca<sup>2+</sup> leading to the activation of IP<sub>3</sub>R clusters over a large cell area, which may be delayed in some instances. Scrutiny of CCRICs showed that while their profiles were comparable, the amplitude of these responses varied in different regions of the cell, with often a single 1 µm<sup>2</sup> region, likely corresponding to the initial firing cluster, showing a prominent amplitude and other regions with smaller amplitude for a given response (Figs. 4B and 4C). For example, in Fig. 4C, the highest amplitude is observed in the red region for peaks 1 and 3, whereas it is observed in the purple region for peak 2. Thus, for a given CCRIC, the respective contribution of local IP<sub>3</sub>R cluster activation and isotropic diffusion of Ca<sup>2+</sup> from other release sites in Ca<sup>2+</sup> increase may vary in different regions of the cell”.

      In the Discussion section, 2nd sentence p. 12:

      “CCRICs showed rapid kinetics with an average duration of ca 2.1 seconds and amplitude corresponding to an increase in Ca<sup>2+</sup> cytosolic concentration of a few hundred nM, seemingly smaller than that of puffs (Fig. S6D), often occurring repeatedly with a frequency of up to 12 CCRICs / min over the whole cell.”

      We have tried to clarify the notion of coordination versus synchronization of CCRICs by showing the delay observed in some instances in the elicitation of CCRICs at distal regions of the cell, now shown in Fig. S6C.

      CCRICs are observed over the whole cell or a very large cell area. We agree that this point, as well as the comparison with previously described puffs, needed clarification. We have added the following sentences in the discussion and inserted the seminal Thomas et al. 1999 citation in the references, p. 13, l. 18:

      “Consistently, while CCRICs were detected in the vast majority of cells at these very low agonist concentrations, in rare instances, local “puff-like” responses were also detected at the cell periphery. These observations are in contrast to previously described Ca<sup>2+</sup> puffs preceding global responses reported to occur preferentially in the perinuclear area (Thomas et al., 1999). These earlier studies, however, involved higher agonist concentrations (1-5 µM ATP) expected to lead to the release of higher IP<sub>3</sub> concentrations, which may preferentially stimulate larger IP<sub>3</sub>R clusters at the perinuclear region because of the higher density of IP<sub>3</sub>Rs. In addition, larger IP<sub>3</sub>R clusters may release higher amounts of Ca<sup>2+</sup> for which, as opposed to CCRICs, diffusion would be restrained by Ca<sup>2+</sup> buffers, thereby favoring the spatial confinement of the response.”

      (2) Second, a mechanistic definition of the relevant purinoceptors versus other pathways to increase cellular calcium

      We do not believe that CCRICs are specific to EPEC, since they are also elicited by low agonist concentrations. The discrete action of Type III translocons leading to the release of small amounts of extracellular ATP at the onset of EPEC infection prompted us to perform fast Ca<sup>2+</sup> imaging at low agonist concentrations (150 nM ATP, 100 nM histamine, now shown in Fig. S4), which, to our knowledge, differ from the higher agonist concentrations used in all previous studies describing puffs. Our modelling studies support the notion that CCRICs correspond to generic Ca<sup>2+</sup> release-dependent responses triggered by low levels of IP<sub>3</sub>.

      We now show inhibition of CCRICs by PPADS, another purinergic receptor antagonist, and extracellular ATP depletion by addition of hexokinase in the extracellular medium in Figs. S4 and S7.

      Knocking down ATP receptors represents a challenging task since HeLa cells were shown to express transcripts for most of the described 8 P2X and 7 P2Y purinergic receptors (10.1016/j.bbamem.2009.03.006). More importantly, we do not believe that CCRICs are triggered by a specific ATP receptor and do not expect to see inhibition of CCRICs in single knock-down experiments. Our experimental and modelling studies suggest that CCRICs are not specific to EPEC or to a particular ATP receptor, but instead correspond to generic Ca<sup>2+</sup> responses elicited at low concentrations of agonists such as ATP or histamine.

      Zhong et al., 2020 indeed previously showed a role for Ca<sup>2+</sup> influx mediated by the TRPV2 receptor in EPEC-mediated cell death. However, this influx occurred following 8 hours of cell infection with EPEC. We do not detect significant cell death or Ca<sup>2+</sup> influx at the onset of infection corresponding to the 12 hours infection kinetics that we used. Our experiments indicate that CCRICs do not involve Ca<sup>2+</sup> influx.

      (3) Third, an effort to show that the proposed pathways have relevance in a polarized epithelial cell.

      We agree and have performed complementary experiments showing induction of CCRICs by EPEC and eATP in polarized intestinal epithelial cells, now shown in Figure S8.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Statistical treatment and data presentation:

      Some figure legends lack clarity on replicates (n = cells vs N = independent experiments). Timecourse quantifications of p-IκB and p-p65 should include normalized fold-change plots with clear statistical tests.

      To clarify, we replaced “n” by “cells”. The number of determinations and independent experiments (N) has been added in the legends to all relevant Figures and Supplementary Figures.

      As requested, we now show the p-IκB and p-p65 plots normalized to basal p-IκB and p-p65 levels. We now mention in the legend to Fig. 6 that we used an ANCOVA test showing the significance of the effects of eATP on TNF-α-induced IκB and p65 phosphorylation.

      (2) Clarification on the temperature used in imaging (why measured at 35 °C)?

      We have added the following clarification in the Materials and Methods section p. 14, l. 21:

      “Imaging was then carried out at 35°C to allow for bacterial type III secretion, …”

      (3) Figure 4A:

      The image shows a lower image acquisition interval than every 2s that is stated in the caption.

      We apologize for the mistake. The legend to Fig. 4A now reads:

      “Image acquisition every 52 ms (A)…”

      (4) Figure 4B:

      The color of ROIs could be more intense for better identification.

      We have replaced the blue and green ROI colors with light cyan and purple.

      (5) Figure 4C:

      I don't understand the meaning of the dashed lines described by "The dashed red and green lines point at the aggregation of responses throughout the cell" in the caption or in the text.

      We apologize for the lack of clarity and have re-written the corresponding text p. 9, l.25 as follows:

      “Scrutiny of CCRICs showed that while their profiles were comparable, the amplitude of these responses varied in different regions of the cell, with often a ca 3 µm<sup>2</sup> single region, likely corresponding to a source point release, showing a prominent amplitude and other regions with smaller amplitude for a given response (Figs. 4B and 4C). For example, in Fig. 4C, the highest amplitude is observed in the red region for peaks 1 and 3, whereas it is observed in the purple region for peak 2. Thus, for a given CCRIC, the respective contribution of local IP<sub>3</sub>R cluster activation and isotropic diffusion of Ca<sup>2+</sup> from other release sites in Ca<sup>2+</sup> increase may vary in different regions of the cell.”

      (6) Figure S4A:

      The responses for EGTA are not really pointed out. Are the traces meant to show events?

      We have added arrowheads in traces corresponding to ATP + EGTA-AM treatment pointing at “flattened Ca<sup>2+</sup> responses”. The Legend to Fig. S4A now includes the sentence: “ATP + EGTA-AM treatment led to an inhibition of Ca<sup>2+</sup> responses, associated with small variations in the Ca<sup>2+</sup> baseline, that were arbitrarily scored as flattened Ca<sup>2+</sup> pseudo-responses (ATP+EGTA-AM, red arrows).”

      (7) Figure S5:

      Could not identify the purple arrow for the less mobile cluster.

      We agree that the former Figure lacked clarity and have remade Figure S5, now Figure S6, with higher magnification of panels with fast acquisition. The previously purple arrows pointing at larger and less mobile clusters are now shown in black in these enlarged panels. The legend has been changed accordingly.

      (8) There are some typos and suboptimal formulations throughout the manuscript, such as:

      P8: "minute amount" could be changed to low, minor or similar.

      “minute amounts of eATP” was replaced by “low amounts of eATP”.

      P8: put a "%" to the numbers 61.2 ± 5.8.

      “%” was added.

      P16: "manuscript".

      Thank you.

      Reviewer #2 (Recommendations for the authors):

      Suggestions relate to the following three topics.

      First, a better description of the infection-associated calcium signals. The authors emphasize throughout the paper that their imaging data challenge established concepts in the calcium signaling field (discussion). I do not see the calcium imaging data explained either with data or textually with sufficient clarity to evaluate this assertion. A start would be a clear description of the characteristics of the EPEC-evoked calcium signals relative to other local and global domains of calcium signaling previously described in HeLa cells. Prior work has shown that PI-coupled agonists evoke local calcium signals that are perinuclear in HeLa cells (PMID: 10660296), but the relationship of EPEC-evoked transients to these previously defined responses is not clear.

      We agree and have added a more detailed description of the CCRICs in the results and discussion section, as detailed in response to referee 1, Weakness 2.

      Most importantly, it is ambiguous where in the HeLa cell recordings are made. Are these recordings close to the plasma membrane and/or deeper within the cell? The only spatial information is provided in Figure 3A, and these responses are not well described in the text or presented in a way that comparisons can be made to responses from a PI-coupled agonist.

      CCRICs are observed over the whole cell or a very large cell area. We agree that this point, as well as the comparison with previously described puffs, needed clarification. We have added the following sentences in the discussion and inserted the seminal Thomas et al. 1999 citation in the references, p. 13, l. 18:

      “Consistently, while CCRICs were detected in the vast majority of cells at these very low agonist concentrations, in rare instances, local “puff-like” responses were also detected at the cell periphery. These observations are in contrast to previously described Ca<sup>2+</sup> puffs preceding global responses reported to occur preferentially in the perinuclear area (Thomas et al., 1999). These earlier studies, however, involved higher agonist concentrations (1-5 µM ATP) expected to lead to the release of higher IP<sub>3</sub> concentrations, which may preferentially stimulate larger IP<sub>3</sub>R clusters at the perinuclear region because of the higher density of IP<sub>3</sub>Rs. In addition, larger IP<sub>3</sub>R clusters may release higher amounts of Ca<sup>2+</sup> for which, as opposed to CCRICs, diffusion would be restrained by Ca<sup>2+</sup> buffers, thereby favoring the spatial confinement of the response.”

      If I understand the described responses correctly, could not these rapid local responses result from a change in cellular calcium buffering capacity consequent to infection? Are the authors proposing that these responses occur in other cells also, or represent a pathogen-specific signaling mode?

      We do not believe that CCRICs are specific to EPEC, since they are also elicited by low agonist concentrations. The discrete action of Type III translocons leading to the release of small amounts of extracellular ATP at the onset of EPEC infection prompted us to perform fast Ca<sup>2+</sup> imaging at low agonist concentrations (150 nM ATP, 100 nM histamine, now shown in Fig. S4), which, to our knowledge, differ from the higher agonist concentrations used in all previous studies describing puffs. Our modelling studies support the notion that CCRICs correspond to generic Ca<sup>2+</sup> release-dependent responses triggered by low levels of IP<sub>3</sub>.

      Second, evidence supporting a mechanistic role of ATP comes from prior literature, together with the authors' presented data showing the effects of PLC (to inhibit IP3), pharmacological inhibition (suramin, a non-selective purinoceptor blocker), and the effects of T3SS-deficient mutants (to prevent ATP release). However, there are missing steps here to mechanistically identify how ATP is working. First, does degradation of extracellular ATP (e.g., apyrase) block these responses? Second, given HeLa cells are easily amenable to knockdown approaches, does knockdown of particular ATP receptors, or TRPV2 as suggested in the prior literature, impact the calcium signal dynamics?

      We now show inhibition of CCRICs by PPADS, another purinergic receptor antagonist, and extracellular ATP depletion by addition of hexokinase in the extracellular medium in Figs. S4 and S7.

      Knocking down ATP receptors represents a challenging task since HeLa cells were shown to express transcripts for most of the described 8 P2X and 7 P2Y purinergic receptors (10.1016/j.bbamem.2009.03.006). More importantly, we do not believe that CCRICs are triggered by a specific ATP receptor and do not expect to see inhibition of CCRICs in single knock-down experiments. Our experimental and modelling studies suggest that CCRICs are not specific to EPEC or to a particular ATP receptor, but instead correspond to generic Ca<sup>2+</sup> responses elicited at low concentrations of agonists such as ATP or histamine.

      Zhong et al., 2020 indeed previously showed a role for Ca<sup>2+</sup> influx mediated by the TRPV2 receptor in EPEC-mediated cell death. However, this influx occurred following 8 hours of cell infection with EPEC.

      We do not detect significant cell death or Ca<sup>2+</sup> influx at the onset of infection corresponding to the 12 hours infection kinetics that we used. Our experiments indicate that CCRICs do not involve Ca<sup>2+</sup> influx.

      Third, while the use of HeLa cells provides advantages for imaging and mechanistic assays, the effort to replicate findings in an intestinal cell line would heighten relevance, given the likely importance of cell type and cell polarity on the pathogen-evoked responses.

      We agree and have performed complementary experiments showing induction of CCRICs by EPEC and eATP in polarized intestinal epithelial cells, now shown in Figure S8.

    1. eLife Assessment

      This valuable study advances our understanding of best practices for analyzing population-level data using advanced functional alignment methods. It provides convincing evidence that demographic-specific functional templates improve functional neuroimaging studies that use hyperalignment. This study will be of interest to cognitive neuroscientists, neuroimaging methodologists, and computational researchers with an interest in the human brain.

    2. Reviewer #1 (Public review):

      The authors present a compelling case for the necessity of age-specific templates in functional hyperalignment. Given that the brain undergoes substantial developmental, structural, and functional changes across the lifespan, a 'one-size-fits-all' canonical template is often insufficient. This study effectively demonstrates that incorporating age-congruent features significantly enhances the performance and sensitivity of hyperalignment models. By validating these findings across two independent datasets (Cam-CAN and DLBS), the paper provides robust evidence that accounting for age-related functional organization is a critical prerequisite for accurate functional alignment in lifespan research.

      Comments on revised version:

      The authors have been exceptionally thorough in addressing the concerns raised by the reviewers. In particular, the inclusion of the supplemental analysis on the middle-aged cohort is a valuable addition that strengthens the manuscript. Furthermore, the rationale for employing a congruent template is well-articulated; this approach clearly provides a more robust and accurate foundation for reconstructing individualized connectomes. I appreciate the authors' detailed responses and have no further comments.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, Zhang and colleagues examine the role of participant selection in creating and using functional templates to improve analyses using hyperalignment. Hyperalignment aligns participants' functional MRI data to a shared functional template, analogous to the anatomical templates used to bring anatomical MRI data into a shared space (e.g., MNI152). The question of appropriate template creation is especially pressing for population-level analyses, where a large number of demographic groups (e.g., different age ranges, clinical statuses) may be included in the same analysis. These different demographic groups may have differences in their functional organization that complicate the creation of a single study-specific functional template.

      To provide an initial investigation of the potential effect of demographic-specific templates, the authors use the publicly available Cam-CAN dataset which contains participants from 18 to 87 years of age. They define a young adult (< 45 years of age) and an older adult group (> 65 years of age) from this dataset with approximately the same number of participants. They investigate whether "age-congruent" templates (i.e. defined in the same age group they are used) improve three analyses where hyperalignment has been previously shown to boost performance: inter-subject correlation, predicting individual connectomes, and predicting individual functional responses. Using the Cam-CAN derived older adult template, they then replicate the ISC analyses using the publicly available Dallas Lifespan Brain Study (DLBS).

      Overall, the presented results are highly suggestive that age-congruent templates consistently improve performance, though the absolute effects are small.

      Strengths:

      The use of a separate validation sample, reusing the same template calculated with Cam-CAN, highlights the potential of developing independent templates for individual demographic groups and then distributing these for wider use, analogous to the MNI templates that are widely used throughout the field of neuroimaging. This suggests that the potential impact of this framework is significant.

      Weaknesses:

      In their revision, the authors have addressed the previously raised "weaknesses" by providing guidance for researchers interested in using age-specific hyperalignment templates in practice.

      Impact:

      Overall, this work is likely to encourage future development of age-specific functional templates in the imaging community.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors present a compelling case for the necessity of age-specific templates in functional hyperalignment. Given that the brain undergoes substantial developmental, structural, and functional changes across the lifespan, a 'one-size-fits-all' canonical template is often insufficient. This study effectively demonstrates that incorporating age-congruent features significantly enhances the performance and sensitivity of hyperalignment models. By validating these findings across two independent datasets (Cam-CAN and DLBS), the paper provides robust evidence that accounting for age-related functional organization is a critical prerequisite for accurate functional alignment in lifespan research.

      Strengths:

      (1) The authors used three metrics to evaluate performance. Across all metrics, they found that age-congruent templates outperformed age-incongruent templates, suggesting that age-specific templates can improve alignment.

      (2) These findings highlight the superiority of age-congruent templates for hyperalignment. This work underscores the importance of age-matching in cross-subject functional mapping and represents a vital step forward for the methodology.

      We thank the reviewer for the summary and the positive evaluation of our manuscript.

      Weaknesses:

      (1) Participant Demographics and Group Separation:

      The study defines the 'older' cohort as 65-90 years and the 'younger' cohort as 18-45 years. While this 20-year gap (ages 46-64) effectively maximizes the contrast between groups, the results in Figure 4a suggest that the predicted individualized connectomes follow a continuous distribution. Given this continuity, could the authors provide the average median trends for Figures 2a and 2b to illustrate how the model behaves across the missing age range?

      Thank you for raising this important point. We have calculated the results for the middle-aged cohort template and included them in Supplementary Figures 4 and 5. As in Figures 2a, 2b, 3a and 3b, we directly compare the intersubject correlation and prediction performance of the middle-aged participants when aligned to their congruent middle-aged template versus an incongruent template. We observed consistent results across validation analyses (ISC and prediction) and groups (young vs. middle-aged, middle-aged vs. old). Consistent with our main findings, the middle-aged cohort exhibits significantly higher intersubject correlation and prediction performance when using the age-congruent middle-aged template. These results confirm that the age-related shifts in functional brain organization captured by the hyperalignment templates follow a continuous trajectory across the lifespan.

      (2) Request for Implementation:

      I have been unable to locate the source code associated with this publication. Could the authors please provide a link to the repository or clarify if the implementation is available for reproduction?

      We have made our scripts publicly available on GitHub: https://github.com/yuqi98/Aging_templates_scripts

      (3) Analysis of Prediction Performance and Distribution:

      While Figures 3b and 5b clearly demonstrate that the congruent template improves correlation, Figure 4a shows a distinct shift in the scatter distribution. Could the authors provide a detailed explanation of the prediction performance metrics used? Specifically, I would like to understand how the underlying method accounts for the distribution differences observed when applying the congruent template.

      Our prediction performance metric is the average Pearson correlation. We calculated the correlation between the model-predicted data (the individualized connectome in Figure 3 and the movie response in Figure 5) and the participant's actual measured data for each cortical vertex and averaged the correlations across vertices. A higher correlation indicates that the group template, when combined with the participant’s individualized transformation matrix, more accurately reconstructs the individualized functional connectome and responses to stimuli.
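      The metric described above can be sketched in a few lines. This is a minimal illustration written for this response, not the code in the authors' repository: the function names `pearson` and `prediction_performance`, the list-of-lists data layout (one profile per cortical vertex), and the toy values are all assumptions made for the example.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def prediction_performance(
    predicted: Sequence[Sequence[float]],
    measured: Sequence[Sequence[float]],
) -> float:
    """Average of per-vertex Pearson correlations.

    predicted[v] and measured[v] hold the profile of vertex v
    (a connectivity profile or a response time series).
    """
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("predicted and measured must have the same, non-zero number of vertices")
    per_vertex = [pearson(p, m) for p, m in zip(predicted, measured)]
    return sum(per_vertex) / len(per_vertex)


# Hypothetical toy data: two vertices, four samples each.
predicted = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
measured = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
score = prediction_performance(predicted, measured)
```

      In this toy case the first vertex is perfectly predicted and the second is perfectly anti-correlated, so the two per-vertex correlations cancel in the average. A higher average indicates that the predicted profiles track the measured ones more closely across the cortex.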

      The distinct upward shift in prediction performance when using a congruent template occurs because brain functional organization shows age-specific features. A congruent template captures these age-specific connectivity and response features. Importantly, the template creation algorithm aims to reflect the central tendency of the training data, including representational/connectivity geometry and functional topographies. Therefore, the observed differences in templates reflect differences in functional organization across age groups. As a result, when projecting the common template back into an individual’s native cortical space using the transformation matrix derived from independent data, the congruent template provides a richer, more accurate basis for reconstructing the individualized connectome and movie-watching responses.

      Reviewer #2 (Public review):

      Summary:

      In this study, Zhang and colleagues examine the role of participant selection in creating and using functional templates to improve analyses using hyperalignment. Hyperalignment aligns participants' functional MRI data to a shared functional template, analogous to the anatomical templates used to bring anatomical MRI data into a shared space (e.g., MNI152). The question of appropriate template creation is especially pressing for population-level analyses, where a large number of demographic groups (e.g., different age ranges, clinical statuses) may be included in the same analysis. These different demographic groups may have differences in their functional organization that complicate the creation of a single study-specific functional template.

      To provide an initial investigation of the potential effect of demographic-specific templates, the authors use the publicly available Cam-CAN dataset, which contains participants from 18 to 87 years of age. They define a young adult (< 45 years of age) and an older adult group (> 65 years of age) from this dataset with approximately the same number of participants. They investigate whether "age-congruent" templates (i.e. defined in the same age group they are used) improve three analyses where hyperalignment has been previously shown to boost performance: inter-subject correlation, predicting individual connectomes, and predicting individual functional responses. Using the Cam-CAN-derived older adult template, they then replicate the ISC analyses using the publicly available Dallas Lifespan Brain Study (DLBS).

      Overall, the presented results are highly suggestive that age-congruent templates consistently improve performance, though the absolute effects are small.

      Strengths:

      The use of a separate validation sample, reusing the same template calculated with Cam-CAN, highlights the potential of developing independent templates for individual demographic groups and then distributing these for wider use, analogous to the MNI templates that are widely used throughout the field of neuroimaging. This suggests that the potential impact of this framework is significant.

      We thank the reviewer for the summary and the positive evaluation of our manuscript.

      Weaknesses:

      While the authors appropriately highlight the potential applications of this result (e.g., to different clinical statuses), it is not apparent how to appropriately extend this methodology to many common experimental paradigms. For example, in case-control studies (where researchers are interested in comparing clinical and non-clinical participants) the use of two different functional templates may complicate rather than ease analyses. Providing this as a potential limitation of the current template construction method, or providing recommendations to researchers interested in comparing across groups, would help to increase the impact of this work.

      We appreciate the reviewer raising this important practical consideration. We have added additional explanation to the Discussion section to provide clear recommendations for researchers applying this methodology, which we summarize below:

      When the goal of a case-control study is to directly compare functional organization or brain responses between clinical and non-clinical participants, it is essential that all individuals are hyperaligned to the same common template. For these analyses, researchers should either construct a joint template containing a balanced, representative sample from both groups, or align all participants to a normative control template. This ensures that the resulting data share a single coordinate system, allowing for valid statistical comparisons between groups.

      However, disease-specific or age-specific templates are highly advantageous when the research objective is to maximize decoding accuracy or predictive performance within a specific population. In real-world clinical or lifespan research, if the goal is to build a reliable diagnostic biomarker for disease progression or to map individualized connectomes for a specific patient cohort, researchers should use a template congruent with that specific group. The congruent template will preserve the group-specific representational geometry, providing a better individual-level prediction than a general cortical template.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      In general, there appears to be significantly more spread in the values for older adults (e.g., Figure 4b). It would be useful to know whether subdividing this group improves its relative performance; however, this will likely require additional investigation into the number of participants needed to establish a minimal template.

      We thank the reviewer for this constructive comment. We agree that older adults exhibit greater inter-individual variability in functional organization, which likely drives the larger spread observed in Figure 4b. We also appreciate the suggestion to subdivide this group to see if narrower age bins improve relative performance.

      We have constructed templates using narrower, 10-year age intervals and evaluated their performance. Because model performance increases with the amount of training data, we used a fixed number of training participants for each age group (two thirds of the participants in the smallest group) to build the templates and ensure a fair comparison. We have added the results as Supplementary Figure 6. The results show a continuous gradient of age-related divergence. When predicting data for the 80–90 cohort, the 20–30 template performs the worst, and performance steadily improves as the template age gets closer to the target demographic. This systematic gradient further supports our main finding: the penalty for using an incongruent template increases with the discrepancy between the template age and participant age.
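      The matched-sample-size step described above can be sketched as follows. This is only an illustration under stated assumptions: the group labels, the group sizes, the participant IDs, and the function name `matched_training_sets` are hypothetical and do not come from the study, and random sampling without replacement is assumed as the selection rule.

```python
import random

def matched_training_sets(groups, seed=0):
    """Draw the same number of training participants from every age group."""
    rng = random.Random(seed)
    # Two thirds of the smallest group, computed with integer arithmetic.
    n_train = (2 * min(len(ids) for ids in groups.values())) // 3
    # Sample without replacement so no participant is counted twice.
    return {name: sorted(rng.sample(ids, n_train)) for name, ids in groups.items()}

# Hypothetical group sizes and IDs, for illustration only.
groups = {
    "20-30": list(range(0, 45)),
    "50-60": list(range(100, 160)),
    "80-90": list(range(200, 230)),
}
training = matched_training_sets(groups)
print({name: len(ids) for name, ids in training.items()})
```

      With these made-up sizes the smallest group has 30 participants, so every template would be built from 20 people, which removes training-set size as a confound when comparing templates.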

      Interestingly, we noticed that at the extreme ends of the age range (20–30 and 80–90), the strictly congruent template was slightly outperformed by the immediately adjacent age bin (i.e., the 30–40 template for young participants, and the 70–80 template for the oldest participants). Because we strictly matched the number of training subjects across all bins, this slight dip is likely driven by differences in raw data quality. It is common for fMRI data from the extreme ends of the lifespan to have slightly lower signal-to-noise ratios or higher head motion compared to the intermediate 30–40 or 70–80 cohorts. This suggests that while age congruency is a key driver of hyperalignment success, the intrinsic data quality of the cohort used to build the template also plays a practical role in its overall performance.

      This brings up the reviewer’s second point regarding the number of participants needed to establish a minimal template. Subdividing the age groups reduces the sample size available to construct each template. Previous research has demonstrated that while a hyperalignment template derived from a relatively small number of participants can achieve acceptable performance, increasing the amount of data and the number of subjects in the template space consistently and robustly improves alignment quality (see Supplementary Figure 7 in Feilong et al., 2023). Ultimately, our long-term goal is to build highly robust, standardized templates for fine-grained age cohorts across the entire lifespan. We are preparing to collect large-scale datasets from age 20 to 100 to build age-specific templates and provide them as open resources. This will allow future researchers to directly align their data to an age-appropriate template without needing to construct one from their own limited samples.

      Reference

      Feilong, M., Nastase, S. A., Jiahui, G., Halchenko, Y. O., Gobbini, M. I., & Haxby, J. V. (2023). The individualized neural tuning model: Precise and generalizable cartography of functional architecture in individual brains. Imaging Neuroscience, 1, 1–34. https://doi.org/10.1162/imag_a_00032

    1. eLife Assessment

      This important study substantially advances the imaging toolbox available to neuroscientists by presenting a tunable Bessel (tBessel-TPFM) platform that enables high-speed volumetric two-photon imaging. The evidence supporting the novel methodology is convincing, with rigorous benchmarking and demonstrations of a wide range of neuroimaging applications covering vascular dynamics, neurovascular coupling, optogenetic perturbation, and microglial responses. The work will be of broad interest to neuroscientists and imaging system tool developers.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript presents a tunable Bessel-beam two-photon fluorescence microscopy (tBessel-TPFM) platform that enables high-speed volumetric imaging with stable axial focus. The work is technically strong and broadly significant, as it substantially improves the flexibility and practicality of Bessel-beam-based two-photon microscopy. The demonstrations are generally strong and bridge a wide range of neuroimaging applications, namely vascular dynamics, neurovascular coupling, optogenetic perturbation, and microglial responses. These convincingly show that the approach enables biological measurements that are difficult or impractical with existing methods.

      The evidence supporting the technical and biological claims is generally strong. The optical design is carefully motivated, clearly described, and validated through a combination of simulations and experimental characterization. The biological applications are diverse and well chosen to highlight the strengths of the proposed method, and the data are of high quality, with appropriate controls and comparative measurements where relevant.

      Strengths:

      (1) The optical innovation addresses a well-recognized limitation of existing Bessel-TPFM implementations, namely axial focus drift during tuning, and does so using a relatively simple, light-efficient, and cost-effective design.

      (2) The manuscript provides convincing experimental evidence for this being a versatile platform to map flow dynamics across diverse vessel sizes and orientations in both healthy and pathological states.

      (3) Biological demonstrations are comprehensive and span multiple domains such as hemodynamics, neurovascular coupling, and neuroimmune responses.

      (4) Quantitative analyses of blood flow across vessel sizes and orientations, including kilohertz line scanning, are particularly compelling and clearly beyond the reach of standard Gaussian TPFM.

      (5) Particular advantages are that higher blood flow speeds become measurable, up to 23 mm/s (20x more than conventional frame scanning), and that simultaneous (Bessel-)imaging and (Gaussian-)perturbation are possible because of the stable axial focus.

      Weaknesses:

      (1) At present, the paper does not properly position the new Bessel-beam method against previous work, and fails to compare it to alternative fast volumetric imaging methods without Bessel beams.

      (2) The cost-effectiveness of the proposed method is not well described or supported by evidence; it would be useful to include more detail or remove this claim.

      (3) Some biological conclusions, e.g., regarding novel features of microglial dynamics (i.e., the observed two-wave responses and coordinated extension-retraction), are based on relatively limited sample size and would benefit from clearer discussion of variability across animals and fields of view.

      (4) The use of neural network-based denoising for microglial imaging is reasonable but introduces potential concerns about trustworthiness; additional clarification of validation or failure modes would strengthen confidence in these results.

      To conclude, most of the authors' claims are well supported by the data. The central conclusion, namely that tBessel-TPFM provides tunable volumetric imaging enabling experiments not feasible with existing two-photon approaches, is justified. Some biological interpretations would benefit from a more cautious framing, but they do not undermine the main technical and methodological contributions of the study. This is a strong and technically rigorous manuscript that makes a substantial methodological advance with clear relevance to neuroscience and intravital imaging. Minor clarifications and a slightly more measured discussion of certain biological findings are recommended.

    3. Reviewer #2 (Public review):

      Summary:

      The authors describe a tunable Bessel beam two-photon microscope (tBessel-TPFM) designed to overcome a common limitation of Bessel-based volumetric imaging: axial shifts of the effective focus during Bessel beam parameter tuning. Their optical design allows independent control of axial beam length and resolution while keeping the axial center fixed. This is extensively validated through simulations and experiments.

      Strengths:

      A major strength of the work is the breadth of validation combined with the level of technical detail provided. The authors carefully characterize the optical performance of the system and clearly explain the design choices and underlying derivations, which will make it easier for others to understand and implement. The authors demonstrate the utility of the method across several in vivo applications, including neurovascular imaging, blood flow measurements, optogenetic stimulation, and microglial dynamics.

      Weaknesses:

      In the in vivo demonstrations, the authors employ different Bessel beam configurations across experiments, but the beam parameters are not dynamically tuned during live imaging. A video example showing continuous or interactive tuning of the Bessel beam within a single in vivo imaging sequence would further highlight the practical advantages of this platform and strengthen the case for its potential applications. In addition, while excitation powers are reported, the manuscript does not place these values in the broader context of known photodamage thresholds for two-photon microscopy, which would be helpful to the readers. Denoising/image restoration are applied in one of the in vivo examples, but it is unclear why this step was used specifically for this dataset and whether it was necessary to achieve adequate SNR or primarily included as an additional demonstration.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript presents an elegant and cost-effective approach for generating a tunable Bessel beam on a conventional two-photon microscope. The authors assemble a compact optical module comprising three axicons and a series of lenses that permits rapid adjustment of both lateral resolution and axial extent without modifying the focal plane. This flexibility enables the system to be readily adapted to a variety of biological preparations. As a proof of concept, the authors employ the device to record blood flow velocities in cortical microcapillaries, arterioles, and venules, thereby directly visualizing vasodilatation and vasoconstriction dynamics and permitting quantitative analysis of neurovascular coupling across cortical layers in awake mice.

      The authors demonstrate that the tunability of the Bessel beam can be exploited to match the numerical aperture to the vessel type: a high NA configuration, albeit slower scan, is optimal for resolving flow in capillaries, whereas a low NA setting provides faster acquisition suitable for arterioles and venules. By implementing a one-dimensional line scan with the Bessel beam, they achieve an imaging speed that is twentyfold faster than conventional frame-by-frame scanning, which proves sufficient to capture hemodynamic transients before and after an induced ischemic stroke.

      In addition to pure observation, the authors integrate a co-propagating Gaussian line to the system, allowing simultaneous imaging and photostimulation within the same focal plane. This capability addresses a common limitation of other Bessel beam implementations, in which the observation and perturbation planes often become misaligned when the Bessel beam is altered. The manuscript also emphasizes the advantage of Bessel beam excitation for calcium imaging after a perturbation, because it captures neuronal activity in planes both above and below the nominal focal plane, signals that would be missed with a standard Gaussian focus. Finally, the authors apply the technique to investigate the neuroimmune response following targeted microglial ablation; they report that adjacent microglia extend processes toward the injury site while retracting processes in the opposite direction.

      Overall, the work offers a technically straightforward yet powerful extension to existing two-photon platforms, providing high-speed, volumetric imaging and stimulation capabilities that are well-suited to a broad range of neurovascular and neuroimmune studies. The experimental validation is quite thorough, and the presented data convincingly illustrates the benefits of the approach.

      Strengths:

      The authors present a truly clever and inexpensive optical module that can be integrated into almost any two-photon microscope, providing a tunable Bessel beam with a minimal modification of the existing system. The experimental data and accompanying quantitative analysis convincingly demonstrate that the system can reveal physiological events, such as capillary flow, calcium transients across multiple axial planes, and microglial process dynamics, that are difficult or impossible to capture with a conventional Gaussian beam. The breadth of experiments chosen for the manuscript illustrates the practical utility of the device and supports the authors' conclusions that it extends the functional repertoire of standard two-photon microscopy.

      Weaknesses:

      The manuscript would benefit from a more detailed contextualisation of the claimed speed advantage. Although the authors mention other techniques in the introduction, they do not provide any direct comparison with other state-of-the-art high-speed two-photon approaches such as light beads microscopy (Demas et al., Nat. Methods 2021), temporal multiplexing schemes (Weisenburger et al., Cell 2019), or random access microscopy (Villette et al., Cell 2019). A brief comparison of imaging speed, spatial resolution, and instrumental complexity would enable readers to assess the relative merits of the present method.

      A second limitation that warrants discussion is the inherent trade-off between volumetric coverage and image specificity. Because the Bessel beam excites fluorescence throughout an extended axial range, the detector inevitably integrates signal from a three-dimensional volume into a two-dimensional image. In densely labelled tissue, this can lead to significant signal crosstalk, reducing contrast and complicating quantitative interpretation. A brief analysis of how labeling density affects the fidelity of flow or calcium measurements, or suggestions for mitigating crosstalk (e.g., computational deconvolution, adaptive excitation shaping, or combinatorial sparse labeling), would broaden the applicability of the technique.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      This manuscript presents a tunable Bessel-beam two-photon fluorescence microscopy (tBessel-TPFM) platform that enables high-speed volumetric imaging with stable axial focus. The work is technically strong and broadly significant, as it substantially improves the flexibility and practicality of Bessel-beam-based two-photon microscopy. The demonstrations are generally strong and bridge a wide range of neuroimaging applications, namely vascular dynamics, neurovascular coupling, optogenetic perturbation, and microglial responses. These convincingly show that the approach enables biological measurements that are difficult or impractical with existing methods.

      The evidence supporting the technical and biological claims is generally strong. The optical design is carefully motivated, clearly described, and validated through a combination of simulations and experimental characterization. The biological applications are diverse and well chosen to highlight the strengths of the proposed method, and the data are of high quality, with appropriate controls and comparative measurements where relevant.

      Strengths:

      (1) The optical innovation addresses a well-recognized limitation of existing Bessel-TPFM implementations, namely axial focus drift during tuning, and does so using a relatively simple, light-efficient, and cost-effective design.

      (2) The manuscript provides convincing experimental evidence for this being a versatile platform to map flow dynamics across diverse vessel sizes and orientations in both healthy and pathological states.

      (3) Biological demonstrations are comprehensive and span multiple domains such as hemodynamics, neurovascular coupling, and neuroimmune responses.

      (4) Quantitative analyses of blood flow across vessel sizes and orientations, including kilohertz line scanning, are particularly compelling and clearly beyond the reach of standard Gaussian TPFM.

      (5) Particular advantages are that higher blood flow speeds become measurable, up to 23 mm/s (20x more than conventional frame scanning), and that simultaneous (Bessel-)imaging and (Gaussian-)perturbation are possible because of the stable axial focus.

      We thank the reviewer for this thoughtful and encouraging evaluation of our work. We are particularly grateful for the recognition of both the technical rigor and the broad applicability of the tBessel-TPFM platform, as well as the assessment that our approach enables biological measurements that are difficult or impractical with existing methods. We appreciate the reviewer’s detailed summary of the strengths of the manuscript, including the identification of axial focus drift as a major limitation in prior Bessel-TPFM implementations, and the value of our center-stable, light-efficient, and accessible solution. We also thank the reviewer for finding our biological demonstrations compelling and well supported by quantitative analysis.

      Weaknesses:

      (1) At present, the paper does not properly position the new Bessel-beam method against previous work, and fails to compare it to alternative fast volumetric imaging methods without Bessel beams.

      We thank the reviewer for this important point. We agree that a more explicit comparison with existing fast volumetric imaging methods helps clarify the unique advantages of our system. Alternative fast volumetric imaging methods without Bessel beams include remote focusing (Sofroniew et al., 2016), acousto-optic deflectors (AODs) (Villette et al., 2019), piezoelectric objective stages (Göbel and Helmchen, 2007), tunable acoustic gradient (TAG) lenses (Huang et al., 2019), electrically tunable lenses (ETLs) (Grewe et al., 2011; Yang et al., 2018), and light beads microscopy (Demas et al., 2021). These methods have each enabled important forms of rapid volumetric imaging, but they differ in their speed, resolution, axial range, and optical complexity. For example, remote focusing can provide rapid axial refocusing while preserving high-resolution imaging, but it has a limited defocus range and requires a carefully aligned relay system and aberration control to maintain image quality. AOD-based approaches enable fast random-access sampling, but they introduce optical and calibration complexity associated with dispersion and suffer light loss due to limited diffraction efficiency. Piezoelectric objective scanning is comparatively simple and broadly accessible, but its mechanical inertia limits volume rate and can introduce artifacts during rapid or large axial motion. TAG lenses and ETLs provide compact non-mechanical axial scanning, but they pose challenges for aberration control and synchronization. Light beads microscopy achieves high volumetric throughput by near-simultaneously sampling multiple axial positions, but it faces an intrinsic compromise among axial coverage, number of sampling planes, and lateral sampling density, which limits lateral resolution when imaging over large depth ranges.

      Previous Bessel-beam TPFM approaches address some of these limitations by converting volumetric imaging into two-dimensional scanning with an axially extended focus. However, many existing implementations either rely on a fixed Bessel beam profile, which limits the ability to adapt spatial resolution and axial coverage to different biological applications, or use spatial light modulators, which provide tunability but introduce higher cost, increased optical complexity, reduced light efficiency, and sequential rather than simultaneous multi-wavelength operation. Other axicon- or lens-based tunable Bessel approaches have also been reported, but these designs generally introduce axial displacement of the Bessel focus during tuning.

      In contrast, our tBessel-TPFM design provides full tunability comparable to SLM-based methods while maintaining a stable axial beam center. At the same time, it is low cost, easy to implement, and intrinsically light efficient, and it supports simultaneous multi-color imaging. Therefore, tBessel-TPFM provides a unique solution for applications where axial projection is acceptable and where high-speed volumetric monitoring, tunable axial coverage, motion robustness, optical simplicity, and compatibility with simultaneous perturbation are valuable.

      (2) The cost-effectiveness of the proposed method is not well described or supported by evidence; it would be useful to include more detail or remove this claim.

      We thank the reviewer for requesting clarification and supporting evidence regarding the cost-effectiveness of our method. We now provide a detailed cost breakdown of the tBessel module. Briefly, the module consists of three axicons, three lenses, and one iris that together enable independent control of the NA and ΔNA of the generated Bessel beam. Based on the specified components, the three axicons (AX252B and AX255B, Thorlabs) cost $635 each, the three lenses (AC254-125-B×2 and AC254-150-B, Thorlabs) cost $110 each, and the iris (SM2D25D, Thorlabs) costs $105, resulting in a total module cost of approximately $2,340. For comparison, spatial light modulator (SLM)-based implementations that offer comparable tunability typically require an SLM module costing on the order of $20,000, in addition to more complex optical alignment and reduced optical efficiency.
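      The arithmetic behind the quoted total can be checked with a few lines. This is a minimal sketch that uses only the quantities and unit prices stated in the response; the dictionary layout and variable names are illustrative, and the SLM figure is the order-of-magnitude value quoted above rather than a vendor price.

```python
# (quantity, unit price in USD) as stated in the author response.
components = {
    "axicons (AX252B, AX255B)": (3, 635),
    "lenses (AC254-125-B x2, AC254-150-B)": (3, 110),
    "iris (SM2D25D)": (1, 105),
}

total = sum(quantity * price for quantity, price in components.values())
slm_cost = 20_000  # order-of-magnitude figure quoted for an SLM module

print(total)                       # 2340
print(round(slm_cost / total, 1))  # 8.5
```

      The sum reproduces the stated $2,340, so on these numbers the SLM module alone costs roughly 8.5 times as much as the whole tBessel module.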

      (3) Some biological conclusions, e.g., regarding novel features of microglial dynamics (i.e., the observed two-wave responses and coordinated extension-retraction), are based on relatively limited sample size and would benefit from clearer discussion of variability across animals and fields of view.

      We thank the reviewer for this important comment regarding the limited sample size of the microglial dynamics study. We agree that a more comprehensive assessment across animals would be required to establish the generality of these biological findings. In the current study, our intent is not to draw broad biological conclusions, but rather to report observations enabled by the tBessel-TPFM platform. As noted in the manuscript, we have deliberately used descriptive language (e.g., “two distinct waves of process extension were observed”, “process dynamics revealed…”, and “advancing processes displayed…”) to avoid overclaiming the biological findings beyond the data presented.

      (4) The use of neural network-based denoising for microglial imaging is reasonable but introduces potential concerns about trustworthiness; additional clarification of validation or failure modes would strengthen confidence in these results.

      We thank the reviewer for raising this important point regarding the reliability of neural network-based denoising. We agree that additional validation and discussion of potential failure modes are essential to build confidence in these results. To assess the fidelity of the CARE-denoised data, we performed several additional analyses (Author response image 1). First, we compared normalized raw and denoised images averaged over 10 frames. The difference between the two images was spatially uniform and primarily reflected residual noise present in the raw data, rather than structured discrepancies (Author response image 1a). As expected, brighter features like microglial somata exhibited smaller differences due to their intrinsically higher signal-to-noise ratio, whereas weaker processes showed larger noise-related differences. Second, we extended this comparison across the full time-lapse sequence by applying consistent color mapping to both raw and denoised videos and computing frame-by-frame difference maps. These analyses show that the observed differences are consistent with noise suppression, without introducing coherent structural features or altering the apparent microglial dynamics (Author response image 1b).

      Author response image 1.

      Validation of CARE-based denoising for microglial imaging. (a) Comparison of 10-frame averaged normalized raw (left), CARE-denoised (middle), and their pixel-wise difference (right) images. The second row shows a zoomed-in view of the boxed region. (b) Color-coded time-lapse projections over a 10-minute imaging session for the raw (left) and CARE-denoised (middle) data, along with their pixel-wise difference (right).

      To conclude, most of the authors' claims are well supported by the data. The central conclusion, namely that tBessel-TPFM provides tunable volumetric imaging enabling experiments not feasible with existing two-photon approaches, is justified. Some biological interpretations would benefit from a more cautious framing, but they do not undermine the main technical and methodological contributions of the study. This is a strong and technically rigorous manuscript that makes a substantial methodological advance with clear relevance to neuroscience and intravital imaging. Minor clarifications and a slightly more measured discussion of certain biological findings are recommended.

      We thank the reviewer for this thoughtful and encouraging summary of our work. We greatly appreciate the recognition that tBessel-TPFM provides a meaningful methodological advance and enables volumetric imaging experiments that are difficult or impractical with existing two-photon approaches.

      Reviewer #2 (Public review):

      The authors describe a tunable Bessel beam two-photon microscope (tBessel-TPFM) designed to overcome a common limitation of Bessel-based volumetric imaging: axial shifts of the effective focus during Bessel beam parameter tuning. Their optical design allows independent control of axial beam length and resolution while keeping the axial center fixed. This is extensively validated through simulations and experiments.

      Strengths:

      A major strength of the work is the breadth of validation combined with the level of technical detail provided. The authors carefully characterize the optical performance of the system and clearly explain the design choices and underlying derivations, which will make it easier for others to understand and implement. The authors demonstrate the utility of the method across several in vivo applications, including neurovascular imaging, blood flow measurements, optogenetic stimulation, and microglial dynamics.

      We thank the reviewer for their thoughtful and encouraging comments. We greatly appreciate the recognition of the technical rigor, breadth of validation, and clarity of explanation presented in our work.

      Weaknesses:

      In the in vivo demonstrations, the authors employ different Bessel beam configurations across experiments, but the beam parameters are not dynamically tuned during live imaging. A video example showing continuous or interactive tuning of the Bessel beam within a single in vivo imaging sequence would further highlight the practical advantages of this platform and strengthen the case for its potential applications.

      We thank the reviewer for their suggestion. We agree that continuous or interactive tuning of the Bessel beam during imaging would further highlight the practical flexibility of the platform, and changing the Bessel beam parameters during an imaging session is feasible in our tBessel-TPFM implementation. However, for the in vivo applications presented in this manuscript, dynamic tuning during the actual recording is generally not required. In practice, the Bessel beam parameters are selected before data acquisition based on the biological target, desired axial coverage, spatial resolution, and acceptable level of projection overlap.

      In addition, while excitation powers are reported, the manuscript does not place these values in the broader context of known photodamage thresholds for two-photon microscopy, which would be helpful to the readers.

      We thank the reviewer for bringing up this important point. It is known that multiphoton imaging relies on relatively high illumination power, which can cause brain heating and thus photodamage. A previous study reported that continuous illumination with a 920-nm laser beam at 0.8 NA over 1000 s results in a peak temperature increase of ~1.73 °C/100 mW in the brain, with power above 300 mW observed to cause cellular damage, whereas power levels below 250 mW were considered safe for long-term imaging (Podgorski and Ranganathan, 2016). In our experiments, the measured post-objective powers range from 20 mW to 149 mW, which are well below this reported safe threshold.
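      To put the reported powers in context, the quoted heating rate can be applied to the power range used here. This is a rough sketch, not an analysis from the manuscript: it assumes the ~1.73 °C per 100 mW figure scales linearly with power and carries over unchanged to Bessel-beam illumination, neither of which is established in the text, and the constant and function names are hypothetical.

```python
# Values quoted from Podgorski and Ranganathan (2016) in the response above.
HEATING_C_PER_MW = 1.73 / 100  # peak temperature rise per mW (assumed linear)
SAFE_LIMIT_MW = 250            # power considered safe for long-term imaging

def estimated_peak_heating(power_mw):
    """Estimate the peak temperature rise in deg C under linear scaling."""
    return HEATING_C_PER_MW * power_mw

# Lowest and highest post-objective powers reported in the response.
for power_mw in (20, 149):
    rise = estimated_peak_heating(power_mw)
    print(f"{power_mw} mW: ~{rise:.2f} C rise, below limit: {power_mw < SAFE_LIMIT_MW}")
```

      Under these assumptions the estimated rise spans roughly 0.35 °C to 2.58 °C, and both ends of the reported power range sit below the 250 mW level.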

      Denoising/image restoration are applied in one of the in vivo examples, but it is unclear why this step was used specifically for this dataset and whether it was necessary to achieve adequate SNR or primarily included as an additional demonstration.

      We thank the reviewer for requesting clarification on the use of the CARE denoising model. CARE-based denoising was applied only in Figure 5, the microglial imaging example, and was primarily included as an additional demonstration of how neural network–based image restoration can be used to enhance low-SNR volumetric datasets acquired with tBessel-TPFM. All other images and analyses in the manuscript were based on raw data without any denoising. To assess the reliability of the CARE denoising method, we further compared raw and denoised data using 10-frame averages and color-mapped the full 10-minute time-lapse video, both of which showed minimal differences (Author response image 1). These analyses indicate that the CARE denoising model did not introduce structural artifacts or alter the observed biological dynamics in our dataset.

      Reviewer #3 (Public review):

      The manuscript presents an elegant and cost-effective approach for generating a tunable Bessel beam on a conventional two-photon microscope. The authors assemble a compact optical module comprising three axicons and a series of lenses that permits rapid adjustment of both lateral resolution and axial extent without modifying the focal plane. This flexibility enables the system to be readily adapted to a variety of biological preparations. As a proof of concept, the authors employ the device to record blood flow velocities in cortical microcapillaries, arterioles, and venules, thereby directly visualizing vasodilatation and vasoconstriction dynamics and permitting quantitative analysis of neurovascular coupling across cortical layers in awake mice.

      The authors demonstrate that the tunability of the Bessel beam can be exploited to match the numerical aperture to the vessel type: a high NA configuration, albeit slower scan, is optimal for resolving flow in capillaries, whereas a low NA setting provides faster acquisition suitable for arterioles and venules. By implementing a one-dimensional line scan with the Bessel beam, they achieve an imaging speed that is twentyfold faster than conventional frame-by-frame scanning, which proves sufficient to capture hemodynamic transients before and after an induced ischemic stroke.

      In addition to pure observation, the authors integrate a co-propagating Gaussian line to the system, allowing simultaneous imaging and photostimulation within the same focal plane. This capability addresses a common limitation of other Bessel beam implementations, in which the observation and perturbation planes often become misaligned when the Bessel beam is altered. The manuscript also emphasizes the advantage of Bessel beam excitation for calcium imaging after a perturbation, because it captures neuronal activity in planes both above and below the nominal focal plane, signals that would be missed with a standard Gaussian focus. Finally, the authors apply the technique to investigate the neuroimmune response following targeted microglial ablation; they report that adjacent microglia extend processes toward the injury site while retracting processes in the opposite direction.

      Overall, the work offers a technically straightforward yet powerful extension to existing two-photon platforms, providing high-speed, volumetric imaging and stimulation capabilities that are well-suited to a broad range of neurovascular and neuroimmune studies. The experimental validation is quite thorough, and the presented data convincingly illustrates the benefits of the approach.

      Strengths:

      The authors present a truly clever and inexpensive optical module that can be integrated into almost any two-photon microscope, providing a tunable Bessel beam with a minimal modification of the existing system. The experimental data and accompanying quantitative analysis convincingly demonstrate that the system can reveal physiological events, such as capillary flow, calcium transients across multiple axial planes, and microglial process dynamics, that are difficult or impossible to capture with a conventional Gaussian beam. The breadth of experiments chosen for the manuscript illustrates the practical utility of the device and supports the authors' conclusions that it extends the functional repertoire of standard two-photon microscopy.

      We sincerely thank the reviewer for the thoughtful and encouraging feedback. We're glad that the technical design and broad applicability of the tBessel module came through clearly, and we appreciate the recognition of its ease of integration and ability to capture dynamic physiological processes.

      Weaknesses:

      The manuscript would benefit from a more detailed contextualisation of the claimed speed advantage. Although the authors mention other techniques in the introduction, they do not provide any direct comparison with other state-of-the-art high-speed two-photon approaches such as light beads microscopy (Demas et al., Nat. Methods 2021), temporal multiplexing schemes (Weisenburger et al., Cell 2019), or random access microscopy (Villette et al., Cell 2019). A brief comparison of imaging speed, spatial resolution, and instrumental complexity would enable readers to assess the relative merits of the present method.

      We thank the reviewer for this important suggestion. We agree that a more explicit comparison with other high-speed two-photon imaging methods helps clarify the speed advantages of our system. Several existing approaches, including light-beads microscopy (LBM), temporal multiplexing, and AOD-based random-access microscopy, have demonstrated impressive high-speed volumetric imaging capabilities. Light-beads microscopy (Demas et al., 2021) reported imaging over a large volume of 5.4 × 6 × 0.5 mm<sup>3</sup> at 2 Hz. However, this large-volume acquisition used 5-μm lateral pixel sampling, corresponding to an effective lateral resolution of approximately 10 μm. In a more comparable mesoscopic volume, LBM imaged 0.6 × 0.6 × 0.5 mm<sup>3</sup> at 9.6 Hz with 1-μm lateral pixel sampling. In addition, the LBM module uses off-axis reflective concave mirrors, which require careful alignment, and the axial sampling range is not readily tunable. Temporal multiplexing approaches (Weisenburger et al., 2019) reported imaging over approximately 1 × 1 × 0.6 mm<sup>3</sup> at 17 Hz. However, this volume rate was achieved with a relatively coarse spatial resolution of approximately 5 μm, together with a more complex optical design involving multiplexed excitation, detection, and synchronization. AOD-based random-access microscopy (Nadella et al., 2016; Villette et al., 2019) provides very fast point or region sampling, and reported 250 × 250 μm<sup>2</sup> imaging with 512 × 512 pixels and a 50-ns pixel dwell time, corresponding to ~0.5-μm pixel sampling and ~76 frames/s for two-dimensional imaging. However, volumetric imaging requires additional axial sampling, which lowers the effective 3D acquisition rate. In addition, AOD-based systems rely on diffractive beam steering, which introduces light loss due to finite diffraction efficiency and increases optical and calibration complexity.

      In comparison, tBessel-TPFM imaged a 0.4 × 0.4 × 0.12 mm<sup>3</sup> volume at 58 Hz with 0.2-μm lateral pixel sampling. Our largest demonstrated imaging volume reached 2.5 × 2.5 × 0.45 mm<sup>3</sup> while maintaining diffraction-limited lateral resolution. Therefore, compared with these high-speed volumetric approaches, tBessel-TPFM offers a distinct balance of volume rate and spatial sampling, together with simpler implementation.
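
      The AOD raster figures quoted above (~0.5-μm pixel sampling and ~76 frames/s) follow from the stated field of view, pixel count, and dwell time. The short sketch below reproduces that arithmetic. It assumes an ideal raster with no per-line or per-frame overhead, and the function names and the 10-plane volume are illustrative rather than taken from any of the cited papers.

```python
# Back-of-envelope check of the AOD raster figures quoted in the response.
FOV_UM = 250.0          # field of view per side, micrometres (from the text)
PIXELS_PER_SIDE = 512   # pixels per line and lines per frame (from the text)
DWELL_S = 50e-9         # pixel dwell time, seconds (from the text)


def pixel_size_um(fov_um: float, pixels: int) -> float:
    """Lateral pixel sampling for a square raster."""
    return fov_um / pixels


def frame_rate_hz(pixels: int, dwell_s: float) -> float:
    """Upper-bound 2D frame rate, ignoring scan overhead (an assumption)."""
    return 1.0 / (pixels * pixels * dwell_s)


def volume_rate_hz(frame_rate: float, n_planes: int) -> float:
    """Volume rate if planes are acquired one after another.

    n_planes is a hypothetical value chosen only for illustration.
    """
    return frame_rate / n_planes


pixel = pixel_size_um(FOV_UM, PIXELS_PER_SIDE)
rate = frame_rate_hz(PIXELS_PER_SIDE, DWELL_S)

print(f"pixel sampling: {pixel:.2f} um")
print(f"2D frame rate: {rate:.1f} Hz")
print(f"volume rate for 10 planes: {volume_rate_hz(rate, 10):.1f} Hz")
```

      The last line illustrates the point made in the response: once several axial planes must be sampled sequentially, the effective 3D rate is the 2D frame rate divided by the number of planes.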

      A second limitation that warrants discussion is the inherent trade-off between volumetric coverage and image specificity. Because the Bessel beam excites fluorescence throughout an extended axial range, the detector inevitably integrates signal from a three-dimensional volume into a two-dimensional image. In densely labelled tissue, this can lead to significant signal crosstalk, reducing contrast and complicating quantitative interpretation. A brief analysis of how labeling density affects the fidelity of flow or calcium measurements, or suggestions for mitigating crosstalk (e.g., computational deconvolution, adaptive excitation shaping, or combinatorial sparse labeling), would broaden the applicability of the technique.

      We thank the reviewer for highlighting this important trade-off between volumetric coverage and image specificity in Bessel beam imaging. As Bessel beams project fluorescence from multiple features along the z-axis onto the same x–y plane, longer beams expand depth coverage at the same acquisition speed but can confound signals from axially spaced structures (Line 119-121 in manuscript). For densely labeled samples, the probability that structures overlap in their x–y locations is high, and thus a shorter beam should be used. In sparsely labeled samples, structures have a lower probability of overlapping, and thus longer foci can be used (Line 166-168 in manuscript). Additionally, at the same NA, longer Bessel beams have more energy in the side rings surrounding the central peak, which may lead to higher background signal (Line 121-123 in manuscript) (Lu et al., 2017). For these reasons, both NA tuning and independent length tuning (ΔNA tuning) are needed to optimize the Bessel focus length for any given sample, balancing the structural overlap that obscures signal localization against the volumetric speedup according to labeling density and imaging goals. Both are realized in our tBessel design.

      References:

      Demas, J., Manley, J., Tejera, F., Barber, K., Kim, H., Traub, F.M., Chen, B., Vaziri, A., 2021. High-speed, cortex-wide volumetric recording of neuroactivity at cellular resolution using light beads microscopy. Nat Methods 18, 1103–1111. https://doi.org/10.1038/s41592-021-01239-8

      Göbel, W., Helmchen, F., 2007. In Vivo Calcium Imaging of Neural Network Function. Physiology 22, 358–365. https://doi.org/10.1152/physiol.00032.2007

      Grewe, B.F., Voigt, F.F., van ’t Hoff, M., Helmchen, F., 2011. Fast two-layer two-photon imaging of neuronal cell populations using an electrically tunable lens. Biomed Opt Express 2, 2035–2046. https://doi.org/10.1364/BOE.2.002035

      Huang, C., Tai, C.-Y., Yang, K.-P., Chang, W.-K., Hsu, K.-J., Hsiao, C.-C., Wu, S.-C., Lin, Y.-Y., Chiang, A.-S., Chu, S.-W., 2019. All-Optical Volumetric Physiology for Connectomics in Dense Neuronal Structures. iScience 22, 133–146. https://doi.org/10.1016/j.isci.2019.11.011

      Lu, R., Sun, W., Liang, Y., Kerlin, A., Bierfeld, J., Seelig, J.D., Wilson, D.E., Scholl, B., Mohar, B., Tanimoto, M., Koyama, M., Fitzpatrick, D., Orger, M.B., Ji, N., 2017. Video-rate volumetric functional imaging of the brain at synaptic resolution. Nat Neurosci 20, 620–628. https://doi.org/10.1038/nn.4516

      Nadella, K.M.N.S., Roš, H., Baragli, C., Griffiths, V.A., Konstantinou, G., Koimtzis, T., Evans, G.J., Kirkby, P.A., Silver, R.A., 2016. Random-access scanning microscopy for 3D imaging in awake behaving animals. Nat Methods 13, 1001–1004. https://doi.org/10.1038/nmeth.4033

      Podgorski, K., Ranganathan, G., 2016. Brain heating induced by near-infrared lasers during multiphoton microscopy. Journal of Neurophysiology 116, 1012–1023. https://doi.org/10.1152/jn.00275.2016

      Sofroniew, N.J., Flickinger, D., King, J., Svoboda, K., 2016. A large field of view two-photon mesoscope with subcellular resolution for in vivo imaging. eLife 5, e14472. https://doi.org/10.7554/eLife.14472

      Villette, V., Chavarha, M., Dimov, I.K., Bradley, J., Pradhan, L., Mathieu, B., Evans, S.W., Chamberland, S., Shi, D., Yang, R., Kim, B.B., Ayon, A., Jalil, A., St-Pierre, F., Schnitzer, M.J., Bi, G., Toth, K., Ding, J., Dieudonné, S., Lin, M.Z., 2019. Ultrafast Two-Photon Imaging of a High-Gain Voltage Indicator in Awake Behaving Mice. Cell 179, 1590-1608.e23. https://doi.org/10.1016/j.cell.2019.11.004

      Weisenburger, S., Tejera, F., Demas, J., Chen, B., Manley, J., Sparks, F.T., Traub, F.M., Daigle, T., Zeng, H., Losonczy, A., Vaziri, A., 2019. Volumetric Ca2+ Imaging in the Mouse Brain Using Hybrid Multiplexed Sculpted Light Microscopy. Cell 177, 1050-1066.e14. https://doi.org/10.1016/j.cell.2019.03.011

      Yang, W., Carrillo-Reid, L., Bando, Y., Peterka, D.S., Yuste, R., 2018. Simultaneous two-photon imaging and two-photon optogenetics of cortical circuits in three dimensions. eLife 7, e32671. https://doi.org/10.7554/eLife.32671

    1. eLife Assessment

      This important study links allelic expression imbalance with replication timing, suggesting a stochastic model for haploinsufficiency in dosage-sensitive disease. The integration of allele-specific RNA-seq and replication timing in clonal systems provides solid evidence for an association between asynchronous replication and allelic imbalance, although the scope and generality should be addressed in future work. This study will interest epigeneticists and genome regulation researchers studying replication timing and monoallelic expression, as well as developmental biologists and human geneticists concerned with clonal heterogeneity, haploinsufficiency, and variable disease penetrance.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #2 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      The authors pair analysis of replication timing and allele-specific expression in clonal populations of primary human cells. They combine these data with previously published data on clones from transformed human cell lines. They identify a number of genomic regions that display asynchronous replication timing in at least one clone and correlate these regions with allele-specific expression of genes within them. They also observe that several interesting gene sets, including genes that are associated with human diseases, map to asynchronously replicating regions. This is a good experimental approach that builds on already published data demonstrating the connection between allelic imbalance and replication timing.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #2 (Public review):

      Summary:

      The authors pair analysis of replication timing and allele-specific expression in clonal populations of primary human cells. They combine these data with previously published data on clones from transformed human cell lines. They identify a number of genomic regions that display asynchronous replication timing in at least one clone and correlate these regions with allele-specific expression of genes within them. They also observe that several interesting gene sets, including genes that are associated with human diseases, map to asynchronously replicating regions. This is a good experimental approach that builds on already published data demonstrating the connection between allelic imbalance and replication timing.

      - This is a research topic that touches on a few sub-fields of biology, and thus to make the paper more approachable we would recommend a careful edit of the text for clarity and precision of language.

      We thank the reviewers for their thoughtful and constructive comments, which substantially improved our manuscript. In response, we have revised the text and figures throughout to address the points raised.

      - Authors point out that this is a decades-old field; we would suggest using terminology established within the field if possible. Allelic imbalance has been referred to as AI, MAE (monoallelic expression), RMAE (random monoallelic expression) etc. The paper whose mouse data the authors make use of uses Asynchronous Stochastic Replication Timing (ASRT) instead of VERT to refer to the same phenomenon.

      While we agree that allelic expression imbalance has been described by different investigators using many different phrases, we believe that MAE, RMAE and AI do not represent accurate descriptions of the phenomenon. We point out that “Allelic Expression Imbalance” has been used to describe this variable allelic expression by other investigators more than 120 times in the PubMed database. In our study [and our previous study; Nat Commun. 2022; 13(1):6301] we used clonal analysis of allele-specific expression and found that while some clones display equivalent levels of expression between alleles of a given gene (i.e. bi-allelic expression), other clones express only one allele (i.e. mono-allelic expression), and yet other clones have undetectable expression (i.e. silent on both alleles). This pattern of allele-restricted expression indicates that each allele independently adopts either an expressed or silent state. Importantly, because these expression states are mitotically stable, allele-autonomous, and independent of parental origin, we refer to the choice of the expressed allele as stochastic. Given this variability, we believe that the phrase “Allelic Expression Imbalance” (AEI) represents a more accurate descriptor for this phenomenon.

      In addition, the replication asynchrony that exists at these loci is not consistent with purely ASynchronous Replication Timing (ASRT) between alleles. We found that each allele can independently adopt either earlier or later replication timing in different clones. This variability results in some clones exhibiting pronounced asynchrony between alleles, while in others, the two alleles replicate synchronously, with both adopting either the earlier or later timing state. As reported in our previous study (Nat. Commun. 2022; 13:6301), this behavior reflects a stochastic and allele-autonomous process, leading us to describe these loci as exhibiting Variable Epigenetic Replication Timing (VERT), which we believe is a more accurate descriptor of this phenomenon.

      - Methods do not provide sufficient detail to fully evaluate or reproduce these experiments.

      We now provide a more detailed description of how VERT regions were identified, annotated, and quantified, including thresholds for allelic imbalance, replication timing variability, and sampling depth. We also justify the ≥80% AEI cutoff, which is based on recently published studies showing that modest allelic biases can have biological and clinical significance (Nature 2025; 637, 1186-1197). We also refer the readers to our recent description of these methods (Nat. Commun. 2022; 13:6301).

      - It is helpful to show representative loci as the authors do in Fig 1F and G and Fig 2 but these panels are very densely rendered and thus difficult to process visually - even the cartoon version (1D) is thick with overlapping lines. The point that allelic imbalance is enriched in VERTs would be enhanced if the authors could present the allelic ratio for all genes found in all VERTs, demonstrating how replication timing on either chromosome affects the allelic ratio.

      The stochastic nature of the allelic expression and replication timing observed at I/SCs is best visualized with each allele and each transcription unit displayed from multiple clones in the same panel. One of the goals of these figure panels is to emphasize that each I/SC has multiple transcription units that acquire expressed or silent states independently in each clone. Therefore, the expressed or silent status of one allele of a transcription unit does not predict the expression status of the same or opposite allele of any other transcription unit within the same VERT region. In addition, the Early/Late pattern of replication timing that we detect is not correlated with which allele is transcriptionally active (see below). In these figure panels, we display each clone using different colors, each allele as solid or dotted lines, and each transcription unit based on chromosome position. While this arrangement makes for busy images, we believe that this format captures the full breadth of the variability in expression and replication timing that occurs at I/SCs.

      Regardless, because each transcription unit is independent, we now provide the expression ratios for all transcripts that are generated from the VERT regions for the coding and non-coding transcription units in Figures 1, 2, and 6; shown in Supplemental Table 9. This analysis indicated that 4,017 informative reads were derived from the earlier replicating allele and 3,161 informative reads were derived from the later replicating allele, generating an allelic ratio of 1.3 (early/late) and a binomial P value of 1.0.
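
      The pooled allelic ratio quoted above follows directly from the two read totals. The minimal sketch below reproduces that arithmetic, assuming the informative reads are simply summed across transcription units before the ratio is taken. The function names are illustrative, and the binomial test is not reproduced because the response does not state how it was configured.

```python
# Pooled informative read counts quoted in the response.
EARLY_READS = 4017  # reads from the earlier-replicating allele
LATE_READS = 3161   # reads from the later-replicating allele


def allelic_ratio(early: int, late: int) -> float:
    """Early/late ratio of pooled informative reads."""
    return early / late


def early_fraction(early: int, late: int) -> float:
    """Share of all informative reads that come from the earlier allele."""
    return early / (early + late)


ratio = allelic_ratio(EARLY_READS, LATE_READS)
fraction = early_fraction(EARLY_READS, LATE_READS)

print(f"allelic ratio (early/late): {ratio:.1f}")
print(f"fraction of reads from the earlier allele: {fraction:.3f}")
```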

      In addition, a similar analysis of imprinted loci revealed that even at genomic regions with parent-of-origin–specific expression, the replication timing of each allele does not align with transcriptional activity, i.e. both early- and late-replicating alleles can be transcriptionally active, depending on the gene. This observation is consistent with the complex organization of many imprinted domains, where genes on opposite alleles exhibit reciprocal expression patterns. To illustrate this point, we now include Supplemental Figure 1 demonstrating that imprinted loci harbor genes expressed from both the earlier- and later-replicating alleles. In addition, quantification of the total number of informative transcripts at the DLK1/MEG8 imprinted locus (Supplemental Figure 1a-1c) indicates that the ratio of transcripts derived from the early versus late replicating alleles is equivalent (i.e. an allelic expression ratio of 1.0; See Supplemental Table 9).

      - The authors make the important point that VERTs are unlikely to be shared among different cell types and tissues (Fig 1i), but then find an enrichment for neuronal and immune genes in VERT regions identified in ACPs. It follows that these same genes are unlikely to be in such regions in the tissues where they are relevant. Some of the GO terms presented are too broad to suggest any biological significance to the result, even if there is statistical significance (for example, the top term for LCL clones 'Cytoplasm' is associated with 12,000 genes, and the second term for mouse clones 'Membrane' is associated with 10,000). It would be helpful to focus on GO terms lower in the GO hierarchy.

      We now include our complete Gene Ontology analysis, with more specific biological categories, in Supplemental Table 5.

      - Figure 3 highlights the association of related gene clusters with VERTs but the VERTs are assigned based on variable replication timing in just 1 or 2 clones. This is an interesting observation, but to make the point that "VERT regions frequently coincide with gene clusters in the human genome" there needs to be a systematic assessment of replication timing at all gene clusters across all clones, and a statistical test for significance.

      Our intent in Figure 3 was not to suggest that all gene clusters are subject to VERT and AEI, but rather to highlight that several well-characterized multigene families that are known to exhibit AEI, such as olfactory receptor, protocadherin, and HLA gene clusters, coincide with VERT regions at their genomic locations. These examples serve as representative illustrations demonstrating that I/SC-associated regulation occurs at established AEI loci organized in gene clusters.

      To clarify this point, we have revised the text to explicitly state that Figure 3 presents illustrative examples of known AEI-associated gene clusters overlapping with VERT regions, rather than a comprehensive or statistically exhaustive analysis of all gene clusters across the genome.

      - It is an interesting hypothesis that VERTs are conserved between species at syntenic loci. If such regions are really conserved, one would expect that replication timing at these sites would be consistently asynchronous. However the data presented shows that in human clones these VERTs can be specific to an individual donor (as in 5A) or an individual clone (as in 5H).

      As discussed in our Limitations Section, our analysis was restricted to a limited number of cell types, individuals, and clones, which may not capture the full diversity of I/SC usage across tissues and populations. While our dataset was sufficient to identify robust patterns of AEI and VERT, it likely represents only a subset of the broader landscape of I/SC regulation in both humans and mice. We anticipate that future studies incorporating a wider range of tissues, individuals, and clones will uncover an even greater degree of conservation and diversity in I/SC usage across genomes.

      - The finding that VERTs coincide with neurodevelopmental disease genes in immune and cartilage cells is at odds with the previous statements and data about the tissue specificity of VERTs. In order to support the claim that neurodevelopmental disease associated genes reside in asynchronously replicating regions, and are thus more prone to allelic imbalance, it would be helpful if the authors demonstrated this phenomenon in neuronal cells.

      We make two points that address this critique: First, many of the neurodevelopmental disease genes associated with VERT regions are not exclusively expressed in neuronal cells and have previously been shown to exhibit AEI in non-neuronal contexts. For example, Gimelbrant and Chess (Science, 2007; 318:1136–1140) demonstrated AEI of the Parkinson disease genes SNCA and LRRK2 in lymphoblastoid cell lines (LCLs), and our previous study, which also used LCLs, detected AEI of DNAJC6, another Parkinson disease gene (Nat. Commun. 2022; 13:6301). In the present study, using cartilage progenitor cells, we identified VERT and AEI of several epilepsy-associated genes, including SCN1A, SCN2A (Fig. 6b), GABRA1 (Fig. 6e), and SAMD12 (Fig. 6j), as well as a gene implicated in autism and neurodevelopmental disorders, SEMA5A (Fig. 5c), indicating that expression of these genes is not exclusive to neuronal cell types.

      Second, independent studies from the laboratory of Dr. E. Heard have provided further evidence that AEI occurs in neuronal lineages. Using mouse neural progenitor cells (NPCs), they identified genes subject to AEI (Dev. Cell, 2014; 28:366–380) and later evaluated AEI of mouse genes syntenic to human neurodevelopmental disease genes, including Snca, App, Eya4, and Grik2 (Nat. Commun. 2021; 12:5330). In our data, we find that these mouse genes are located within VERT regions. In addition, and consistent with our use of AEI, they used the phrase “Allelic Expression Imbalance” to describe the epigenetic expression biases at these genes.

      Together, these findings reinforce that AEI, and by extension I/SC regulation, is not restricted to specific cell types, but rather represents a generalizable mechanism of stochastic epigenetic regulation that includes genes relevant to neurodevelopment and disease.

      - The authors consistently lean on sparse samples (i.e. a single clone) within a modestly sized dataset (4 clones from 2 donors each) to propose a new model for haploinsufficiency in human disease. It may well be but the consistent focus on limited elements in the data and perhaps an overreach in the interpretation makes it difficult to appreciate the very good experiments presented.

      We agree that our analysis was conducted on a modest number of cell types, individuals, and clones, which we explicitly acknowledge as a limitation of the present study. However, several key points support the robustness and broader relevance of our conclusions:

      i) Clonal Design and Replication: The strength of our approach lies in its clonal resolution. Each clone represents a single-cell–derived population expanded to over a million cells, enabling direct detection of stable, mitotically heritable allele-specific epigenetic states that would not be apparent in population-averaged data. Importantly, many of the VERT regions we identified are shared between independent clones from different donors and across distinct cell types (ACP and LCL), demonstrating reproducibility and biological consistency.

      ii) Cross-Species Validation: We further identified syntenic VERT regions in mouse pre-B cell clones, including at loci known to exhibit AEI in prior studies, providing independent validation and evolutionary conservation of the phenomenon.

      iii) Integration with Published Evidence: Our findings extend prior observations of AEI and VERT (e.g. Gimelbrant et al. Science 2007; Heskett et al. Nat. Commun. 2022) and are fully consistent with known stochastic allelic expression imbalance of autosomal genes.

      iv) We also draw parallels with the absence of cellular selection mechanisms that dictate dominant inheritance patterns for loss-of-function alleles of X-linked disease genes (reviewed in: J Clin Invest, 2008, 20-23; and Nat Rev Genet. 2025, 26, 571–580). Our proposed model linking I/SC regulation to haploinsufficiency is therefore a synthesis of our results with an extensive body of published data, not an inference drawn from isolated observations.

      v) Scope and Framing: We have revised the manuscript to clarify that our proposed model represents a mechanistic framework, not a definitive or exclusive explanation, for how stochastic allelic regulation could contribute to dosage-sensitive disease phenotypes. We also explicitly discuss the need for larger datasets and additional tissues to refine and test this model.

      - This section refers to the revised version of the paper. We would like to thank the authors for the changes and explanations offered. Although we don't fully agree with a few answers offered, overall the answers and changes in the manuscript have significantly improved the work presented. As such it should be of interest to many readers.

      We thank the reviewers for their thoughtful evaluation and constructive feedback. We appreciate their recognition that the revisions have strengthened the manuscript and are pleased that they find the work to be of broad interest.

    1. eLife Assessment

      This valuable study uses technically compelling long-term in vivo recordings and computational modeling to investigate whether hawkmoth olfactory receptor neurons show circadian modulation of spontaneous firing. The authors further propose the provocative model that post-translational mechanisms, rather than the transcriptional-translational processes, may contribute to circadian regulation of neuronal excitability. However, the evidence for circadian firing in these neurons, and for post-translational modification of Orco as the underlying mechanism, remains incomplete. In contrast, the study does provide strong evidence that the application of cyclic nucleotides can modulate Orco-dependent activity at a single time point, and reports that the temporal pattern of Orco transcript abundance is not circadian.

    2. Joint Public Review:

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several assumptions that underlie their data analysis and model builds, as well as insufficient biological data including critical controls to validate and/or fully justify the model the authors are proposing.

      Strengths:

      The authors raise several intriguing model-based hypotheses regarding the mechanisms that underlie the generation of olfactory rhythms. The electrophysiological approach and the long-term recording paradigm are elegant and technically impressive. In the revised version, the authors have added additional qPCR data supporting the lack of rhythmic Orco transcript expression and included a new figure suggesting that cAMP can modulate Orco conductance.

      Major weaknesses:

      (1) The cAMP experiment was conducted at only one time point, which is insufficient to support the central claim that "cAMP and cGMP may have ZT-dependent effects on Orco conductivity".

      (2) The revised manuscript continues to rely heavily on prior publications or defers key mechanistic questions (or important manipulations) to future studies. In its current form, the evidence presented remains insufficient to support the central claim that a PTFL constitutes the primary underlying circadian clock mechanism. The proposed model is intriguing, but the data provided do not yet directly demonstrate the novel mechanism.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several questionable assumptions that underlie their data analysis and model builds, as well as insufficient biological data, including critical controls to validate and/or fully justify the model the authors are proposing.

      We thank the reviewers for their thorough and thoughtful comments and believe that the manuscript is much stronger now after the revision which incorporates the requested changes. We added results of new experiments and additional analyses. Although these new insights did not change the previous conclusions, we significantly reworked the Discussion and added further references to clarify the conclusions we want to make.

      Please note that we used ORN as acronym for “olfactory receptor neuron” throughout the manuscript. ORNs contain odorant receptors (ORs), and in insects these ORs associate with the olfactory receptor co-receptor (Orco) to be trafficked to the membrane of the cilium of the ORN, where they can be contacted by pheromones and odorants. In Manduca sexta, evidence is accumulating for G-protein coupled metabotropic pheromone transduction and not for OR-Orco dependent ionotropic transduction, as shown for Drosophila melanogaster. In both insect species, besides its chaperone function, Orco can form leaky cation channels, which can regulate the spontaneous spiking activity of ORNs. In this study, we explored this role of Orco.

      Strengths:

      The study is notable for its combination of long-term in vivo tip recordings with computational modeling, which is technically challenging and adds weight to the authors' claims. The link between Orco, cyclic nucleotides, and circadian regulation is potentially important for sensory neuroscience, and the modeling framework itself - a stochastic Hodgkin-Huxley formulation that explicitly incorporates channel noise - is a solid and forward-looking contribution. Together, these elements make the study conceptually bold and of clear interest to circadian and olfactory biologists.

      Major weaknesses:

      At the same time, several limitations temper the conclusions. The pharmacological evidence relies on a single antagonist and concentration, without key controls. The circadian analysis is based on relatively small numbers of neurons, with rhythms detected only in subsets, and the alignment procedure used in constant darkness raises concerns of bias. The molecular evidence is sparse, with only three qPCR timepoints, and the model, while creative, rests on assumptions that are not yet fully supported by in vivo data.

      Please see our responses to the detailed comments.

      Detailed comments are provided below:

      (1) The role for Orco proposed in the authors' model largely stems from the effects seen following the administration of (a single dose) of the Orco antagonist, OLC15. However, this hypothesis is undercut by the lack of adequate pharmacological controls, including a basic multipoint OLC15 dose-response series in addition to the administration of blockers for the other channels that are embedded in their model, but which were ruled out as being involved in the modulation of biological rhythms. In addition, these studies would (ideally) also benefit from the inclusion of the same concentration (series) of an inactive OLC15 analog to better control for off-target effects.

      The Orco agonist VUAA1 (Jones et al., 2011) binds directly to Orco and increases the channel open time probability. In M. sexta hawkmoths, we have already published that VUAA1 increases the low spontaneous activity of ORNs in a dose-dependent fashion (Nolte et al., 2013). Chen and Luetje (2012) systematically varied the chemical structure of VUAA1 to identify new Orco ligands and discovered 22 Orco ligand candidates (OLCs) that either activated or inhibited Orco. In their heterologous expression system, Orco was most sensitive to inhibition by OLC15. Based on these results, we published a dose-response curve of OLC15 inhibition (1-100 µM) using in vivo tip recordings of pheromone-sensitive long trichoid sensilla of M. sexta (Nolte et al., 2016). There, we also demonstrated that OLC15 dose-dependently antagonizes the VUAA1-dependent activation of Orco.

      Furthermore, we tested other published Orco antagonists, which were characterized in heterologous assays, in primary cell cultures of hawkmoth ORNs, as well as in in vivo assays in intact hawkmoths. We focused on amiloride-derived antagonists, because we previously identified an amiloride-sensitive cation channel in hawkmoth ORNs. We found that, in contrast to OLC15, the amilorides HMA and MIA were not Orco-specific antagonists but instead affected different ion channel targets depending on the time of day (Nolte et al., 2016). Based on those experiments and the dose-response curves, we determined that the Orco agonist VUAA1 (Jones et al., 2011) and the Orco antagonist OLC15 (Chen and Luetje, 2012) worked best in hawkmoth ORNs to target Orco pharmacologically. Because of those results and comparative tests with other published Orco antagonists, we have since used a dose of 50 µM OLC15 in all further experiments as the most adequate to antagonize Orco function in Manduca. In the current study, we focus on Orco without excluding the possibility that other ion channels in the ORNs contribute to the control of membrane potential rhythms.

      We have clarified the Methods section accordingly.

      (2) The expression pattern of Orco was assessed using qPCR at only three timepoints. Rhythmic transcripts can easily be missed with such sparse sampling (Hughes et al., 2017). A minimum of six evenly spaced timepoints across a 24-hour cycle would be required to confidently rule out circadian transcriptional regulation. In addition, the use of the timeless mRNA control from another study is not acceptable. Furthermore, qPCR analysis measures transcript abundance, not transcription, as the authors repeatedly state. Transcriptional studies would require nuclear run-off or, more recently, can be done with snRNAseq analysis. Taken together, these concerns undermine the authors' desire to rule out TTFL-based control that directly led them to implicate a PTFL-based model.

      We agree with the referees that more time points and a direct comparison between timeless and Orco mRNA levels should be included in this manuscript. We included these additional qPCR experiments and edited the manuscript to make clear that we measure transcript abundance, but we will not perform snRNAseq analysis due to time and financial constraints.
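
      The reviewers' sampling concern in (2) can be illustrated with a minimal sketch. It assumes a single-component 24-h cosinor model (mesor plus cosine and sine terms), which is a generic stand-in and not the analysis used in the manuscript; the timepoints, the random seed, and the function name `cosinor_fit` are hypothetical. Because this model has three free parameters, three timepoints are always fitted exactly, even when the data are pure noise, so no residual is left to test against. Six timepoints leave residual degrees of freedom.

```python
import numpy as np

def cosinor_fit(t_hours, y, period=24.0):
    """Least-squares fit of y = M + a*cos(wt) + b*sin(wt).

    Returns the fitted parameters (M, a, b) and the residual sum of squares.
    """
    w = 2.0 * np.pi / period
    t = np.asarray(t_hours, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: constant term, cosine term, sine term.
    X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((X @ params - y) ** 2))
    return params, rss

rng = np.random.default_rng(0)           # hypothetical seed, for repeatability only

t3 = np.array([0.0, 8.0, 16.0])          # three evenly spaced timepoints (assumed)
t6 = np.arange(0.0, 24.0, 4.0)           # six evenly spaced timepoints (assumed)

# Pure noise: there is no rhythm in either data set.
y3 = rng.normal(size=t3.size)
y6 = rng.normal(size=t6.size)

_, rss3 = cosinor_fit(t3, y3)
_, rss6 = cosinor_fit(t6, y6)

print(f"residual with 3 timepoints: {rss3:.2e}")  # numerically zero: exact fit
print(f"residual with 6 timepoints: {rss6:.2e}")  # nonzero: fit can be tested
```

      In this sketch a perfect fit to three timepoints says nothing about rhythmicity, which is why denser sampling is needed before a rhythm can be either supported or ruled out.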

      (3) The modelling presented is based on Orco as a ZT-dependent conductance tied to the cAMP oscillations that were reported by this group in the cockroach and from the presence and functionality in Manduca of homomeric Orco complexes that are devoid of tuning ORs. While these complexes have been generated in cell culture and other heterologous expression systems, as well as presumably exist in vivo in the Drosophila empty neuron and other tuning OR mutants, there is no evidence that these complexes exist in wild-type Manduca ORNs. While this doesn't necessarily undermine every aspect of their models, the authors should note the presence of Orco/OR complexes rather than Orco homomeric complexes.

      Our ELISAs found circadian oscillations in cAMP levels not only in antennae of the Madeira cockroach (Schendzielorz et al., 2014, 2012), but also in hawkmoth antennae (Schendzielorz et al., 2015). For clarification, we added the 2015 citation to the Modeling chapter in the Methods section.

      We agree with the referees that we cannot distinguish between Orco homo- and heteromers in the different compartments of our hawkmoth ORNs but we know that both are expressed in the pheromone-sensitive ORNs. Thus, as the referee suggests, we added text regarding the presence and localization of OR-Orco heteromers. Consistent data collected across different experiments (heterologous expression systems, primary cell cultures of hawkmoth ORNs, in vivo/in situ studies) support that Orco homomers are present in hawkmoth ORNs. In addition to co-expression of MsexOrco and MsexSNMP-1 with either MsexOr-1 or MsexOr-4 in a heterologous expression system, MsexOrco expression alone was already sufficient to increase intracellular Ca<sup>2+</sup> levels spontaneously as a result of its property as leaky, non-specific cation channel, and in response to VUAA1 application (Nolte et al., 2013). Both in developing hawkmoth pupae and differentiating primary cell cultures of hawkmoth ORNs, Orco expression started during a developmental time window where ORNs did not yet express pheromone receptors but where Orco affected spontaneous activity and intracellular Ca<sup>2+</sup> levels dependent on VUAA1 (Nolte et al., 2016). In vitro patch clamp studies of differentiating cultured hawkmoth ORNs during this time window of pupal development characterized ion channels/currents with properties of Orco as a leaky, non-specific cation channel/current that depends on protein kinase C and cyclic nucleotides (Dolzer et al., 2021, 2008; Krannich and Stengl, 2008; Stengl, 1993). Thus, Orco homomers are present in developing hawkmoth ORNs during a time window where ORNs already express spontaneous activity but they do not heteromerize with pheromone receptors. 
      However, we do not know whether and in what ratio homo- and heteromers of Orco and ORs are present in the respective sensillum compartments of adult hawkmoths because all OR-specific antibodies tested did not work in immunocytochemical studies of hawkmoth antennae (Nolte et al., 2013; Stengl, 1994; Stengl and Hildebrand, 1990). Our hypothesis of differential distribution of Orco homomers in the soma and dendrite compartments, and OR-Orco heteromers in the cilia, is based on differential immunocytochemical localization of Drosophila ORs mainly in the cilia compartment (Benton et al., 2006).

      We clarified our manuscript accordingly.

      (4) Some aspects of the authors' models, most notably the decision to phase align/optimize their DD and OLC15 recordings, are likely to bias their interpretations.

      There is consensus that insects display daily and circadian rhythms in pheromone-dependent mating, odor-gated feeding, and egg-laying behavior that phase-lock to environmental rhythms, corresponding with daily/circadian rhythms of sensory neuron physiology (e.g., Merlin et al., 2007; Rymer et al., 2007; Schendzielorz et al., 2015, 2012). However, circadian rhythms can be easily masked by stress, such as the disturbances during an experimentally very challenging long-term recording experiment over several days. In addition, we observed over the years in our animal-rearing facility that in 17:7 light-dark cycles the originally nocturnal hawkmoths M. sexta distribute their activity patterns over the course of the day, so that we find nocturnal as well as diurnal hawkmoths. Thus, light-dark cycles were not enough to ensure phase-synchronized behavioral rhythms, and it is very likely that the nocturnal hawkmoths, in addition to stress signals, rely heavily on pheromone/odor-dependent synchronization, as also found in other moth species (Ghosh et al., 2024). Because we focus on spontaneous activity and not on pheromone-dependent physiology in this study, we used isolated males that were never exposed to the female pheromones, taking phase dispersal into account. Therefore, it became necessary in free-running conditions to first determine the respective behavioral rhythm for each animal, and then to phase-align their activity patterns to allow for statistical analysis. Otherwise, circadian differences would average out in a phase-dispersed free-running population. As requested by the referees in point (7), we added RAIN to test for rhythmicity in each of our recordings and revised the manuscript accordingly.
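
      The averaging argument in this response can be illustrated with a minimal sketch under strong simplifying assumptions: four noise-free cosine traces with a 24-h period and known, evenly dispersed peak phases. All values and names are hypothetical, and this is not the alignment procedure used in the manuscript. With real recordings each animal's phase has to be estimated from the same data, which is the source of the bias concern raised in (4).

```python
import numpy as np

def peak_to_trough(x):
    """Peak-to-trough amplitude of a trace."""
    return float(np.max(x) - np.min(x))

period = 24.0
t = np.arange(0.0, 72.0, 1.0)              # 3 days sampled hourly (assumed)
phases = np.array([0.0, 6.0, 12.0, 18.0])  # hypothetical peak phases in hours

# One idealized rhythmic trace per animal, each peaking at its own phase.
traces = np.array([np.cos(2.0 * np.pi * (t - p) / period) for p in phases])

# Averaging without alignment: the dispersed phases cancel each other.
raw_mean = traces.mean(axis=0)

# Phase alignment: shift every trace so that its peak falls at t = 0.
# np.roll wraps around, which is exact here because the record length
# (72 h) is a whole number of periods and the shifts are whole hours.
aligned = np.array([np.roll(tr, -int(p)) for tr, p in zip(traces, phases)])
aligned_mean = aligned.mean(axis=0)

print(f"amplitude of raw average:     {peak_to_trough(raw_mean):.3f}")
print(f"amplitude of aligned average: {peak_to_trough(aligned_mean):.3f}")
```

      In this idealized case the unaligned average is flat although every individual trace is fully rhythmic, whereas the aligned average recovers the full amplitude.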

      Furthermore, in preliminary experiments we briefly exposed hawkmoths to pheromone the night before the start of the experiment. However, we failed to obtain phase-synchronized spiking rhythms. Most likely, a circadian pattern of pheromone exposure would have been necessary as zeitgeber, which could not be used here due to long-term pheromone-dependent effects on spiking activity. These results are added as a supplementary figure to Fig 3.

      (5) The tip recordings from long trichoid sensilla are critical aspects of this study. These recordings were carried out on upper sensillar tips located on the distal-most second annulus. Since there are approximately 80 annuli on the Manduca antennae, it is unclear whether the recordings are representative of the antennal response.

      We think the reviewers might have misinterpreted our description of the recording site. In the Methods, we state that we clip off the 20 most distal annuli (leaving a stump of about 60 annuli) and insert the reference electrode into the flagellum up to the second annulus from the cut end, i.e., the recording sites are located at 2/3 – 3/4 of the antenna length as seen from the head of the animal. We clarified this in the Methods section.

      In addition, our lab showed with antibody staining against Orco that apparently all ORNs that innervate long and short trichoid sensilla along the whole flagellum express the same staining pattern (Nolte et al., 2016). Lee and Strausfeld (1990) mapped all types of antennal sensilla, and together with pheromone-dependent tip-recordings of Kaissling et al. (1989) it was shown that most of the male antennal sensilla are pheromone-sensitive long trichoid sensilla, with one of the two innervating ORNs always responding to bombykal, ensuring high sensitivity to pheromone detection. Furthermore, our patch clamp recordings of primary cell cultures of whole male antennae found largely overlapping ion channel populations across ORNs (review: (Stengl, 2010)). This would indicate that all ORNs, whether they express ORs sensitive to pheromone or general odorants, could potentially share the same Orco-dependent spontaneous activity rhythms. Furthermore, in our lab, different experimenters from different years who recorded from long trichoid sensilla on different annuli did not detect obvious differences in either the spontaneous activity or the pheromone responses (cf. Dolzer et al., 2003; Gawalek and Stengl, 2018; Schneider et al., 2025). Thus, it is very likely that we are reporting a general encoding mechanism that is not locally restricted along the antennal flagellum and is very likely shared by all types of OR-Orco expressing ORNs.

      (6.1) The authors do not provide any data in support of their cAMP/cGMP-based Orco gating…

      There are publications supporting cyclic nucleotide gating of Orco in Drosophila, but only after previous phosphorylation via protein kinase C (PKC; review: (Wicher and Miazzi, 2021)). Since Orco is very conserved among insect species, it is likely that PKC- and cGMP/cAMP-dependent regulations are present for Orco in other insect species. To test this, we are currently characterizing second messenger-dependence of spontaneous spiking activity, which is the focus of a follow-up manuscript. Nevertheless, to provide more evidence for our hypothesis of the current manuscript, we added a new set of tip-recording experiments that demonstrate cAMP-dependent gating of Orco. Because of the addition of this figure, we merged figures 8-10 into Figure 8 and added the cAMP data as Figure 9.

      (6.2) … and the PTFL model proposed is somewhat disappointing.

      For a detailed introduction of our PTFL membrane clock hypothesis please see our opinion paper that we refer to in the manuscript (Stengl and Schneider, 2024). We added clarification of how Orco activation can influence cAMP levels. A more elaborate PTFL clock model including many more of the identified ion channels in hawkmoth ORNs is the focus of another manuscript to come.

      (6.3) The model seems to be influenced by their long-held proposal that insect olfactory signaling has a critical metabotropic component involving cyclic nucleotides, PKC, etc, a view that may be influenced by the use of Orco homomeric complexes generated in HEK cells.

      Indeed, we propose a metabotropic pheromone-transduction cascade, which in moths and cockroaches is based on G-protein-mediated activation of phospholipase C but not on adenylyl cyclase activation. Our hypothesis is not influenced by HEK cell heterologous expression studies of Orco but is supported by our own work comparing in vivo tip recordings of intact hawkmoths with patch clamp experiments on hawkmoth primary cell cultures of olfactory receptor neurons, which are able to respond to their species-specific pheromones in vitro (Schneider et al., 2025; Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In addition, a multitude of publications by other laboratories with in vivo and in vitro studies using physiological, genetic, and immunocytochemical assays all support a metabotropic signal transduction cascade in insect olfaction (Stengl, 2010; Stengl and Funk, 2013; Takagi et al., 2025; Wicher and Miazzi, 2021). In contrast, the hypothesis suggesting a solely ionotropic pheromone- and general odor-dependent transduction cascade for all insect species is based on very sparse experimental evidence, based primarily on studies in heterologous expression systems such as HEK cells that lack the insect’s WT molecular surroundings, and thus, cannot predict OR-Orco function in vivo. Furthermore, the ionotropic hypothesis is heavily based upon the argument that an inverse 7TM receptor cannot couple to G-proteins, which lacks careful backup via biochemical and structural studies. In addition, the ionotropic hypothesis lacks support via carefully performed physiological in vivo studies in different insect species that paid attention to analysis of the distinct kinetic components of ORNs' odor/pheromone responses and that employ physiological concentrations and durations of odor/pheromone stimuli (please see our most recent publication by Schneider et al. (2025)). We added references to the possible odor transduction mechanisms to the introduction.

      (6.4) Nevertheless, structural studies on Orco do not support a cyclic nucleotide binding site, although PKC-based phosphorylation has been implicated in the fine-tuning/adaptation of olfactory signaling.

      While structural studies did not find evidence for conserved known cyclic nucleotide binding sites on Orco, this does not exclude the presence of indirect cAMP effects via, e.g., Orco subunits complexing with other molecules under direct cAMP control, such as other ion channel subunits. Furthermore, it does not exclude so far unknown binding sites, or sites that fold out only after a specific sequence of previous phosphorylations of the many phosphorylation sites on Orco. Indeed, physiological studies in Drosophila presented evidence for cyclic nucleotide dependence of Orco after previous PKC-dependent phosphorylation (Getahun et al., 2013). Our ongoing in vivo experiments in hawkmoths further corroborate a zeitgeber time-dependent PKC- and cyclic nucleotide-dependent modulation of Orco. These detailed studies will be published in a follow-up publication. In the revised version of this manuscript, we added tip-recording experiments that indicate cAMP involvement in Orco gating (new Figure 9).

      (7) Because only 5/11 LD and 7/10 DD animals showed daily rhythms, with averages lacking clear daily modulation, the methods are not sufficiently reliable to reveal novel underlying mechanisms of circadian rhythm generation. The reported results are therefore not yet reliable or quantifiable. To quantify their results, the authors should apply tests for circadian rhythmicity using methods such as RAIN, JTK CYCLE, MetaCycle, or Echo. The use of FFT and Wavelet is applauded, but these methods do not have tests of significance for rhythms and can be biased when analyzing data in which there could only be 1-3 circadian cycles. Because the conclusions appear to be based on 11-12 neurons that were recorded for 2-4 days, the reader is concerned that the methods are not yet perfected to provide strong evidence for circadian regulation of spontaneous firing of ORNs. The average data (e.g., Figure 3Bii and 3Cii) highlight the apparent lack of daily rhythms. In summary, the results would be more compelling if more than 50% of the recordings had significant circadian amplitudes and with similar periods and phases.

      The long-term tip-recordings of intact hawkmoths are very challenging and take a very long time to accomplish; thus, we are very happy that we succeeded in obtaining so many of them (N=40). We are thankful for the reviewers’ suggestion to use RAIN, since this analysis revealed circadian rhythms in 7 of 11 LD recordings, 8 of 12 DD recordings, and 2 of 12 OLC15 recordings. Please see also our response to (4) above, commenting on the phase-dispersal of activity rhythms observed in our experiments, as well as in the behavior of hawkmoth males in the mating cage.
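
      As a worked check of these counts against the reviewers' criterion in (7) that more than 50% of recordings be rhythmic, the following sketch computes the fraction of rhythmic recordings per condition. The counts are those reported in this response; the variable names are hypothetical, and the snippet does not rerun RAIN or any other statistical test.

```python
from fractions import Fraction

# (rhythmic recordings, total recordings) per condition, as reported above.
rain_counts = {"LD": (7, 11), "DD": (8, 12), "OLC15": (2, 12)}

# Exact fractions avoid floating-point rounding in the threshold comparison.
rhythmic_fraction = {cond: Fraction(k, n) for cond, (k, n) in rain_counts.items()}
above_half = {cond: frac > Fraction(1, 2) for cond, frac in rhythmic_fraction.items()}

for cond, (k, n) in rain_counts.items():
    share = float(rhythmic_fraction[cond])
    print(f"{cond}: {k}/{n} = {share:.1%}, above 50%: {above_half[cond]}")
```

      With these counts, the LD and DD conditions lie above the 50% mark and the OLC15 condition lies below it.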

      (8) The statement that circadian patterns of ORN firing are lost with the Orco antagonist (OLC15) is not strongly supported. The manuscript should be revised to quantify how Orco changed circadian amplitude in the 12 recorded neurons. Measures of circadian amplitude can avoid confusing/vague statements like Line 394 “low and high frequency bands appeared to merge during the activity phase around ZT 0 in the animals that showed clear circadian rhythms (N = 5 of 11 in LD)”. The conclusion that Orco blocks circadian firing appears to be contradicted by Figure 6, which indicates that ~6 of these neurons had circadian periods detected by wavelet. The manuscript would be strengthened with details about the specificity and reproducibility of the Orco antagonist. The authors quantify the gradual decrease in firing with the slope of a linear fit to estimate how the “effectiveness [of OLC15] increased over time.” They conclude that the drug “obliterated circadian rhythms and attenuated the spontaneous activity in several, but not all experiments (N = 8 of 12).” The report would be greatly strengthened with corroborating data from additional Orco antagonists and additional doses of OLC15 (the authors use only 50 µM OLC15).

      Following the valuable suggestions of the referees, we used RAIN to detect circadian rhythms in the spiking attributes in each individual animal. Since only 2 of 12 animals displayed a circadian rhythm in OLC15, statistical comparison of circadian amplitudes is not possible. We revised the results section accordingly and added to the figure legend to make it clearer that the heat maps in Fig 5 are each representative of one animal and not averages across animals.

      As the reviewer states correctly in (7), wavelet results of circadian rhythmicity must be interpreted carefully because of the low number of circadian cycles in ~3-4 day recordings. Since the heatmaps in Figure 5 visually revealed the presence of ultradian rhythms, the main focus of the wavelet analysis in Figure 6 is on the detection and quantification of ultradian periods up to 20 h.

      We revised the Methods section to include references to previous experiments that characterized the effect of different doses of OLC15 and other Orco antagonists and agonists in M. sexta antennae (Nolte et al., 2016). Please see also our response to (1).

      (9) The manuscript includes several statements that are more speculation than conclusion. For example, there is no evidence for tuning or plasticity in this report. Statements like the following should be removed or addressed with experiments that show changes in odor response specificity or sensitivity: "ORN signalosomes are highly plastic endogenous PTFL clocks comprising receptors for circadian and ultradian Zeitgebers that allow to tune into internal physiological and external environmental rhythms as basis for active sensing." (Discussion Line 622). The paper concludes that (line 380) "mean frequency of spontaneous spiking and the frequency of bursting expressed daily modulation, and are both most likely controlled via a circadian clock that targets the leak channel Orco." This is too bold given the available results.

      We revised the manuscript accordingly and clarified which statements are supported via published evidence and which are predictions based upon our novel hypothesis published in our opinion paper (Stengl and Schneider, 2024).

      (10.1) Because Orco conductance is modulated by cyclic nucleotides, it remains highly plausible that circadian regulation occurs upstream at the level of signaling pathways (e.g., calcium, calcium-binding proteins, GPCRs, cyclases, phosphodiesterases).

      We agree with the referees that it is very likely that there are multiple layers of interconnected feedback cycles that control Orco localization and activity. Our novel hypothesis suggests interlocked TTFL and PTFL control of physiological circadian rhythms, not strictly hierarchical TTFL control, which would require a daily turnover of membrane proteins and transcriptional control via the established TTFL clock in insect ORNs. We are currently searching for TTFL control at all levels of odor/pheromone transduction using ZT-dependent transcriptomics in combination with qPCR and single-nucleus transcriptomics, involving also all the molecules suggested by the referees. These studies are ongoing, are very time- and money-consuming, and are beyond the scope of this manuscript. However, we added a set of experiments to this manuscript in which we demonstrate that the effect of increased cAMP on the spontaneous spiking activity is mediated by Orco (new Figure 9).

      (10.2) The possibility that circadian oscillations of cyclic nucleotides are generated by the canonical TTFL mechanism has not been excluded. In fact, extensive work in Drosophila has demonstrated that the TTFL-based molecular clock proteins are required for circadian rhythms in olfaction.

      Our experiments that test circadian TTFL control at different levels of the cAMP transduction cascade in hawkmoth antennae are under way and are part of another publication. In section 6.2 we already stated that our experiments do not exclude that Orco is under indirect control of the TTFL. We revised our discussion accordingly.

      The experiments published for TTFL-dependent control of Drosophila olfaction that we are aware of (Krishnan et al., 1999; Tanoue et al., 2004) do not exclude interlinked PTFL and TTFL clocks. Krishnan et al. (1999) demonstrated that the TTFL clock in antennal olfactory receptor neurons correlates with circadian rhythms in odor responses measured in electroantennogram (EAG) recordings, not in single sensillum recordings as in our experiments. EAG recordings comprise not only voltage responses of the olfactory sensory neurons but also voltage changes generated in non-neuronal antennal cells such as trichogen and tormogen cells that build the transepithelial potential gradient via vATPases that generates the high K<sup>+</sup> concentration in the sensillum lymph (Jain et al., 2024; Klein, 1992; Thurm and Küppers, 1980). In addition, EAG recordings most likely contain responses of afferent neurons originating from somata in the brain that maintain central control of the antennae. Thus, EAG recordings are difficult to interpret.

      (11) A defining feature of circadian oscillators is the feedback mechanism that generates a time delay (e.g., PERIOD/TIMELESS repressing their own transcription). While the authors describe how cyclic nucleotides can regulate Orco conductance, they do not provide a convincing explanation of how Orco activity could, in turn, feed back into the proposed PTFL to sustain oscillations. For these reasons, the authors should consider:

      (a) Providing a broader discussion of non-TTFL models of circadian rhythms (e.g., redox cycles, post-translational modifications).

      We revised the discussion accordingly.

      (b) Reassessing Orco expression using a higher-resolution temporal sampling (≥6 timepoints per 24 h).

      We added those experiments to the revised version of the manuscript (see our response to (2)).

      (c) Clarifying or revising the PTFL model to explicitly address how feedback would be achieved. Alternatively, the data may be more consistent with Orco conductance rhythms being regulated by post-translational mechanisms downstream of the canonical TTFL oscillator, as suggested by the Drosophila olfactory system literature.

      We added possible negative feedback elements to the Discussion to explain how our proposed PTFL could in principle work independently of the TTFL clock.

      Minor weaknesses:

      (1) The authors should compare the firing patterns of ORN neurons to the bursts, clusters, and packets of retinal efferent spikes reported in Liu JS and Passaglia CL (2011; JBR). By comparing measures in moths to measures in Limulus, the authors might be able to address the question: Is the daily firing pattern of ORN neurons likely a conserved feature of circadian control of sensory sensitivity?

      We have revised the discussion accordingly.

      (2) The methods need further details. For example, it is unclear if or how single neuron activity was discriminated and whether the results were compromised by the relatively large environmental fluctuations in temperature (21-27 °C), humidity (35-60%), or other cues known to modulate spontaneous firing.

      These large fluctuations stem from performing experiments in different seasons (higher temperature and humidity in the summer months, lower temperature and humidity in winter). Throughout each individual experiment, conditions were stable. We clarified the Methods section accordingly.

      Recommendations for the authors:

      The authors should post the code for their computational model to a repository like GitHub.

      The code for the computational model is now available at https://github.com/a-c-schneider/VijayanForlinoEtAl2025_Model.git

      References

      Benton R, Sachse S, Michnick SW, Vosshall LB. 2006. Atypical Membrane Topology and Heteromeric Function of Drosophila Odorant Receptors In Vivo. PLOS Biology 4:e20. DOI: https://doi.org/10.1371/journal.pbio.0040020

      Chen S, Luetje CW. 2012. Identification of New Agonists and Antagonists of the Insect Odorant Receptor Co-Receptor Subunit. PLOS ONE 7:e36784. DOI: https://doi.org/10.1371/journal.pone.0036784

      Dolzer J, Fischer K, Stengl M. 2003. Adaptation in pheromone-sensitive trichoid sensilla of the hawkmoth Manduca sexta. Journal of Experimental Biology 206:1575–1588. DOI: https://doi.org/10.1242/jeb.00302

      Dolzer J, Krannich S, Stengl M. 2008. Pharmacological Investigation of Protein Kinase C- and cGMP-Dependent Ion Channels in Cultured Olfactory Receptor Neurons of the Hawkmoth Manduca sexta. Chemical Senses 33:803–813. DOI: https://doi.org/10.1093/chemse/bjn043

      Dolzer J, Schröder K, Stengl M. 2021. Cyclic nucleotide-dependent ionic currents in olfactory receptor neurons of the hawkmoth Manduca sexta suggest pull–push sensitivity modulation. European Journal of Neuroscience 54:4804–4826. DOI: https://doi.org/10.1111/ejn.15346

      Gawalek P, Stengl M. 2018. The Diacylglycerol Analogs OAG and DOG Differentially Affect Primary Events of Pheromone Transduction in the Hawkmoth Manduca sexta in a Zeitgebertime-Dependent Manner Apparently Targeting TRP Channels. Frontiers in Cellular Neuroscience 12:218. DOI: https://doi.org/10.3389/fncel.2018.00218

      Getahun MN, Olsson SB, Lavista-Llanos S, Hansson BS, Wicher D. 2013. Insect Odorant Response Sensitivity Is Tuned by Metabotropically Autoregulated Olfactory Receptors. PLOS ONE 8:e58889. DOI: https://doi.org/10.1371/journal.pone.0058889

      Ghosh S, Suray C, Bozzolan F, Palazzo A, Monsempès C, Lecouvreur F, Chatterjee A. 2024. Pheromone-mediated command from the female to male clock induces and synchronizes circadian rhythms of the moth Spodoptera littoralis. Current Biology 34:1414-1425.e5. DOI: https://doi.org/10.1016/j.cub.2024.02.042, PMID: 38479388

      Jain K, Prelic S, Hansson BS, Wicher D. 2024. Expression of Drosophila melanogaster V-ATPases in Olfactory Sensillum Support Cells. Insects 15:1016. DOI: https://doi.org/10.3390/insects15121016

      Jones PL, Pask GM, Rinker DC, Zwiebel LJ. 2011. Functional agonism of insect odorant receptor ion channels. Proceedings of the National Academy of Sciences 108:8821–8825. DOI: https://doi.org/10.1073/pnas.1102425108

      Kaissling KE, Hildebrand JG, Tumlinson JH. 1989. Pheromone receptor cells in the male moth Manduca sexta. Archives of Insect Biochemistry and Physiology 10:273–279. DOI: https://doi.org/10.1002/arch.940100403

      Klein U. 1992. The insect V-ATPase, a plasma membrane proton pump energizing secondary active transport: immunological evidence for the occurrence of a V-ATPase in insect ion-transporting epithelia. Journal of Experimental Biology 172:345–354. DOI: https://doi.org/10.1242/jeb.172.1.345

      Krannich S, Stengl M. 2008. Cyclic Nucleotide-Activated Currents in Cultured Olfactory Receptor Neurons of the Hawkmoth Manduca sexta. Journal of Neurophysiology 100:2866–2877. DOI: https://doi.org/10.1152/jn.01400.2007

      Krishnan B, Dryer SE, Hardin PE. 1999. Circadian rhythms in olfactory responses of Drosophila melanogaster. Nature 400:375–378. DOI: https://doi.org/10.1038/22566

      Lee JK, Strausfeld NJ. 1990. Structure, distribution and number of surface sensilla and their receptor cells on the olfactory appendage of the male mothManduca sexta. Journal of Neurocytology 19:519–538. DOI: https://doi.org/10.1007/BF01257241

      Merlin C, Lucas P, Rochat D, François M-C, Maïbèche-Coisne M, Jacquin-Joly E. 2007. An Antennal Circadian Clock and Circadian Rhythms in Peripheral Pheromone Reception in the Moth Spodoptera littoralis. Journal of Biological Rhythms 22:502–514. DOI: https://doi.org/10.1177/0748730407307737

      Nolte A, Funk NW, Mukunda L, Gawalek P, Werckenthin A, Hansson BS, Wicher D, Stengl M. 2013. In situ Tip-Recordings Found No Evidence for an Orco-Based Ionotropic Mechanism of Pheromone-Transduction in Manduca sexta. PLOS ONE 8:e62648. DOI: https://doi.org/10.1371/journal.pone.0062648

      Nolte A, Gawalek P, Koerte S, Wei H, Schumann R, Werckenthin A, Krieger J, Stengl M. 2016. No Evidence for Ionotropic Pheromone Transduction in the Hawkmoth Manduca sexta. PLOS ONE 11:e0166060. DOI: https://doi.org/10.1371/journal.pone.0166060

      Rymer J, Bauernfeind AL, Brown S, Page TL. 2007. Circadian rhythms in the mating behavior of the cockroach, Leucophaea maderae. Journal of Biological Rhythms 22:43–57. DOI: https://doi.org/10.1177/0748730406295462, PMID: 17229924

      Schendzielorz J, Schendzielorz T, Arendt A, Stengl M. 2014. Bimodal Oscillations of Cyclic Nucleotide Concentrations in the Circadian System of the Madeira Cockroach Rhyparobia maderae. Journal of Biological Rhythms 29:318–331. DOI: https://doi.org/10.1177/0748730414546133

      Schendzielorz T, Peters W, Boekhoff I, Stengl M. 2012. Time of Day Changes in Cyclic Nucleotides Are Modified via Octopamine and Pheromone in Antennae of the Madeira Cockroach. Journal of Biological Rhythms 27:388–397. DOI: https://doi.org/10.1177/0748730412456265

      Schendzielorz T, Schirmer K, Stolte P, Stengl M. 2015. Octopamine Regulates Antennal Sensory Neurons via Daytime-Dependent Changes in cAMP and IP3 Levels in the Hawkmoth Manduca sexta. PLOS ONE 10:e0121230. DOI: https://doi.org/10.1371/journal.pone.0121230

      Schneider AC, Schröder K, Chang Y, Nolte A, Gawalek P, Stengl M. 2025. Hawkmoth Pheromone Transduction Involves G-Protein–Dependent Phospholipase Cβ Signaling. eNeuro 12:ENEURO.0376-24.2024. DOI: https://doi.org/10.1523/ENEURO.0376-24.2024, PMID: 39880675

      Stengl M. 2010. Pheromone Transduction in Moths. Frontiers in Cellular Neuroscience 4:133. DOI: https://doi.org/10.3389/fncel.2010.00133

      Stengl M. 1994. Inositol-trisphosphate-dependent calcium currents precede cation currents in insect olfactory receptor neurons in vitro. Journal of Comparative Physiology A 174:187–194. DOI: https://doi.org/10.1007/BF00193785

      Stengl M. 1993. Intracellular-Messenger-Mediated Cation Channels in Cultured Olfactory Receptor Neurons. Journal of Experimental Biology 178:125–147. DOI: https://doi.org/10.1242/jeb.178.1.125

      Stengl M, Funk NW. 2013. The role of the coreceptor Orco in insect olfactory transduction. Journal of Comparative Physiology A 199:897–909. DOI: https://doi.org/10.1007/s00359-013-0837-3

      Stengl M, Hildebrand JG. 1990. Insect olfactory neurons in vitro: morphological and immunocytochemical characterization of male-specific antennal receptor cells from developing antennae of male Manduca sexta. Journal of Neuroscience 10:837–847. DOI: https://doi.org/10.1523/JNEUROSCI.10-03-00837.1990, PMID: 2319305

      Stengl M, Schneider AC. 2024. Contribution of membrane-associated oscillators to biological timing at different timescales. Frontiers in Physiology 14:1243455. DOI: https://doi.org/10.3389/fphys.2023.1243455

      Takagi S, Abuin L, Mermet J, Lee D, Benton R. 2025. A GPCR signaling pathway in insect odor detection. DOI: https://doi.org/10.1101/2025.10.03.680299

      Tanoue S, Krishnan P, Krishnan B, Dryer SE, Hardin PE. 2004. Circadian Clocks in Antennal Neurons Are Necessary and Sufficient for Olfaction Rhythms in Drosophila. Current Biology 14:638–649. DOI: https://doi.org/10.1016/j.cub.2004.04.009, PMID: 15084278

      Thurm U, Küppers J. 1980. Epithelial physiology of insect sensilla. In: Locke M, Smith DS (Eds). Insect Biology in the Future. Academic Press. p. 735–763. DOI: https://doi.org/10.1016/B978-0-12-454340-9.50039-2

      Wicher D, Miazzi F. 2021. Functional properties of insect olfactory receptors: ionotropic receptors and odorant receptors. Cell and Tissue Research 383:7–19. DOI: https://doi.org/10.1007/s00441-020-03363-x

    1. eLife Assessment

      This well-designed, valuable study uses isotope tracing to analyse how iron limitation alters TCA cycle metabolism in Mycobacterium tuberculosis, revealing potential antibiotic targets for non-replicating bacteria in the host. The evidence is solid, providing insights into metabolic remodelling under iron-limited conditions.

    2. Reviewer #1 (Public review):

      M. tuberculosis exhibits metabolic flexibility, enabling it to adapt to various environmental stresses, including antibiotic treatment. In this manuscript, Serafini et al. investigate the metabolic remodeling of M. tuberculosis used to survive iron-limited conditions by employing LC-MS metabolomics and 13C isotope tracing experiments. The results demonstrate that metabolic activity in the oxidative branch of the TCA cycle slows down, while the reductive branch is reverted to facilitate the biosynthesis of malate, which is subsequently secreted.

      Overall, this study is experimentally well-designed, particularly the use of 13C isotope tracing to monitor TCA cycle remodeling under iron-limited conditions. The findings are valuable as they offer potential new targets for antibiotics aimed at non-replicating M. tuberculosis occurring in the hosts.

      Comments on revised version:

      All concerns are well addressed.

      I have one minor concern: Page 3 line 16 - Fig. 1G & H: The kinetics of ATP levels between H37Rv and Erdman seem different; Erdman induces greater ATP at days 2 and 3 after DFO treatment, which was not clear in H37Rv. Fig. 1I shows NAD/NADH ratio not NADH/NAD ratio. Please change it to NADH/NAD+ to be consistent with Supplement Fig. 1 result. Include the 17-day result of NADH/NAD+ in the discussion section to explain the different viability between the two strains.

    3. Reviewer #2 (Public review):

      Summary:

      The authors investigated how prolonged iron limitation (which stops growth but does not lead to cell death) alters central metabolism in M. tuberculosis. The major tool they used is metabolomics combined with stable isotope tracing. They show that the Krebs cycle is still active, despite the fact that it is dependent on some iron-dependent enzymes. They show that carbon flux through the oxidative branch of the Krebs cycle is stalled, resulting in the accumulation of metabolites, such as malate and alpha-ketoglutarate, that are partially secreted. Apparently, the carbon flux from glycolysis is partially diverted to the reductive branch of the Krebs cycle. This is not achieved by using the glyoxylate shunt but probably through the GABA shunt. This unprecedented split of the Krebs cycle and malate secretion allows a continuous flow of carbon through the core of carbon metabolism, overcoming the metabolic stalling triggered by iron starvation.

      Strengths:

      Novel insight into the central metabolism of a major pathogen and its adaptation to iron starvation. Carefully conducted experimentation. The paper ends with a clear and helpful model.

      Weaknesses:

      The authors show some surprising and important findings, but would need a little more effort to really substantiate these. Especially the role of the GABA shunt should be genetically tested, as they did for ICL and the glyoxylate shunt.

      Also, dataset 1 is not very convincing: it is only based on transcriptomics and shown with up or down, hardly a strong base for major conclusions. The very least you want is actual differences, preferably on the protein level, where it really counts.

      Comments on the revised version:

      In the revised version all these points were appropriately dealt with and discussed, although some of them were addressed textually rather than experimentally, for reasons that are logical.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This well-designed, valuable study uses isotope tracing to analyse how iron limitation alters TCA cycle metabolism in Mycobacterium tuberculosis, revealing potential antibiotic targets for non-replicating bacteria in the host. The findings provide insights into metabolic remodelling under iron-limited conditions. Whilst some of the evidence is solid, the data around the GABA shunt is incomplete, requiring genetic validation, as was done for the glyoxylate shunt. Questions remain about the underlying mechanisms and their specific role in M. tuberculosis pathogenesis.

      We thank the Editor and the reviewers for the positive evaluation of our work and for the constructive comments, which helped us improve the manuscript. We have carefully considered all the points raised and addressed them to the best of our ability. Regarding the GABA shunt, we acknowledge that genetic validation would significantly strengthen our conclusions; as this was not feasible within the revision timeframe, we have revised the relevant section by adopting more cautious language and have included genetic validation among the future perspectives. Additionally, we have expanded the discussion to address the relevance of our findings in the context of Mtb pathogenesis and host-pathogen interaction. A point-by-point response to each comment is provided below.

      We also made minor adjustments to the main text and figures:

      We removed “normalised” from the Y-axis of Figure 1 (the data are normalised and the procedure is described in the Materials and Methods).

      We rearranged the order of a paragraph in the Introduction: the first paragraph “During infection pathogenic bacteria […] extensively investigated” has been moved down (page 2, lines 8-12).

      We edited two sentences in the Introduction (page 2, lines 4-7).

      Supplementary Information: we added the following sentence at page 4, lines 23-24: “The probability of the Figure 3 and 4–figure supplement 1E scenario should be equivalent to that of the Figure 3 and 4–figure supplement 1F scenario.”

      We made minor typing adjustments: page 3, lines 30 and 31; page 4, lines 11-12 and 22-24; page 5, lines 23-24; page 7, line 6; page 12, lines 28 and 32.

      We added details to the Materials and Methods section at page 17, lines 1 and 19-21.

      Public Reviews:

      Reviewer #1 (Public review):

      M. tuberculosis exhibits metabolic flexibility, enabling it to adapt to various environmental stresses, including antibiotic treatment. In this manuscript, Serafini et al. investigate the metabolic remodeling of M. tuberculosis used to survive iron-limited conditions by employing LC-MS metabolomics and 13C isotope tracing experiments. The results demonstrate that metabolic activity in the oxidative branch of the TCA cycle slows down, while the reductive branch is reverted to facilitate the biosynthesis of malate, which is subsequently secreted.

      Overall, this study is experimentally well-designed, particularly the use of 13C isotope tracing to monitor TCA cycle remodeling under iron-limited conditions. The findings are valuable as they offer potential new targets for antibiotics aimed at non-replicating M. tuberculosis occurring in the hosts. However, despite these strengths, the reviewer has concerns regarding the mechanistic basis underlying the observed metabolic remodeling and its role in M. tuberculosis pathogenesis.

      We thank the reviewer for the positive evaluation of our work and for the constructive comments. Regarding the role of the observed metabolic remodelling in Mtb pathogenesis, we have expanded the discussion to address this aspect, contextualising our findings within the framework of Mtb infection and host-pathogen interaction (page 13, lines 28-37; page 14, lines 1-23). Detailed responses to each specific comment are provided below.

      Major comments

      The authors argue that iron starvation is a physiologically relevant stressor encountered by M. tuberculosis post-infection. Using Erdman and H37Rv strains under DFO conditions, Erdman loses viability, whereas H37Rv maintains it. Nonetheless, both strains exhibit similar metabolic remodeling in the TCA cycle based upon metabolomics and isotope tracing data. The authors should clarify the specific metabolic adaptations in H37Rv that enable it to sustain viability under DFO conditions.

      We thank the reviewer for this observation. Following additional experiments performed in response to subsequent comments, we re-analysed the secreted metabolite data and monitored ATP, NADH, and NAD<sup>+</sup> levels over 17 days in both the Erdman and H37Rv strains. The results were concordant between the two strains, supporting the hypothesis that the decrease in CFU/mL over time does not reflect a loss of viability, but rather entry into a non-culturable state or, alternatively, an increased tendency to aggregate in liquid culture. Comments have been added at page 3, lines 16-24 and page 5, lines 30-36.

      A mechanistic explanation of how Mtb sustains viability under iron starvation is provided at page 13, lines 28-37.

      The authors report no significant changes in NAD/NADH and ATP levels in H37Rv and Erdman exposed to DFO conditions. They observe TCA cycle remodeling, particularly the reversal of the reaction between OAA and MAL, catalysed by malate dehydrogenase, an enzyme that uses NAD+ and NADH as cofactors. The directionality of this reaction likely depends on the relative levels of NAD+ and NADH. Additionally, other dehydrogenases, such as pyruvate DH and aKG DH, also require NAD+/NADH cofactors.

      We thank the reviewer for this important observation. We agree that the directionality of the malate dehydrogenase reaction, as well as the activity of other NAD<sup>+</sup>/NADH-dependent dehydrogenases, is likely influenced by the redox state of the cell. We therefore measured the NADH/NAD<sup>+</sup> ratio over 17 days in both strains under DFO conditions. We also note that the Y-axis title in Figure 1 was incorrectly reported and has been corrected accordingly. Results and interpretation of these new data are provided at:

      page 3 lines 16-21

      page 11 lines 16-36

      page 12 lines 1-9

      page 13 lines 3-5

      In Figure 1I, NAD+ and NADH levels are monitored only at day 3 post-exposure to DFO conditions. Since Erdman loses viability after 2-3 weeks, the authors should include measurements of NAD+, NADH, and ATP levels at weekly intervals up to 3 weeks.

      We thank the reviewer for this suggestion. As recommended, we extended the monitoring of NAD<sup>+</sup>, NADH, and ATP levels over 17 days in both strains. Results and interpretation have been discussed together and are reported in the manuscript. Please refer to the response above for the relevant page and line references.

      Furthermore, glycine levels - which are linked to NAD+ recycling via the conversion of glyoxylate - should be measured under both HI and DFO conditions as an indirect indicator of the NAD+/NADH ratio.

      We thank the reviewer for this comment. However, we believe that glycine levels cannot be considered a reliable indirect indicator of the NAD<sup>+</sup>/NADH ratio, as glycine is involved in multiple metabolic pathways. It can originate from serine, threonine, glyoxylate, or protein degradation, and can be incorporated into proteins, degraded to CO<sub>2</sub> and NH<sub>4</sub><sup>+</sup>, converted to glyoxylate, or transformed into other amino acids. Because of this metabolic versatility, glycine levels lack the specificity required to reliably reflect the cellular NAD<sup>+</sup>/NADH ratio. In addition, we could not find a single study that claims that glycine levels can be used as indicators of the NAD<sup>+</sup>/NADH ratio.

      Nevertheless, this comment prompted us to examine glycine levels and isotopologue distribution under iron deprivation. Glycine levels showed no consistent trend under DFO conditions, remaining unchanged or increasing in both the Erdman and H37Rv backgrounds.

      Importantly, the isotopologue distribution analysis led us to conclude that glyoxylate is not a key precursor of glycine under iron starvation. This new analysis is described at page 10 (lines 1-20), and a new supplementary figure has been added, Figure 3 and 4 – figure supplement 3.

      In Figure 2A, it is unclear why a 100-fold accumulation of aKG does not correspond proportionally to the accumulation of (iso)citrate.

      We thank the reviewer for this observation. We agree that this point required clarification and have added a comment addressing this apparent discrepancy in the main text at page 4, lines 12–17.

      The authors state that fumarate, aKG, (iso)citrate, malate, and pyruvate are secreted under DFO conditions. While the secretion of aKG and pyruvate makes sense, given their marked intracellular accumulation, it is puzzling why (iso)citrate, malate, and fumarate are secreted even though there are no changes in their intracellular abundance.

      To rule out the possibility that these metabolites are released due to bacterial lysis rather than active secretion, the authors should analyze the 13C-labeled fractions of these metabolites in the culture filtrate using the M. tuberculosis culture in media containing 13C glycerol.

      We thank the reviewer for this important observation.

      Regarding the possibility of cell lysis, although it cannot be completely ruled out, several observations indicate that the increase in extracellular malate was not due to lysis. If substantial cell lysis had occurred, we would expect a general increase in all extracellular metabolites. However, the extracellular fumarate and succinate levels remained unchanged in both strains under DFO (similarly to the control conditions, HI and LI). Glutamate was detected in the culture filtrate, but its abundance increased only under HI conditions, not under DFO, in either H37Rv or Erdman. The lack of increase in extracellular glutamate, fumarate and succinate therefore suggests that, even if some cell lysis occurred, it was minimal and did not significantly affect our observations.

      Regarding the 13C-fractions, we note that it is unclear how the labelling profile would differ if the extracellular metabolites derived from cell lysis. Nevertheless, as suggested by the reviewer, we compared the labelled fractions of extracellular isocitrate, malate, fumarate and glutamate. The comparison revealed variations consistent with two blocks in the carbon flow occurring at the levels of pyruvate and alpha-ketoglutarate, resulting in a slowdown in the downstream flux.
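
      As an illustration of what a "labelled fraction" means in a 13C tracing experiment, the following is a minimal sketch and is not taken from the manuscript. It assumes that isotopologue abundances (M+0, M+1, ..., M+n) are already available for a metabolite and that no natural-abundance correction is needed; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def labelled_fraction(isotopologues: Sequence[float]) -> float:
    """Share of the metabolite pool carrying at least one 13C atom.

    `isotopologues` lists abundances for M+0, M+1, ..., M+n in that order.
    """
    total = sum(isotopologues)
    if total <= 0:
        raise ValueError("isotopologue abundances must sum to a positive value")
    return 1.0 - isotopologues[0] / total


def fractional_enrichment(isotopologues: Sequence[float]) -> float:
    """Average fraction of carbon positions occupied by 13C."""
    total = sum(isotopologues)
    n_carbons = len(isotopologues) - 1
    if total <= 0 or n_carbons < 1:
        raise ValueError("need a positive total and at least M+0 and M+1")
    # Weight each isotopologue by the number of labelled carbons it carries.
    weighted = sum(i * abundance for i, abundance in enumerate(isotopologues))
    return weighted / (n_carbons * total)


# Hypothetical ion counts for a four-carbon metabolite such as malate
# (M+0 ... M+4); these numbers are invented for illustration only.
example_malate = [40.0, 20.0, 20.0, 10.0, 10.0]
print(labelled_fraction(example_malate))
print(fractional_enrichment(example_malate))
```

      Comparing these two quantities for the same metabolite in cell extracts and in culture filtrate is one simple way to look at labelling profiles side by side; real analyses would normally also correct for naturally occurring 13C.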

      A description of these new considerations has been added at page 5 (lines 27-36) including the Figure 2 – figure supplement 2 and a new section of SI-Appendix. Therefore, we are confident that the selective appearance of some but not all metabolites in culture filtrates is consistent with secretion but not cell lysis.

      To validate the role of the PCK-mediated reductive TCA cycle in malate biosynthesis and secretion under DFO conditions, the authors should generate a malate dehydrogenase (MDH) knockdown strain, considering that MDH is essential, and examine the 13C labeling patterns and NAD/NADH under DFO conditions.

      The authors also observe decreased GABA abundance and overall 13C labeling in DFO conditions, suggesting that the GABA shunt is the primary route for succinate biosynthesis under DFO conditions. Thus, it is strongly recommended that the authors perform a 13C glutamate tracing experiment to directly track labeling in aKG and GABA shunt metabolites, providing more definitive evidence for the involvement of the GABA shunt.

      We thank the reviewer for these valuable suggestions. We fully agree that both experiments would significantly strengthen the conclusions of our work.

      Regarding the MDH knockdown strain, we acknowledge that this experiment would provide direct validation of the PCK/PCA-mediated reductive TCA cycle in malate biosynthesis. However, generating a knockdown strain in Mtb is a technically demanding and time-consuming process, requiring several months even under optimal conditions, which makes it unfeasible within the revision timeframe. We have therefore incorporated this experiment as a future perspective in the conclusions, highlighting its importance for further validating the proposed model.

      Regarding the GABA shunt, we took the reviewer's comment as an opportunity to critically re-evaluate the strength of our data. As a result, we have revised the manuscript by merging the GABA shunt discussion with the glyoxylate shunt section, while adopting more cautious language in the concluding statement to reflect its hypothetical nature. The related figures have been moved to the Supplementary Materials. These aspects have been included among the future perspectives in the conclusions. Page 11, lines 10-13; page 14, lines 3-7.

      Reviewer #2 (Public review):

      Summary:

      The authors investigated how prolonged iron limitation (which stops growth but does not lead to cell death) alters central metabolism in M. tuberculosis. The major tool they used is metabolomics combined with stable isotope tracing. They show that the Krebs cycle is still active, despite the fact that it is dependent on some iron-dependent enzymes. They show that carbon flux through the oxidative branch of the Krebs cycle is stalled, resulting in the accumulation of metabolites, such as malate and alpha-ketoglutarate, that are partially secreted. Apparently, the carbon flux from glycolysis is partially diverted to the reductive branch of the Krebs cycle. This is not achieved by using the glyoxylate shunt but probably through the GABA shunt. This unprecedented split of the Krebs cycle and malate secretion allows a continuous flow of carbon through the core of carbon metabolism, overcoming the metabolic stalling triggered by iron starvation.

      Strengths:

      Novel insight into the central metabolism of a major pathogen and its adaptation to iron starvation. Carefully conducted experimentation. The paper ends with a clear and helpful model.

      Weaknesses:

      The authors show some surprising and important findings, but they would need a little more effort to really substantiate these. Especially the role of the GABA shunt should be genetically tested, as they did for ICL and the glyoxylate shunt.

      We thank the reviewer for the positive evaluation of our work. We agree that genetic validation of the GABA shunt would significantly strengthen our conclusions. However, generating the required mutant strains in Mtb is a technically demanding and time-consuming process that is unfeasible within the revision timeframe. In light of this, we have revised the manuscript by merging the GABA shunt discussion with the glyoxylate shunt section. This reorganization contextualizes the GABA shunt within a broader discussion, while adopting more cautious language in the concluding statement to reflect its hypothetical nature. Future genetic validation, including the generation of appropriate mutant strains, has been included among the future perspectives in the conclusions.

      Page 11, lines 10-13; page 14, lines 3-7.

      Also, dataset 1 is not very convincing, it is only based on transcriptomics and shown with up or down; this is not a strong base for major conclusions. As a minimum, one would want actual differences, preferably on the protein level, where it really counts.

      We thank the reviewer for this comment. We would like to clarify that Dataset S1 compiles transcriptomic and proteomic data from previously published studies, which represent the rational basis of our investigation. These data are consistently cited throughout the manuscript. The dataset was included solely as a convenience tool for the reader, to provide easy access to the relevant published information. To avoid any misunderstanding regarding its scope, we have renamed the file to 'Dataset S1 - Publicly available transcriptomic datasets referenced in this study'. Our conclusions derive from the integration of these published data with the novel biochemical and metabolomic evidence generated in this study. Further, to assist the reader, we added a clarifying description at the top of the “DE” column.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Clarify the definitions of "growth defect" and "growth arrest" under LI and DFO conditions, respectively.

      (2) In Figure 2A, specify the unit of the y-axis. Is it on a log scale?

      (3) Raw data of metabolomics and 13C isotope tracing experiments should be either deposited in public websites or provided as a separate file.

      We thank the reviewer for these comments.

      Regarding the definition of 'growth defect' and 'growth arrest': we replaced 'defect' with 'slowdown' to better reflect the observed phenotype under LI conditions.

      Regarding Figure 2A: we have specified the unit of the Y-axis and clarified whether the scale is logarithmic in the figure legend. We have done this for all figures containing charts with a logarithmic X or Y axis. We added secondary tick marks in the charts of Figure 5G.

      Regarding raw data availability: the metabolomics data have been deposited in the Zenodo database. The reference number has been added to the manuscript.

      Reviewer #2 (Recommendations for the authors):

      It is mentioned that measurement of the activity of these two enzymes in cell-free extracts revealed the presence of PCA activity in the DFO condition (Figure 5E), but not of MEZ activity (data not shown). Activity measurements are a great added value, but then activities should be shown, also for MEZ.

      We thank the reviewer for this suggestion. We agree that showing enzyme activity data adds value to the manuscript. As recommended, activity measurements have been included in the supplementary materials (Figure 5 – figure supplement 1).

    1. eLife Assessment

      This paper provides a novel and valuable method to improve the accuracy of predictions of the impact of insecticide-treated net (ITN)-based strategies for malaria control and elimination by using sub-national estimates of the duration of ITN access and use over time, derived from cross-sectional survey data and the number of ITNs each country receives annually. The authors propose a sophisticated methodological framework that accounts for many sources of uncertainty, providing compelling evidence.

    2. Reviewer #1 (Public review):

      This paper aims to improve the accuracy of predictions of the impact of ITN strategies by developing a method to estimate the duration of ITN access and use over time on a subnational scale from cross-sectional survey data and the numbers of ITNs received annually. The subnational estimates are then input into a mathematical model to predict clinical cases under different ITN distribution strategies.

      Strengths:

      The approach is novel and addresses a useful and timely topic. It makes use of available routine data, and has considered all of the relevant components of ITN distributions.

      The authors have made revisions, particularly to the methods, appendices and title - leaving the paper easier to follow, and with a clear, consistent aim. The assumptions are clearly stated.

      Weaknesses:

      The weaknesses are shared with other models of a similar complexity - it is not easy for a casual reader to fully understand the model or the implications of the assumptions which were required to be made. That routine data is used is good for availability, but data quality may be an issue in some places.

    3. Reviewer #2 (Public review):

      Summary:

      The authors design a custom Bayesian model to estimate the probabilities of access, use and use given access of insecticide-treated nets in six African countries, providing sub-national estimates and inferring the average duration of ITN use and access. An individual-based model was employed to simulate malaria epidemics and estimate the effectiveness of different ITN distribution strategies. The study finds that the mean probability of use or access did not reach 80% (a universal coverage formerly targeted by WHO) for any of the regions even for biennial campaigns, demonstrates that switching from triennial to biennial distribution campaigns increases population use by 7.9%, and evaluates the impact of employing more efficient ITNs on P. falciparum prevalence.

      Strengths:

      The authors developed a data-driven model that accounts for data collection imperfections and sources of uncertainty while differentiating between ITN use and access. They developed a methodology to infer the timing of mass campaigns from publicly available data instead of assuming fixed dates. The probability of use given access allows determining the regions where ITN distribution is least effective. This work can help better inform future interventions by identifying regions where increasing mass campaign frequency or employing better ITNs is most effective. Finally, in addition to insights on ITN access and use for the six countries analyzed, the paper contributes a methodological framework that can likely be extended to other countries.

      Weaknesses:

      Since the models employed are rather complex, the methodology description may be hard to follow for some readers. In addition, the models rely on many assumptions, including exponential decay of ITN use/access and narrow prior distributions. It is worth noting that, in the revised version of the manuscript, the authors justified the choice of exponential decay and narrow prior distributions, and made a significant effort to clarify the methodology and the model equations.
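
      To make the exponential-decay assumption concrete, the following is a deliberately simplified sketch and not the authors' Bayesian model. It assumes that each mass campaign resets ITN use to a fixed starting level and that use then decays exponentially with a given mean duration; the function names and all parameter values are hypothetical.

```python
import math


def use_since_campaign(years_since: float, initial_use: float,
                       mean_duration: float) -> float:
    """Probability of ITN use `years_since` years after the last campaign."""
    if years_since < 0 or mean_duration <= 0:
        raise ValueError("years_since must be >= 0 and mean_duration > 0")
    return initial_use * math.exp(-years_since / mean_duration)


def mean_use_over_cycle(interval: float, initial_use: float,
                        mean_duration: float) -> float:
    """Time-averaged use across one campaign cycle of `interval` years.

    This is the integral of initial_use * exp(-t / mean_duration) from 0 to
    `interval`, divided by `interval`.
    """
    if interval <= 0 or mean_duration <= 0:
        raise ValueError("interval and mean_duration must be positive")
    ratio = mean_duration / interval
    return initial_use * ratio * (1.0 - math.exp(-interval / mean_duration))


# Hypothetical values: 80% use right after a campaign, 2-year mean duration.
triennial = mean_use_over_cycle(interval=3.0, initial_use=0.8, mean_duration=2.0)
biennial = mean_use_over_cycle(interval=2.0, initial_use=0.8, mean_duration=2.0)
print(f"triennial mean use: {triennial:.3f}")
print(f"biennial mean use:  {biennial:.3f}")
```

      Under these assumptions the time-averaged use always falls below the post-campaign level and is higher for shorter campaign intervals. The sketch ignores carry-over of older nets, continuous distribution channels and parameter uncertainty, all of which the manuscript's framework is described as addressing.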

      Comments on revised version:

      I appreciate the improvements made to the text. The methodology description is much clearer now. I have no further suggestions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper provides a novel method to improve the accuracy of predictions of the impact of ITN strategies, by using sub-national estimates of the duration of ITN access and use over time from cross-sectional survey data and annual country ITNs received.

      Strengths:

      The approach is novel, makes use of available data, and has considered all of the relevant components of ITN distributions.

      Weaknesses:

      (W1.1) The main message of the paper was not very clear, and did not seem to fit the title. The title focuses on sub-national tailoring of ITN, but the abstract did not feature results directly about SNT. It was not very clear what the main result of the paper was - there are several ITN observations in the results and discussion. Most did not seem to be directly about SNT, but rather sub-national differences in use and access were accounted for in the analyses. It was not clear if the same conclusions would be reached without accounting for sub-national differences, but the estimates and predictions could be expected to be more accurate.

      Thank you for highlighting this. We agree the title could be improved to better reflect the main messages of the paper and have now updated it to “Heterogeneity of use, access and retention of insecticide-treated nets: implications for subnational tailoring to maximise malaria control”. All parameters are estimated at a subnational level; this is not always the case at a national level. We therefore do not have national-level models without subnational differences that our results could be compared to.

      (W1.2) Some of the results seemed to me to be apparent even without a modelling exercise (eg high coverage could not be maintained between campaigns, use would be higher with 2-yearly distributions rather than 3-yearly) or were not in themselves new insights (eg estimates of the duration of use). It would be helpful to clearly state what the novel results are in the abstract, the first paragraph of the discussion and the conclusions, and to make sure that the title is consistent.

      It is our understanding that assessments of ITN coverage are often made from infrequent surveys, for example from MIS. These are typically conducted six months post-campaign and may miss notable reductions in use and access beyond this. Comparisons of ITN use and access are also frequently made directly between DHS surveys, which can be misleading in isolation if the time between campaigns and surveys is not considered. We have tried to highlight this more clearly in relation to Burkina Faso with the following text:

      “The observed decrease in use and access across many regions in Burkina Faso may therefore be a by-product of DHS surveys being conducted at progressively later dates relative to the most recent campaign; this does not necessarily indicate an underlying trend in decreasing use or access over longer timescales.”

      We do believe modelling exercises, such as the methodology presented here, can help generate better estimates of ITN use and access over time than estimates from surveys alone, which can be biased by the relative timings of campaigns. It is also our understanding that previous studies have generated national estimates of ITN retention. We are not aware of any previous studies that have estimated the duration ITNs continue to be used for, which is arguably of greater epidemiological importance than retention time. To the best of our knowledge, these have also not been estimated at subnational scales previously.

      We acknowledge the novelty of some results was not clearly presented previously and are grateful to the reviewer for highlighting this. We have now highlighted some of the novel findings more clearly in the abstract, with the following text:

      “However, subnational variation in ITN retention and the duration that ITNs remain in use have not previously been quantified.”

      “Our results highlight that although transmission intensity remains an important factor for subnational tailoring of malaria control interventions, other factors, such as ITN use given access, meaningfully influence optimal deployment strategies.”

      We have also highlighted the novelty and relevance of our findings more clearly in the first paragraph of the discussion, with the following text:

      “Funding constraints have also increased the need for consideration of subnational tailoring, with many recommendations being made on the basis of transmission intensity in the World Health Organisation (2025) Subnational Tailoring Reference Manual. However, a key uncertainty in assessing the potential impact of different ITN interventions has been how long nets remain in use rather than how long they are retained, and how this varies between regions. Here, to our best knowledge, we present the first estimates of subnational variation in ITN retention and the duration that ITNs remain in use, and also quantify for the first time how ITN use, access and retention vary between subnational regions across multiple African countries. Our work supports the change in guidance to optimal coverage as it highlights ITN interventions have notable differences in impact between settings, and that distributing fewer but more effective ITNs, particularly pyrethroid-chlorphenapyr products, is likely to be more impactful than maximising long-term coverage through increased campaign frequencies with pyrethroid-only ITNs. Our work also broadly supports World Health Organisation (2025) recommendations for subnational tailoring, particularly the consideration of deprioritisation of ITN distribution in very low transmission settings. However, our results provide new indications that deprioritisation of areas with higher ITN use given access may lead to greater resurgences in cases, highlighting that subnational tailoring decisions could be optimised further by considering additional factors to transmission intensity alone.”

      The novelty and relevance of our results are also now highlighted in the following text, which has been incorporated into the concluding paragraph:

      “In conclusion, the work indicates that universal coverage targets of 80% are unlikely to be consistently met due to waning overall ITN use in the intervening years between triennial mass campaigns. Improved coverage can be achieved through more frequent biennial distributions, though this is unlikely to be feasible at scale given the current funding landscape. Indeed, when resources are constrained, deprioritisation of ITN mass campaigns in certain settings is being increasingly considered through subnational tailoring of malaria control interventions. Our work highlights that the relationship between transmission intensity (whether measured in terms of prevalence or clinical cases) and intervention impact is non-linear, and notable resurgences in cases may follow when campaigns are deprioritised in all but very low transmission settings. This broadly supports WHO subnational tailoring guidance, which suggests consideration of deprioritising distribution of ITNs in regions with PfPR<sub>2-10</sub> < 1% (World Health Organization, 2025). However, while the World Health Organization (2025) Subnational Tailoring Reference Manual proposes that the withdrawal of ITNs in favour of indoor residual spraying should be considered in areas with low ITN use, here we estimate that ITN use alone appears to be a notably poorer predictor of the impact of ceasing mass campaigns than use given access. Our findings suggest that regions with higher use given access may experience disproportionately greater resurgences in cases following deprioritisation. This implies that regions with low use given access may warrant consideration for cessation of ITN distribution, rather than decisions being based solely on low overall ITN use irrespective of whether communities have sufficient ITN access. 
However, subnational differences in ITN use, access and retention are key knowledge gaps in many settings, and when estimated from infrequent surveys they are highly sensitive to bias arising from the timing of surveys relative to when campaigns were conducted. To our knowledge, this study is the first to estimate subnational variation in ITN retention and the first to estimate the duration that ITNs remain in use, which is of greater epidemiological relevance than retention time. It also provides a novel framework to correct for biases in estimates of ITN use and access arising from when campaigns were conducted. Although campaigns have historically aided increasing ITN use and access over time, we estimate the mean duration of ITN use is consistently shorter than mean retention times in all regions. This raises questions about whether punctuated distribution of ITNs through campaigns is the optimal mechanism for maximising their effectiveness and cost-effectiveness. Maximising the cost-effectiveness of interventions has become increasingly pertinent in the current funding context, and consideration of alternative distribution strategies, such as increased distribution through continuous distribution channels, including school- or community-based distribution, may be warranted. Frameworks such as the one presented here, which take into account the potential for impact from different net types and the high variability of ITN duration and use, could support NMP decision making on how best to maximise impact from available funds. Whilst such frameworks may be a useful tool, local knowledge of factors impacting ITN access and use as well as operational decision making will be paramount for NMP-led tailoring of subnational strategies.”

      (W1.3) On L236, the link to SNT is stated: "the models indicate trends that can support subnational tailoring of ITNs". They could indeed, but SNT itself is not done in this paper. It seems to be about improving sub-national predictions of the impact of single ITN strategies, by taking into account sub-national variation in access and use duration. This is useful, and the model developed has novel aspects.

      Thank you for highlighting this. We hope our updated title and response to W1.12 below help address this. Where relevant we have also framed our findings in relation to the World Health Organization’s Subnational tailoring of malaria strategies and interventions: reference manual, which was published following our original submission; examples of this are highlighted in our response above to W1.2.

      (W1.4) Individual countries may have records on when nets were distributed to the regions rather than needing to use the annual country number of nets together with the DHS data. It could be helpful to say what the analysis steps would be in that case.

      We have now added the following text to appendix 3.2 to clarify how the methodology could be adapted:

      “In contexts where national malaria programmes or other stakeholders have knowledge of the timings of mass campaigns (i.e. when there is no uncertainty in ɸ<sub>ij</sub>), the methodology can be adapted by deterministically evaluating the time since the last campaign (equation S18) for each time point.”
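
      As an illustration only (equation S18 is not reproduced here), the deterministic step described above can be sketched as follows. The function name and the example campaign months are hypothetical, and time is assumed to be measured in whole months.

```python
from bisect import bisect_right

def time_since_last_campaign(time_points, campaign_times):
    """Time elapsed since the most recent campaign at or before each time point.

    Returns None for time points that precede the first known campaign.
    """
    campaigns = sorted(campaign_times)
    elapsed = []
    for t in time_points:
        # Number of campaigns at or before t; the last of these is the most recent.
        k = bisect_right(campaigns, t)
        elapsed.append(t - campaigns[k - 1] if k > 0 else None)
    return elapsed

# Hypothetical campaign months (6 and 42) and evaluation months.
result = time_since_last_campaign([0, 6, 30, 48], [6, 42])
print(result)
```

      With known campaign dates this lookup replaces the inferred campaign timings; the rest of the fitting procedure is outside the scope of the sketch.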

      (W1.5) There were several assumptions that needed to be made in building the model. There is some validation of the timing of the distributions (L633 "verified where possible through discussion with interested parties nationally and internationally") and the fit of estimated access and use to survey data, and agreement between predictions of prevalence and MAP estimates. It would be helpful to say which assumptions are important for the results (and would be key knowledge gaps) and which would not make a difference. It might be possible to validate the net timing model using a country where net distributions are known reasonably well.

      Thank you for raising this. We acknowledge that to investigate which assumptions are less likely to make a meaningful difference, we would ideally have conducted a full sensitivity analysis on these. This however would be challenging, since many of these are structural assumptions rather than numerical ones (for example, the assumption of an exponential decay in use and access) which would require the entire methodology to be adapted to conduct a sensitivity analysis. We did validate our estimated campaign timings against some known subnational campaign timings for Senegal. However, we could not source data on when all campaigns were conducted for all regions of Senegal to the nearest month to be able to conduct validation against this. We were also not able to source other use and access data from separate data sources to the DHS to be able to validate our discrete-time models of historical use and access. PfPR2-10 estimates are however fitted to equivalent MAP estimates. These were validated against DHS estimates of PfPR6-59mo, which were not used at any stage to fit our models. We have made slight changes to the original wording in relation to this at the end of appendix 5.2.

      (W1.6) What was assumed about what happens to old nets after a mass campaign was not clear. This assumption is likely to affect the predictions of access for the biennial distributions.

      To generate our initial estimates of the mean duration of use and retention time with our hierarchical model, we assume nets are only distributed to individuals who do not already have ITNs (appendix 2). This initial step is necessary for our methodology, but is relaxed later under our discrete-time model where we assume ITNs are distributed at random such that individuals with an ITN are equally likely to receive a new ITN (and replace their existing one) following a mass campaign (appendix 4). Much of the aforementioned sections has been rewritten and we hope this is now clearer.

      (W1.7) L312 and elsewhere: That use given access declines with net age is plausible. However, I wondered if this could be partly a consequence of the assumptions in the model (eg the two exponential decays for access and use, the possible assumption that new nets displace the current ones when there is a mass campaign).

      Declining use given access as nets age is not affected by model assumptions. Because use and access are fitted independently of each other, there are no constraints that would prevent a faster decay in access than use. Had the data supported this, this would have led to use given access increasing over time since the last campaign. The data did not support this. Further clarification that use and access are fitted independently of each other has now been provided in the following text:

      “All subsequent analyses described are conducted independently for use and access”
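
      The argument can be illustrated with a minimal sketch, assuming use and access each follow an independent exponential decay whose mean duration is the reciprocal of its decay rate. All numerical values are hypothetical and are not estimates from the paper.

```python
import math

def use_given_access(t, use0, access0, mean_use_duration, mean_retention):
    """Ratio of two independently decaying quantities, t years after a campaign."""
    use = use0 * math.exp(-t / mean_use_duration)
    access = access0 * math.exp(-t / mean_retention)
    return use / access

years = (0, 1, 2, 3)
# Use wanes faster than access: use given access falls as nets age.
falling = [use_given_access(t, 0.4, 0.8, 1.5, 2.5) for t in years]
# Access wanes faster than use: use given access rises instead.
rising = [use_given_access(t, 0.4, 0.8, 2.5, 2.0) for t in years]
print(falling)
print(rising)
```

      Nothing in this structure fixes the direction of the trend; it is set by which of the two fitted durations is shorter.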

      (W1.8) The Methods section on Estimating historical use and access seemed to be aimed at readers familiar with formulae, but I think it could lose other interested readers. It could be useful to explain a little more about what is happening at each step and also why.

      Thank you for highlighting this. We have re-written this section in the main manuscript, now named ‘Historical use, access and retention times’, where we now only highlight key equations and provide a high-level overview of the methodological steps. We have sought to provide clearer explanations here behind the rationale for each step to ensure maximum accessibility for interested readers. The original wording was used as a basis for the newly provided series of appendices which provide further technical detail; this wording has also been heavily re-drafted to improve clarity of each step.

      (W1.9) The model was fitted to MAP estimates of PfPR2-10, which themselves come from a model. It may be that there is different uncertainty in the MAP estimates for different regions. I couldn't see this on the graph, but maybe the uncertainty is small. Was this taken into account in the fitting?

      We only used median MAP estimates of PfPR2-10 to calibrate the baseline EIR for each region in our model. We have clarified our rationale in appendix 5.2:

      “Since the relationship between baseline EIR and PfPR2-10 here is specific to malaria simulation, MAP uncertainty estimates were not propagated through to our estimates in baseline EIR since these would not faithfully represent its true uncertainty.”

      (W1.10) Was uncertainty from each estimated component integrated into the other components?

      Thank you for highlighting this, as it shows we had not made this sufficiently clear. To confirm, we propagate uncertainty in each component through to our estimates of cases averted. This is now clarified in the following text:

      “Region-specific uncertainty in ITN efficacy, use, retention, and the relative contributions of continuous and campaign channels is therefore propagated through to our estimates of cases averted.”

      Further details are also provided in the preceding text of the same paragraph. The central 95% credible intervals of cases averted shown in figures 5.C and 6 and associated figure supplements are reflective of this uncertainty.

      (W1.11) Eyeballing Figure 2 (Burkina Faso), there is a general pattern of decline in all the regions, some differences between the regions and some differences in how well the model fits between the regions. If possible, it could be helpful to say how much better the fit was when using region-specific compared to countrywide parameter values for access and use, and how different the results would be.

      In the “Universal coverage: was it achievable under triennial mass campaigns” results section, we have now provided further emphasis that the observed decrease from DHS data may be driven by surveys being conducted progressively later in relation to the last campaign:

      “The observed decrease in use and access across many regions in Burkina Faso may therefore be a by-product of DHS surveys being conducted at progressively later dates relative to the most recent campaign; this does not necessarily indicate an underlying trend in decreasing use or access over longer timescales.”

      In the case of Burkina Faso (figure 2.A), aside from months when very small numbers of individuals were surveyed and either 0% or 100% use or access was reported, no other data lie outside our 95% credible interval for any region.

      We are unable to generate comparisons with countrywide parameters as these are not generated when fitting our discrete-time model, even though they are a by-product of the initial hierarchical model used to generate initial estimates of region-specific ITN retention, which was a necessary methodological step. We hope the extensive revision of the text in the methods and appendices helps to improve the clarity on this. Where national estimates are provided, these are population-weighted means of the subnational median posterior estimates. New text is included in appendix 1 to clarify this:

      “National and continental values are reported as population-weighted summaries of the median subnational estimates generated from the discrete-time models”
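
      A minimal sketch of such a population-weighted summary is given below. The regional medians and population sizes are hypothetical placeholders, and the function name is illustrative rather than taken from the authors' code.

```python
def population_weighted_mean(medians, populations):
    """Weight each subnational median estimate by its region's population."""
    if len(medians) != len(populations):
        raise ValueError("one population is needed per regional estimate")
    total = sum(populations)
    return sum(m * p for m, p in zip(medians, populations)) / total

# Hypothetical posterior medians of ITN use and regional population sizes.
regional_median_use = [0.40, 0.60, 0.70]
regional_population = [1_000_000, 3_000_000, 1_000_000]
national_use = population_weighted_mean(regional_median_use, regional_population)
print(national_use)
```

      Because the summary is computed from point estimates rather than from joint posterior draws, it carries no credible interval of its own.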

      (W1.12) The question of moving from a campaign every three to every two years may not be the most pertinent question in the current funding landscape. I realise that a paper is in development for a long time, but it would be helpful to comment on what else the model could be used for when fewer rather than more nets are likely to be available.

      We acknowledge the funding landscape has changed substantially, but we still believe this work has important implications in the current context. We have emphasised this further in the following text:

      “If budget constraints necessitate the deprioritisation of campaigns, our results highlight that this should be avoided, if possible, in regions with moderate to high transmission intensity, particularly those with mean annual incidence exceeding 100–150 clinical cases per 1,000 people. Shortening campaign intervals from three to two years in moderate- and high-transmission regions is projected to avert more cases than the additional cases that may arise from ceasing campaigns in some lower-transmission settings. Additionally, although pyrethroid–chlorfenapyr ITNs are more costly, the additional cases projected to be averted by them relative to pyrethroid-only and pyrethroid–PBO ITNs are substantial. In certain national contexts it may be more cost-effective for biennial pyrethroid-chlorfenapyr campaigns to be conducted in fewer subnational regions even under reduced budgets. However, more thorough economic analyses will be needed to understand this fully. Moreover, as ITNs remain one of the most cost-effective malaria control interventions, improving the impact of them could still be more cost-effective than the introduction of new untested interventions (Topazian et al., 2023; Schmit et al., 2024).”

      We have also related some of our findings to the WHO Subnational Tailoring Reference Manual (as highlighted in W1.2), which we hope better relates our findings to the current context.

      Reviewer #2 (Public review):

      Summary:

      The authors design a custom Bayesian model to estimate the probabilities of access, use and use given access of insecticide-treated nets in six African countries, providing sub-national estimates and inferring the average duration of ITN use and access. An individual-based model was employed to simulate malaria epidemics and estimate the effectiveness of different ITN distribution strategies. The study finds that the mean probability of use or access did not reach 80% (a universal coverage formerly targeted by WHO) for any of the regions, even for biennial campaigns, demonstrates that switching from triennial to biennial distribution campaigns increases population use by 7.9%, and evaluates the impact of employing more efficient ITNs on P. falciparum prevalence.

      Strengths:

      The authors developed a data-driven model that accounts for data collection imperfections and sources of uncertainty while differentiating between ITN use and access. They developed a methodology to infer the timing of a mass campaign from publicly available data instead of assuming fixed dates. The probability of use given access allows for determining the regions where ITN distribution is least effective. This work can help better inform future interventions by identifying regions where increasing mass campaign frequency or employing better ITNs are most effective. Finally, in addition to insights on ITN access and use for the six countries analyzed, the paper contributes a methodological framework that can likely be extended to other countries.

      Weaknesses:

      Since the models employed are rather complex, the description of the methodology may be hard to follow for most readers. In addition, the models assume many hypotheses, including:

      (W2.1) Exponential decay of ITN use/access.

      We do acknowledge different modelling studies have typically assumed either an exponential decay or an “S-shaped” smooth-compact loss function, with many of these studies having been validated against cluster-randomised trial data for both functional forms. We believe the ITN age distribution data across the DHS surveys inspected provides reasonable evidence to support the use of an exponential decay function here. We have now included a proof (appendix 2.1) demonstrating an exponentially distributed ITN age distribution will be yielded for an exponential decay function with the same rate parameter; this is true under periodic ITN distribution and becomes an approximation for a finite number of surveys. We have now also included additional text (appendix 2.2) highlighting the empirical ITN age distributions appear to support our exponential decay assumption.
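
      The intuition behind this result can be checked numerically with a simplified sketch. It is not the proof in appendix 2.1: it assumes equally sized campaigns at a fixed interval, surveys spread evenly over the campaign cycle, and hypothetical parameter values, and it compares only the mean net age with the exponential mean (the reciprocal of the decay rate).

```python
import math

def mean_net_age(survey_phases, decay_rate, interval, n_campaigns=40):
    """Mean age of surviving nets, pooled over surveys at the given phases.

    Each campaign hands out the same number of nets, and a net of age a
    is still held with probability exp(-decay_rate * a).
    """
    weighted_age = 0.0
    weight = 0.0
    for phase in survey_phases:       # time since the latest campaign
        for k in range(n_campaigns):  # k campaigns before the latest one
            age = phase + k * interval
            surviving = math.exp(-decay_rate * age)
            weighted_age += age * surviving
            weight += surviving
    return weighted_age / weight

decay_rate, interval = 0.5, 3.0  # hypothetical: per year, and years between campaigns
# Many surveys spread evenly over the cycle: the pooled mean approaches 1 / decay_rate.
phases = [(i + 0.5) * interval / 300 for i in range(300)]
pooled_mean = mean_net_age(phases, decay_rate, interval)
# A single survey six months after a campaign sees a lumpier age profile.
single_mean = mean_net_age([0.5], decay_rate, interval)
print(pooled_mean, single_mean)
```

      The gap between the two values illustrates why the exponential age distribution is only an approximation when few surveys are available.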

      (W2.2) The decay rates for the probability of the ITN repelling and killing a mosquito are the same.

      Although the same decay rate parameter (γ<sub>N</sub>) is present in our expressions for the probability of repellency and mortality (equations (53) and (54)), the half-life of the latter is shorter, since repellency is assumed to decay towards a constant value. These structural forms are not unique to this paper but are shared among all malaria simulation-based studies with ITN interventions. This decay rate parameter has been estimated in previous studies (Sherrard-Smith et al., 2022; Churcher et al., 2024), and we carry through uncertainty estimates from those previous studies into the work presented here; additional text has been added to clarify this:

      “Uncertainty in ITN repellency and mortality parameters (equation (53) and (54)) is also propagated forward to this study by simulating random draws from previous posterior distributions (Sherrard-Smith et al., 2022; Churcher et al., 2024) across each distribution event and realisation.”
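
      The half-life point can be illustrated with simplified stand-in curves, which are not equations (53) and (54) themselves: a quantity that decays exponentially from a starting value towards a floor, with a floor of zero standing in for mortality and a positive floor standing in for repellency. The starting values, floors and decay rate below are hypothetical.

```python
import math

def time_to_halve(initial, floor, decay_rate):
    """Time for floor + (initial - floor) * exp(-decay_rate * t) to reach initial / 2."""
    if floor >= initial / 2:
        return math.inf  # the curve never falls to half of its starting value
    return math.log((initial - floor) / (initial / 2 - floor)) / decay_rate

decay_rate = math.log(2) / 2.0  # hypothetical shared rate, per year
# Decay towards zero: the time to halve is log(2) / decay_rate.
mortality_half_life = time_to_halve(0.5, 0.0, decay_rate)
# Decay towards a positive floor: the same rate gives a longer time to halve.
repellency_half_life = time_to_halve(0.6, 0.2, decay_rate)
print(mortality_half_life, repellency_half_life)
```

      Under these assumptions a shared rate parameter does not imply that the two probabilities wane at the same pace.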

      (W2.3) Given a time instant, all individuals in the same administrative unit have the same probability of using a net.

      Our discrete-time model estimates the proportion of the population with use and access at each time instant. We purposefully do not conflate this with the probability of use and access, which can vary between individuals within the same subnational unit of analysis (urban and rural regions of each administrative-one area). We are grateful this point has been raised as it indicates we had not communicated this sufficiently clearly before. We hope the extensive re-draft of the ‘Historical use, access and retention times’ methods section has helped address this, in particular in the following text preceding equation (7):

      “We do not assume the probability of access is the same for all individuals in a region at a given point in time. Instead, we assume the probability any given individual has access to an ITN at time t<sub>j</sub> can be described by a Beta distribution”
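
      The distinction between an individual's probability of access and the population proportion with access can be sketched as follows. The Beta shape parameters, population size and seed are hypothetical, and the sketch does not reproduce equation (7).

```python
import random

def simulate_access(alpha, beta, n_individuals, seed=0):
    """Draw a personal access probability per individual, then realise access."""
    rng = random.Random(seed)
    probabilities = [rng.betavariate(alpha, beta) for _ in range(n_individuals)]
    has_access = [rng.random() < p for p in probabilities]
    return probabilities, sum(has_access) / n_individuals

probabilities, proportion = simulate_access(alpha=3.0, beta=2.0, n_individuals=20000)
# Individual probabilities differ widely, yet the population proportion
# settles near the Beta mean alpha / (alpha + beta).
print(min(probabilities), max(probabilities), proportion)
```

      The population-level quantity is therefore well defined even though no two individuals need share the same probability.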

      (W2.4) ITN use/access decay models do not depend on the distribution strategy (e.g. biennial vs triennial distribution).

      We may not have fully understood this point, but in terms of our historical models of use and access, assumptions are not imposed on the frequency of previous campaigns. Instead, historical campaign timings are estimated from data from DHS surveys and the AMP Net Mapping Project (now detailed in appendix 3.1); historical estimated intervals could be either two or three years (or indeed any interval) as informed by this data. In terms of the duration of use and retention time, these are estimates of how long a net would continue to be used, or provide access, if an individual were not to replace it at an earlier date; these estimates are therefore independent of campaign intervals, and we have now added text to provide additional clarity:

      “However, throughout this study, the durations of use and retention time are always estimates of how long an individual continues to use or have access to a net in the absence of future replacement; estimates of these are therefore reflective of behaviour or ITN durability and not distribution patterns themselves.”

      We do acknowledge under our approach, use immediately following a campaign is agnostic of campaign frequency; however, given an absence of data on how use changes following a switch from triennial to biennial campaigns, we believe this was a reasonably conservative assumption. Further confirmation is now provided in the following text, with additional preceding context:

      “Future campaigns, whether conducted every two or three years, are therefore assumed to achieve a consistent initial level of use.”
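
      A simplified sketch of what this assumption implies is shown below. It assumes each campaign resets use to the same initial level and that use then decays exponentially until the next campaign, ignoring continuous distribution channels; the initial level and mean duration of use are hypothetical.

```python
import math

def mean_use_between_campaigns(initial_use, mean_duration, interval):
    """Time-averaged use over one campaign cycle under exponential decay.

    Averages initial_use * exp(-t / mean_duration) over t in [0, interval].
    """
    x = interval / mean_duration
    return initial_use * (1.0 - math.exp(-x)) / x

initial_use, mean_duration = 0.8, 2.0  # hypothetical: proportion, years
triennial = mean_use_between_campaigns(initial_use, mean_duration, 3.0)
biennial = mean_use_between_campaigns(initial_use, mean_duration, 2.0)
print(triennial, biennial)
```

      With the same starting level, the shorter interval raises average use only because less time is spent in the tail of the decay curve, and the average stays below the level achieved immediately after a campaign.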

      (W2.5) The Bayesian model assumes some narrow prior distributions.

      Thank you for highlighting this. We acknowledge the need for further justification for the choice of priors. We have provided this in depth for the hierarchical model of the mean duration of use and access (in appendix 2.2). Further justification for the choice of priors for the discrete-time model is also now provided (in appendix 4.2).

      The impact of these hypotheses on the estimated parameters is not explored in the paper, and no sensitivity analyses are performed, although some limitations are discussed.

      We fully acknowledge we had not conducted sensitivity analyses for many of our assumptions, and we have now tried to provide better justification for our assumptions. The assumptions most likely to influence inference are structural components of the modelling framework rather than scalar parameters that can be varied independently in a conventional sensitivity analysis. Many of the assumptions highlighted above are structural, such as the assumption of an exponential decay (W2.1). In the case of our assumption of exponential decay, multiple elements of the methodology are restricted by this (for example, when correcting for biases that arise from nets being lost between campaigns and survey times when estimating the timing of campaigns in appendix 3.1). Investigating the sensitivity of this assumption over an assumed smooth compact function would require extensive adaptation of the methodology that would be beyond the scope of this paper. Some other assumptions, such as the assumption of the same decay rate parameter for repellency and mortality (W2.2), have been estimated in the previous studies referenced and have been validated against cluster-randomised, controlled trials. We nevertheless recognise our justification of some assumptions could have been expanded upon previously, and we hope the changes highlighted above go towards addressing this.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (R1.1) I looked for the reference WHO 2024b for the recent optimal allocation guideline, but there were just three WHO 2024 references in the bibliography. In addition, what exactly the 80% rule applies to is not clear - this could be explained so it is clearer what result to compare to it (or explain that the rule itself is not clear).

      We have used the eLife LaTeX/BibTeX template for citations throughout and acknowledge this doesn’t show letter suffixes in the reference list for multiple author-year entries. We are unsure of how to address this given this is generated by the official template, though we note that when citations are clicked on in the document, the relevant citation is then shown at the top of the page on the web version.

      (R1.2) L24 'estimated', but this seems more like a prediction. The words 'estimated' and 'predicted' should be carefully used throughout when combining statistical and mechanistic modelling.

      This has now been changed.

      (R1.3) The point estimates should always have measures of uncertainty.

      The rationale for the omission of credible intervals for some point estimates has now been clarified in the manuscript (appendix 1). The following text has been added:

      “Additionally, in relation to uncertainty estimates, credible intervals are shown for all subnational quantities that are directly estimated in our models. National and continental values are reported as population-weighted summaries of the median subnational estimates generated from the discrete-time models (appendix 4) and therefore do not correspond to explicitly estimated model parameters, so credible intervals are not shown for these aggregated estimates.”

      (R1.4) It would be helpful to justify the choice of ADM1 as the geographical unit.

      We have clarified the rationale for this in the following text:

      “Here, (subnational) regions are defined as the first administrative unit below the country level and are further divided into rural and urban areas to align with DHS stratification”

      (R1.5) The terminology was slightly confusing: in some places, it sounded as if regions were the sub-national regions, in others as if they were different things (eg L74, L105). L45 'and' seems odd here.

      ‘Region’ is used interchangeably with ‘subnational region’ at points in the paper to aid the flow of the text. We hope the use of parentheses around (subnational) in the updated text quoted above (and in the following text) helps provide clarity:

      “here, the units of analysis are consistently referred to as (subnational) regions”

      (R1.6) Spurious accuracy in some estimates, e.g. L52.

      This was a result cited from Bertozzi-Villa et al. (2021) for which uncertainty estimates were not available. We hope the response to R1.3 above helps clarify the rationale for omitting credible intervals for some estimates generated here.

      (R1.7) L68 'lose' instead of 'loose'.

      Now corrected.

      (R1.8) L534. I suspect that the model was actually fitted in Stan via the R interface rstan.

      Language adjusted accordingly.

      (R1.9) L633 'through' rather than 'though'.

      This section has been heavily redrafted and we have checked for typos.

      Reviewer #2 (Recommendations for the authors):

      The paper is well-written and presents an important contribution to better aid interventions. The proposed models are reasonable, but because of their complexity, even readers who work with epidemic modelling might have issues understanding the methodology.

      We thank the reviewer for highlighting that the methodology may be difficult to follow. The methods section has now been substantially rewritten to provide a clearer conceptual description of the modelling framework, with detailed model specification and derivations moved to the appendices. We hope this restructuring will allow readers to follow the modelling approach at a high level in the main text with technical details contained in the appendices.

      To improve the clarity of the methods section, I suggest:

      (R2.1) Include a list of symbols with the meaning of each variable defined in the text.

      Definitions for symbols are now also shown in appendix 1 – tables 1-5.

      (R2.2) Include a centralized full description of each model, clearly stating the priors and likelihood (similarly to a Stan code).

      There are two models that are fitted with Stan (the hierarchical retention model and discrete-time use/access model). To improve clarity for the hierarchical model, priors are now presented in a single block (equations 11 – 17) in appendix 2.2, with the likelihood (equation 18). For the discrete-time model, we have split the presentation of the priors (equations 37 – 42) and the likelihood expressions (equations 43 – 45) into different subsections (respectively appendices 4.2 and 4.3).

      (R2.3) If needed, include additional data preprocessing in the form of an algorithm.

      Although we have not included an algorithm outlining the preprocessing steps, we have ensured sufficient detail has been provided to facilitate replicability. For example, in appendix 1, we now outline how use and access are inferred from DHS data:

      “ITN use is inferred from DHS data (ICF, 2025) on whether individuals slept under an ITN the previous night, while all individuals who used an ITN are assumed to have access; when fewer than two individuals used an ITN, the ITN is assumed to be able to provide access at random to up to two individuals in a household.”

      (R2.4) Mention the main hypotheses and limitations of the model in the main text.

      We have ensured key assumptions of the model are stated in the re-written ‘Historical use, access and retention times’ methods subsection; for example, in the following text:

      “Due to the sparsity and irregularity of DHS and MIS surveys, we were unable to investigate seasonal fluctuations in either access or use; we therefore assume that nets provide access or are used continuously over some period of time.”

      (R2.5) Including a flowchart or diagram that provides an overview of the proposed framework could be helpful.

      We have now included a flowchart of methodological steps in appendix 1 – figure 1.

      (R2.6) Line 89: Define NMP before presenting the acronym.

      We have ensured this is defined in the first instance on line 39.

      (R2.7) Equation (1): Explain why you chose the Exponential distribution (e.g. constant hazard), as this is one of the main hypotheses of the model.

      As highlighted in our response to W2.1, we have now included justification of this assumption in the final paragraph of appendix 2.2.

      (R2.8) Equation (2): Although Equation (2) passes a clear message of how alpha_i^x is distributed, I wonder if it is mathematically correct to express the limit this way, since the argument of the limit is a random variable. Maybe the limit should be applied to gamma_i^x instead.

      Thank you for highlighting this. We acknowledge that the limit behaviour was expressed in a shorthand manner that is not strictly mathematically correct. Indeed, the limit should be applied to the decay rate parameter gamma (now shown in equation 10). In appendix 2.1, we have now provided a proof demonstrating that the rate parameter of the pooled ITN age distribution should tend to the same decay rate as the assumed exponential loss function.

      (R2.9) I think the difference between pho_i^x (Equation (1)) and alpha_i^x (Equation (2)) is not very clear in the text.

      In the context of access, rho_{i(l)} and alpha_{i(l)} are respectively the duration an ITN l is retained for and its age at the time of a survey. We hope the redrafted appendices make this clearer, in addition to the inclusion of the new parameter tables in appendix 1.

      (R2.10) Line 479: Typo (and or).

      Updated wording is now contained in appendix 2.

      (R2.11) Line 711: Typo (The limit is equal to infinity).

      This has now been corrected.

      (R2.12) Equation (15): I could not understand this equation. What is rho(s) and rho(s \in I), where I is one of the intervals mentioned in this equation?

      Rho(tau_ik) was introduced as simplified notation for the probability density of the timing of campaign k in region i (tau_ik), but we acknowledge this was not explained clearly. We also acknowledge that this equation presented many concepts at once. The equation attempted to describe the probability density of the last campaign in region i relative to time t_j, denoted phi_ij. We no longer use the previous notation (rho) for the probability density. This equation has been updated to equation (30), with an incremental explanation of its construction now provided in appendix 3.2.

      (R2.13) Line 642: What is t?

      The notation $t_j \ni t$ was previously used to indicate that the discrete time point t_j lies within continuous time t. We acknowledge this was a non-standard use of notation and was not clearly explained. This section (now in appendix 4) has been rewritten without this notation. The use of t and t_j to denote continuous time and discrete time points respectively is now defined in the core notation table (appendix 1 – table 1).

      (R2.14) The proposed model has narrow hyperhyperpriors because of convergence issues. Are the estimated parameters sensitive to the choice of hyperhyperpriors?

      We acknowledge limited justification was previously provided for the choice of hyperhyperpriors. We have now provided additional justification within appendix 2.2.

      (R2.15) Since the proposed Bayesian models are relatively complex, it might be useful to provide convergence diagnostic plots in the supplement.

      Convergence diagnostics were inspected using the ShinyStan package. Chains showed satisfactory convergence based on standard diagnostics. We have not included diagnostic plots due to the large number of parameters in the fitted models. Under the hierarchical model (appendix 2) for ITN use, 146 region-specific parameters (one for each region), 12 country-level hyperparameters (two for each country), and four hyperhyperparameters were estimated. Under the discrete-time model (appendix 4), a further 876 parameters (six for each region) were estimated. In total, 1,038 parameters were fitted for the ITN use models. The same number of parameters were estimated for the ITN access models, giving a total of 2,076 estimated parameters.
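
      The parameter totals quoted above can be checked with a short tally. This sketch uses only the counts stated in the response. The number of countries (six) is not stated directly and is inferred here from 12 hyperparameters at two per country.

```python
# Counts taken from the response; n_countries is inferred, not stated.
n_regions = 146            # one region-specific parameter per region
n_countries = 12 // 2      # 12 country-level hyperparameters, two per country
n_hyperhyper = 4           # hyperhyperparameters

hierarchical = n_regions + 2 * n_countries + n_hyperhyper  # hierarchical model
discrete_time = 6 * n_regions                              # six parameters per region
per_outcome = hierarchical + discrete_time                 # ITN use (or access) alone
total = 2 * per_outcome                                    # use and access together

print(hierarchical, discrete_time, per_outcome, total)
```

      The tally reproduces the 876, 1,038 and 2,076 figures given in the response.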

    1. eLife Assessment

      In this manuscript, the authors investigate programmed DNA elimination (PDE) across nematodes using a large-scale cytological approach. This work is potentially significant because it expands PDE beyond a few known nematodes to a much broader set of Rhabditidae species, providing an important resource for investigating PDE's evolutionary origins and functions. The strength of evidence, however, is incomplete; the technique used to evaluate PDE is insufficient to provide unambiguous support for the phenomenon, so additional methods, such as genomic sequencing from a few species spanning the range of elimination levels, would be required to confirm these findings. This research would be of interest to geneticists, evolutionary biologists, and those working on the regulation of genome integrity.

    2. Reviewer #1 (Public review):

      Summary:

      Launay et al. conducted a cytological screen for PDE in 25 new Rhabditidae species and detected PDE in 17 of the 25 species, representing 12 of the 17 genera within the family. This work is significant because it expands PDE from a few known nematodes to a much broader set of Rhabditidae species.

      Strengths:

      By demonstrating PDE across many genera with the exception of C. elegans and some other Caenorhabditis species, the study provides an important resource for investigating PDE's evolutionary origins, mechanisms of genome reorganization and DNA repair, and its functional consequences.

      Most of the observed PDEs were supported by solid evidence through a survey-style cytological screen (PDE detected in 17/25 species and in 12/17 genera), which supports the main claim of widespread occurrence.

      Weaknesses:

      Although most PDE claims are supported by solid evidence, some of the existing data do not describe the depth of characterization, e.g., how many replicates were conducted for each species? How reproducible are the claimed PDEs between embryos in terms of timing and cell identities destined for PDE? Is it possible to validate a subset of PDE with independent evidence, especially for those with marginal PDE? This is important because some dying embryos may fail to maintain their chromosome integrity and release some of the broken DNAs, some others may suffer from noise such as intracellular parasites, for example, microsporidia, or even highly condensed mitochondrial DNAs.

    3. Reviewer #2 (Public review):

      Summary:

      Programmed DNA elimination is increasingly recognised as an important phenomenon across many species, including in animals. Exactly how widespread is still unclear, and the function of PDE is even more mysterious in most species where it has been described. PDE has been discovered in several nematode species, and in this manuscript, the authors carry out a more extensive search for PDE. They find PDE in many species, indicating that it is widespread across the phylum.

      Strengths:

      The large number of species across many different clades provides good evidence that the phenomenon has evolved many times independently. The work will therefore prompt many further studies characterising individual species, and potentially linking the evolution of the phenomenon to other features of these species' ecological characteristics.

      Weaknesses:

      The major technical weakness of this project is the assay that is used to evaluate PDE. First, this assay is clearly insensitive: as the authors acknowledge, O. tipulae, which has PDE, does not appear in their screen. Second, the assay gives no information about breakpoints and only limited, non-quantitative information about how much DNA is eliminated. Thus, their data really are only a preliminary screen, which would need to be confirmed by genomic assays.

    4. Reviewer #3 (Public review):

      Summary:

      Somatic programmed DNA elimination (PDE), also known as chromatin diminution, has primarily been studied in parasitic nematodes, such as Ascaris species, in which it was discovered almost 140 years ago. Recently, PDE has also been reported in three non-parasitic nematode species. In this manuscript, Launay et al. present the results of a large-scale cytological and evolutionary study of PDE across 29 free-living nematode species belonging to the Rhabditidae family, for which they established a phylogeny based on 18S and 28S ribosomal RNA sequences. By combining DNA staining and telomere DNA FISH labeling in developing embryos, they convincingly document the formation of lagging fragments and/or the loss of long germline telomeres in 17 species, during one particular division of somatic precursor cells.

      Strengths:

      (1) The whole study is well executed, and the results are convincing.

      (2) The authors present compelling evidence that PDE is an ancestral feature of Rhabditidae nematodes.

      (3) This study provides a valuable resource of lab-tractable species for future PDE studies.

      Weaknesses:

      (1) Some clarifications are necessary to make the figures more reader-friendly.

      (2) Important references to ciliates are missing.

    5. Author response:

      We thank you and the three reviewers for their careful examination and critical assessment of our work.

      All acknowledge the significance of revealing the widespread occurrence of programmed DNA elimination (PDE) in nematodes, a phenomenon long considered a parasitic specificity. The reviewers, particularly Reviewer #2 and the Editors, have raised important concerns regarding confirming PDE with more sensitive methods, in particular using genomic data to characterize breaksite motifs across the phylogeny and to better understand the amount and nature of eliminated sequences across species. While we fully agree that such confirmation would ideally complement our discovery, this approach extends beyond the scope of the current manuscript. Our primary aim was to inform the scientific community of the widespread occurrence of PDE in the short term.

      In the longer term, an ambitious collaborative effort is currently underway to produce high-quality genome assemblies of several hundred nematode species (ENA: PRJEB36817), covering the diversity of Rhabditina and beyond. These assemblies will enable precise characterisation of PDE, ultimately addressing these concerns. However, given the scale of this project, which aims at telomere-to-telomere assemblies that can be particularly challenging for species that perform PDE, it will take considerable time. We believe the community should be informed of the widespread nature of PDE now, rather than waiting for these genomic data.

      Nevertheless, we would like to emphasize that PDE has already been confirmed using genomics in the three clades where we have identified it cytologically: through our own work in Mesorhabditis (1) and Letcher et al., in prep, and also in Caenorhabditis (2) and Oscheius (3, 4). We will state this explicitly in our revision.

      For these reasons, and to avoid overstepping extensive genomic studies that are underway, we will maintain our focus on the cytological description in this manuscript.

      In addition to the above-mentioned concern, we will also address the other points:

      Reviewer #1:

      “Although most PDE claims are supported by solid evidence, some of the existing data do not describe the depth of characterization, e.g., how many replicates were conducted for each species? How reproducible are the claimed PDEs between embryos in terms of timing and cell identities destined for PDE? Is it possible to validate a subset of PDE with independent evidence, especially for those with marginal PDE? This is important because some dying embryos may fail to maintain their chromosome integrity and release some of the broken DNA, some others may suffer from noise such as intracellular parasites, for example, microsporidia, or even highly condensed mitochondrial DNA.”

      We will provide the missing information on the number of embryos observed (using DNA staining or DNA FISH), and better explain and illustrate why the observed fragments cannot be attributed to intracellular parasites or to dying embryos.

      Reviewer #3:

      Some clarifications are necessary to make the figures more reader-friendly.

      This will be improved. Thank you for pointing this out.

      Important references to ciliates are missing.

      Thank you for pointing this out. We will improve the comparisons that can be made with the mechanism of PDE found in ciliates.

      References

      (1) C. Rey, C. Launay, E. Wenger, M. Delattre, Programmed DNA elimination in Mesorhabditis nematodes. Curr Biol 33, 3711-3721.e5 (2023).

      (2) L. Stevens, S. Sun, N. Haruta, L. Xiao, N. Uwatoko, M. Kieninger, K. Sato, A. Yoshida, D. Absolon, J. Collins, A. Sugimoto, T. Kikuchi, M. Blaxter, Programmed DNA elimination was present in the last common ancestor of Caenorhabditis nematodes. bioRxiv [Preprint] (2025). https://doi.org/10.1101/2025.10.23.681605.

      (3) T. C. Dockendorff, B. Estrem, J. Reed, J. R. Simmons, S. B. Zadegan, M. V. Zagoskin, V. Terta, E. Villalobos, E. M. Seaberry, J. Wang, The nematode Oscheius tipulae as a genetic model for programmed DNA elimination. Curr Biol 32, 5083-5098.e6 (2022).

      (4) P. M. Gonzalez de la Rosa, M. Thomson, U. Trivedi, A. Tracey, S. Tandonnet, M. Blaxter, A telomere-to-telomere assembly of Oscheius tipulae and the evolution of rhabditid nematode chromosomes. G3 (Bethesda) 11, jkaa020 (2021).

    1. eLife Assessment

      This is a fundamental study of individual variation and the contribution of learning to behavioural individuality. The experimental design of massively parallel behavioural phenotypes is outstanding and the conclusions are supported by a compelling and rigorous analysis across a large number of experiments in thousands of individuals across genotypes and conditions. The dataset further represents an advance in studying visual associative learning thanks to the ability to make longitudinal measurements of many behavioural decisions within the same animals. These results are a major contribution to the understanding of the sources of behavioural individuality.

    2. Reviewer #1 (Public review):

      "Learning is a fundamental source of individuality," by Manna and colleagues, interrogates different sources of variation in individual behavior. The authors place individual flies in a Y-shaped arena, which is a common design in the field, and illuminate the arms of the Y with blue versus green light. They track the color preference of individual animals and also perform operant conditioning, meaning that they teach the fly to avoid a particular color/arm by generating a foot shock when the fly enters that arm. There are a number of things that are impressive about this setup: The authors are able to collect data on thousands of individual flies of many different strain backgrounds, and they demonstrate a strong change in color preference after conditioning. This is nice, because in past papers, visual learning ability has been modest and difficult to study. To put a number on it, in this paper, animals on average don't show a color preference at the start of the assay, spending around 30% of their time in the one arm illuminated green, and the remaining time in the two arms illuminated blue. After conditioning, the average animal spends only 23% of its time in the green arm.

      The authors run 64 animals through the assay for each of 88 wild-type strains (maybe? see Major Point 1 below) and see considerable strain-specific (genetic) variation in the change in time spent in the shocked color after conditioning. Some strains show no learning, while others spend <10% of their time in the shocked color after conditioning. They also, I believe, see that some strains have more variability across individuals, which would suggest that some strains have stronger canalization at the development or circuit function level than others, i.e., some genotypes produce more consistent copies of the individual, others less consistent copies. (Or, some genotypes produce robust circuits, and others produce noisy circuits.)

      Finally, the authors argue statistically that learning itself increases variability in individual performance. This makes a lot of sense to me intuitively. Learning changes the physical/chemical properties of circuits in the brain, and because it evolves over time and interacts with environmental variables, it seems like it should send different animals down different channels. Or, at a conceptual level, if I learn to play the piano and my sister doesn't (because of some genetic difference between us or something stochastic), this learning experience will cause all sorts of other differences in our behavior as time passes. I also think the authors do have enough data to be able to make this finding. However, the presentation of the argument in this portion of the paper is hard for me to understand, and I am not an expert in statistics, so the strength of the result is difficult for me to evaluate.

      Major points

      (1) It's difficult to track through the paper the number of animals tested for different assays. At the beginning, it says N=5632, which works out to 64 flies for each of the 88 DGRP strains. 64 happens to be the number of parallel Y arenas they have. Later in the methods, there's a description of more variation within the set of 64 for each strain, two different parent sets per strain, different sexes, conditioned and unconditioned. And, while the results text focuses on the color learning, the methods discuss additional assays (place learning, multi-day learning).

      Given the numbers, does each run of the 64 mazes include all the tested flies of one strain, or are flies of many strains included in each batch? Do different flies do different assays (color, place, multi-day), or do they all do all the assays? Perhaps there is a table including this information already in the supplement, but I recommend making it much clearer in the main results text and methods. While the dataset is large, if it is split over many conditions and/or if batch and genotype confound each other, this will affect the robustness of the results and how strong the conclusions can be.

      (2) The data presentation in Figure 1 is elegant and easy to follow, but getting into Figure 2 and subsequently, I get lost in the statistics and have trouble understanding what is being measured. My understanding of the big picture is that while genetics and individual randomness contribute a lot to behavior, the evidence for learning as an amplifier of individuality is that variance in behavior among animals of the same strain increases over time in the conditioned group (i.e., the group that is doing the most learning, or a specific kind of learning), but not in the control group. This idea is illustrated in the flattening distributions in the cartoons in Figure 1A. The authors should include graphs of the real data that use the same format as in that cartoon. Instead, the graphs present "residuals," and I don't know what those are. I suspect it's "variation left over after accounting for effects of strain and individual stochasticity." I see the residuals being tracked per strain over time in Figure 2H, but I don't see the change over time in other graphs. I'm looking for something simple like, "variation within the strain at the beginning of learning and at later time points in learning." (But I'm not sure exactly what instantaneous measurement would be the focus in longitudinal analyses of learning behavior.)

      (3) Figure 3 is a cool stab at tracking down the precise mechanism by which a stochastic environment interacts with learning to send individuals along different behavioral routes. But again, like in Figure 2, I don't have the sophisticated understanding of statistics to understand exactly what the graphs are telling me, or how they relate to the underlying measurements. I'm relying on the results text alone to reach a conceptual understanding, and just taking the graphs on trust.

      So, overall, the authors have a very nice body of work here, and with the potential to add a new facet to our understanding of the origins of diversity in animal behavior. In addition to the interpretations they focus on here, this dataset also represents an advance in studying visual associative learning in general, and quite an amazing ability to make longitudinal measurements of many behavioral decisions within the same animals. Improving the data presentation to make it easier to follow for a larger swathe of researchers, especially in figures 2 and 3, will increase its potential impact.

    3. Reviewer #2 (Public review):

      Summary:

      The authors set out to test the extent to which differences in learning capacity and experience contribute to behavioural variation in a genetically identical population under identical environmental conditions.

      Strengths:

      The authors developed and used a scaled-up version of a simple two-choice behavioural paradigm, allowing them to test thousands of individuals across multiple genotypes. They then deployed clever and powerful statistical analysis methods and provided compelling evidence for a role of variability in learning in the expression of behavioural variation.

      Weaknesses:

      There are no major weaknesses, although some level of longitudinal analysis to strengthen the evidence for a strict definition of individuality would be a welcome extension of a future study. In addition, it would have been very interesting, although understandably beyond the current scope, to delineate a potential source of learning variability in the brain.

    4. Author response:

      Reviewer 1:

      Clarification of sample sizes, assay structure, and experimental design.

      Reviewer 1 noted that the number of animals tested across strains, assays, sexes, parent sets, conditioned and unconditioned groups, and longitudinal conditions is difficult to track through the manuscript. Given the extent of the experimental and data processing procedures such as filtering for inactive or injured flies, we agree that a summary table and/or a visual schematic of the full experimental setup would be helpful.

      Importantly, the vast majority of individuals were used for the main experiment, in which we conditioned the flies to avoid the green arm and the colors of the arms were fixed throughout the assay. A smaller number of flies were tested in the validation experiments (such as different types of conditioning). In each experiment, 64 flies were always set up per genotype and their behaviour was measured in parallel. Usually, around 60 flies passed the filtering step before analysis (flies were removed for inactivity or injury). Among those roughly 60 flies per genotype, the distribution of flies of different sexes and of flies raised in different replicate vials was balanced. Different individual flies were tested across different assays, except in the multiday experiment, where each individual was tested across four different assays.

      We will add a supplementary summary that includes how many flies were tested across assays, how individuals, males, females, replicates and genotypes were distributed across batches (and, in the multiday experiments, how they were distributed across experiments), and how many flies were filtered out from the final analysis.

      Clearer presentation of the statistical argument that learning amplifies individuality.

      Reviewer 1 also noted that the presentation of the statistical analyses, particularly in Figure 2, was difficult to follow (e.g. what is residual individuality, how is it tracked over time, and why not replace it with something simpler like variance?).

      Our experimental design combines multiple, replicated environments and genotypes. For example, genetically identical flies from genotype A are raised under identical developmental environments that are replicated in two vials. The same is true for genotype B. Individuals from both genotypes are then tested under different conditions, i.e. control and conditioned.

      As we saw, combinations of these factors can change both the means and the variances of distributions of individual behaviours in a genotype- or environment-specific manner. Normally, variance would be a good estimate of expressed individuality within a genotype, and a comparison of variances would be a good estimate of the change in individuality due to some factor (genotype, replicate, or type of conditioning).

      However, we saw that the shape of the resulting distributions in these experiments was incompatible with the classical definition of the extent of individuality as measured by variance. While it would be more intuitive to track variance over time, we found that this measure obscures some obvious changes in the normal shape of the distributions of individual behaviours, as can be visually observed, for example, between conditioned and control experiments. This is why we moved to develop the measure of residual individuality. Residual individuality aims to measure exactly the dimension of individuality that is missed by measures of variance. We will add a schematic presentation of residual individuality in Figure 2 to explain more explicitly and visually what is being measured here, and what residual individuality represents. This should shed more light on how these analyses support the conclusion that learning increases behavioural variability among individuals in both Figure 2 and Figure 3. The schematic should provide more intuition on how to interpret the data for those unfamiliar with some of the statistics. Besides the schematics, we will also add more intuitive visualizations of the behaviour data in the supplement, including representations of how within-strain distributions of behaviour change before and during learning, or in the control condition, for all strains, so that the reader may inspect them in more detail.
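
      The general point that variance can miss a change in distribution shape can be illustrated with a small sketch. This is not the authors' residual individuality measure, whose definition is given in the manuscript. It uses two synthetic samples and kurtosis as a stand-in shape descriptor, both chosen here purely for illustration.

```python
import math
import statistics

# Two synthetic samples of a behavioural score (illustrative values only).
# Both are centred on 0 with variance 1 but have clearly different shapes.
bimodal = [-1.0] * 50 + [1.0] * 50
root2 = math.sqrt(2)
peaked = [0.0] * 50 + [-root2] * 25 + [root2] * 25


def kurtosis(values):
    """Fourth standardised moment (population form), a simple shape descriptor."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    fourth_moment = statistics.fmean([(v - mean) ** 4 for v in values])
    return fourth_moment / variance ** 2


for name, sample in (("bimodal", bimodal), ("peaked", peaked)):
    print(name, round(statistics.pvariance(sample), 6), round(kurtosis(sample), 6))
```

      A comparison of variances alone would treat these two samples as equally variable, whereas the shape descriptor separates them.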

      Improved explanation of Figure 3 and the link between statistical outputs and behavioural measurements.

      Reviewer 1 also noted that the analyses in Figure 3 are difficult to interpret without relying heavily on the Results text. We hope that the schematic to be added to Figure 2, which will also explain what Divergence represents, addresses this note and makes Figure 3 easier to interpret. Indeed, upon reflection, we agree that the label “Divergence” is quite vague. “Divergence” in fact again shows residual individuality, and how it changes with every decision made, in the case where we compare distributions of flies that start in the green versus the blue arm. We further subset the distributions by clustering flies that share the same individual initial color bias or a similar learning score, and measure residual individuality for these subsets as well. Here, a value of 0 means the two distributions have the same shape, and higher values mean the shapes differ more. We will rename Divergence to “Residual individuality Start” to make it clear that this is conceptually the same type of measurement, and revise the figure legends accordingly so that they match the new schematic in Figure 2. This should clarify what the figures show. We will also add a schematic depicting how a change in the shape of the distribution with each decision can affect residual individuality.

      Reviewer 2:

      Clarification of the term “deterministic” when referring to genetic sources of variation.

      Reviewer 2 noted that describing genotype as a deterministic source of variation could be confusing, since gene expression and downstream cellular phenotypes are themselves noisy and stochastic. Indeed, gene expression as a phenotype is noisy, but also at the core it is a result of G x E (albeit the environment at the molecular scale). What we meant to emphasize here is that an individual’s genotype can be considered a fixed variable that determines phenotype expression across environments. The environment also determines the phenotype, again, in concert with genotype, but it will always vary over time. We agree with the reviewer that the wording should be made stricter to avoid confusion.

      We changed this sentence from “In every individual, behaviour is shaped by deterministic, genetic factors and by environmental events throughout lifetime, which may be stochastic and can occur at the molecular, cellular, organismal and even population scales.” to “In every individual, behaviour is shaped by fixed genetic factors and by variable environmental events throughout lifetime, which may be stochastic and can occur at the molecular, cellular, organismal and even population scales.” 

      Longitudinal analysis and neural sources of learning variability.

      Reviewer 2 suggested that additional longitudinal analysis could further strengthen the evidence for individuality, and that identifying neural sources of learning variability would be an interesting future direction. We appreciate these suggestions and very much agree with them. However, as the reviewer pointed out, this was beyond the scope of this study. Nonetheless, it may be worth noting that we have already started this (ongoing and quite extensive) experimental endeavour to identify neural sources of individuality, which we hope will soon be available as a follow-up study.

      Within the current study we were able to track behaviour longitudinally within a 20-minute experiment, and in one case over multiple days, though only for a smaller subset of flies. Broader conclusions on how behaviour would change over longer timeframes (except those already included in the manuscript) could not be drawn from the current dataset. We have added a figure in the supplement where the reader can visually explore the temporal changes to the distributions of behaviour. A more extensive study of how individuality evolves over longer time frames is indeed planned for the future.

      We thank the reviewers again for their thoughtful and constructive comments. We believe that addressing these points improved the manuscript.

    1. eLife Assessment

      This paper presents a valuable methodology for genetic manipulation of Blastocystis. Although some imaging data are compelling, higher-quality figures together with more rigorous biochemical assays would strengthen support for the authors' claims. With the experimental evidence and graphics improved, the study would be of interest both to researchers investigating mitochondrial evolution under anaerobic conditions and to medical biologists studying human pathogens.

    2. Reviewer #1 (Public review):

      Summary:

      This paper presents a toolkit for the transformation of Blastocystis. The authors have screened a number of selectable agents, promoters and reporter genes and present their findings. This resource will be of immense use to those in the Blastocystis field, as well as those seeking to establish transformation tools in other species where such tools do not yet exist. Establishing new transformation tools is extremely challenging, and the authors have done an excellent job.

      Strengths:

      The authors have carried out a systematic screen of promoters, reporter genes and selectable agents. They have screened numerous candidates for each, and all the data are presented. It is good to see when things did not work as well as when things did, so this data set is extremely useful indeed.

      Weaknesses:

      The findings are reported by reporter gene assay (microscopy). No evidence is given using genetics. The authors claim that the DNA is maintained episomally. However, could it be possible that there is integration? No PCRs/RT-PCRs are shown (although it can safely be assumed that the DNA/RNA is present where the transformation was successful), nor are any Western blots. These would have been useful to show that the P2A ribosomal skipping had occurred, and that proteins were expressed individually rather than as a polyprotein.

    3. Reviewer #2 (Public review):

      This manuscript presents a substantial technical advance for the genetic manipulation of Blastocystis by establishing an integrated workflow for stable episomal transgenesis, antibiotic selection, clonal recovery, and reporter-based imaging in the ST7-B subtype. The study is particularly valuable because it combines multiple previously fragmented approaches into a coherent and practically applicable toolkit, including endogenous regulatory elements, optimized electroporation conditions, selectable markers, and anaerobic-compatible fluorescent reporters. This methodological work greatly expands the molecular toolbox, and future studies focused on both basic and infection biology can now build on the ability to express and localize proteins in fixed as well as live cells.

      The microscopy data are convincing and clearly demonstrate functional reporter expression and successful recovery of stable transgenic lines. Nevertheless, because this is primarily a methodological paper, the study would be further strengthened by the inclusion of Western blot validation of reporter expression and bicistronic constructs. In particular, biochemical analysis of the P2A-containing constructs would help assess the efficiency of ribosomal skipping and exclude the possible presence of uncleaved fusion proteins, thereby providing stronger support for the interpretation of the imaging data and the functionality of the expression system.

    4. Reviewer #3 (Public review):

      Summary:

      The primary objective of this study was to establish a practical and functional framework for the propagation of stable transgenic cell lines of Blastocystis, a common animal gut microeukaryote. Although the work focused on Blastocystis ST7-B, a subtype with relatively low prevalence in humans, this choice is justified by its association with more frequent negative health effects. Beyond their relevance to the medical field, the methodological advances described here have the potential to also expand cell biology studies of this anaerobic organism, including its unusual mitochondria and redox metabolism.

      Strengths:

      Prior to this work, genetic tools for Blastocystis were very limited, relying on a single strong promoter-terminator combination. The authors successfully expanded the available promoter set across a range of expression strengths by testing two dozen variants in luciferase-based assays. Critically, they developed an integrated workflow from a modular transgenic construct design, to an expanded inventory of molecular components (promoters, reporters), optimized DNA delivery, stepwise antibiotic resistance-mediated clonal selection and propagation, and to reporter validation. The evaluation of several anaerobiosis-compatible labeling strategies for live (and fixed) cell optical imaging will be particularly useful, with the SNAP-tag system appearing especially promising for Blastocystis.

      Weaknesses:

      The presented data generally provide solid support for the conclusions that the work reached, but clarification of reasoning and several inconsistencies, as well as amendments to the visual presentation of the data, would be highly beneficial, as detailed below.

      (1) Episomal persistence of the construct:
      The manuscript repeatedly assumes, including in its title, that constructs persist in Blastocystis in their episomal form, but no direct evidence is provided. Although this interpretation is plausible, it should be identified more clearly as provisional. Nuclear genomic integration (e.g., via NHEJ) remains a possible explanation unless supporting evidence or rationale is provided to exclude it. Testing whether the phenotype persists without drug-mediated selection in the generated transgenic cell lines would help strengthen the case for episomal maintenance.

      (2) Promoters and terminators:
      2.1) There is a discrepancy between the claimed number of loci (14), from which promoters used to drive luciferase expression were derived, and those detailed as having been actually generated in Table 1 (11). This inconsistency should be corrected or explained, as it creates uncertainty around the accuracy of the dataset.
      2.2) Based on the presented evidence, constructs benchmarked in bioluminescence assays differed only in their promoter composition. Although terminator selection is mentioned in the Methods section, no additional details are provided; for instance, Table 1 and Figure 2 only list 23 promoters in total. Figure 2A likewise shows only promoter-dependent variation. If the terminator was held constant (LeguP1?), this should be stated explicitly. The authors may then consider revising the wording of having tested "23 promoter-terminator pairs" to better reflect that only promoters varied.
      2.3) Promoter benchmarking was done with a plasmid lacking a selection marker, so it is unclear how the maintenance of the luciferase construct was ensured. Without selection, the observed reporter intensity could reflect differential or stochastic plasmid retention rather than promoter strength alone. The luminescence assay was performed 16-18 hours after transfection, but the rationale for this particular timeframe should be explained. In this context, the authors should explicitly state whether the experiments shown in Fig.2A represent biological triplicates or technical triplicates from a single transfection.

      (3) Figure 2:
      3.1) Several aspects of the current design may lead to ambiguity for the reader. The boxplots are colour-coded, but it is unclear whether the colours carry meaning or are purely decorative. Because the data are already spatially separated into bins, additional random colouring is redundant and may suggest distinctions that are not intended. In addition, part A of Figure 2 is split into two panels, with the scale for the left panel shown in the right panel and some of the boxplot colours falling in the range of the scale, but not in line with their counterparts in the left panel. Because the colour use is not consistent, it is difficult to tell whether the same scale should be applied to both panels or how it should be interpreted.
      3.2) The left panel of part A uses a diverging blue-white-red colour scheme, which is most appropriate when the midpoint represents a meaningful central value such as zero. Because the values shown in this graph are only positive, a non-diverging 2-colour scale or a colour palette such as 'viridis' would make the plot easier to interpret.
      3.3) A black background should be avoided: 'B' and 'C' labels are invisible, and it draws attention to a distracting design feature rather than the data themselves.

      (4) Figure 3:
      4.1) Individual snapshots should be separated more clearly, either by using a white background or by adding visible borders to make the overall composition clearer. As currently displayed, some boundaries between fluorescent channels resemble image artifacts rather than intentional panel divisions.
      4.2) In parts B-D, the legend should explain more clearly what each image shows, and the figure itself would benefit from annotations. There seem to be three sub-panels in each 'condition' of part B (as well as C and D): while the middle and rightmost panel can be easily inferred to represent the fluorescent protein and bright-field image, what the leftmost panels represent is not specified. If DAPI was used to dye DNA, an explanation why mostly multiple labelled regions are visible should be provided.
      4.3) Cell morphology and appearance differ markedly between UnaG/smURFP and SNAP-tag images, which should be explained. A microscope issue is mentioned in the main text, but if that was the cause, the authors should consider replacing the images, as the current distortions complicate interpretation.

  2. Jun 2026
    1. eLife Assessment

      This manuscript presents a useful computational framework for systematically characterising how heterogeneity in initial conditions or biophysical parameters shapes the dynamic behaviour of protein signalling networks, with potential relevance to understanding adaptive drug resistance. While the approach represents a significant methodological contribution, the extent to which its conclusions are biologically informative remains debated, as the model is only qualitatively compared with experimental data and lacks quantitative validation. As a result, the strength of evidence supporting the mechanistic claims is viewed as incomplete.

    2. Joint Public Review:

      In this manuscript, the authors proposed an approach to systematically characterise how heterogeneity in a protein signalling network affects its emergent dynamics, with particular emphasis on drug-response signalling dynamics in cancer treatments. They named this approach Meta Dynamic Network (MDN) modelling, as it aims to consider the potential dynamic responses globally, varying both initial conditions (i.e., expression levels) and biophysical parameters (i.e., protein interaction parameters). By characterising the "meta" response of the network, the authors propose that the method can provide insights not only into the possible dynamic behaviours of the system of interest but also into the likelihood and frequency of observing these dynamic behaviours in the natural system.

      The authors study the Early Cell Cycle (ECC) network as a proof of concept, focusing on pathways involving PI3K, EGFR, and CDK4/6 with the aim of identifying mechanisms that may underlie resistance to CDK4/6 inhibition in cancer. The biochemical reaction model comprises 50 state variables and 94 kinetic parameters, implemented in SBML and simulated in Matlab. A central component of the study is the generation of large ensembles of model instances, including 100,000 randomly sampled parameter sets intended to represent intra-tumour heterogeneity. On the basis of these simulations, the authors conclude that heterogeneity in kinetic rate parameters plays a stronger role in driving adaptive resistance than variation in baseline protein expression levels, and that resistance emerges as a network-level property rather than from individual components alone. The revised manuscript provides additional clarification regarding aspects of the simulation and filtering procedures and frames the comparison with experimental data as qualitative. Nonetheless, the study is best interpreted as a theoretical and exploratory analysis of the model's behaviour under heterogeneous conditions. Consequently, questions remain regarding the biological grounding of the sampled parameter regimes and the extent to which the reported frequencies of resistance-associated behaviours can be directly interpreted in physiological terms.

      While the authors propose a potentially useful computational framework to explore how heterogeneity shapes dynamic responses to drug perturbation, a number of important conceptual and methodological concerns remain to be addressed:

      (1) The sampling of kinetic parameters constitutes the backbone of the manuscript, yet important concerns remain regarding its biological grounding and transparency. Although the revised version provides additional clarification on the exploration of "model instances", it is still not sufficiently clear how parameter values and initial conditions are generated, nor how the chosen ranges relate to biological measurements. The kinetic rates are sampled over broad intervals without explicit justification in terms of experimentally measured bounds or inferred distributions. As a consequence, it remains uncertain whether the ensemble of simulated behaviours reflects physiologically plausible cellular regimes or primarily the properties of the assumed parameter space. In this context, the large-scale sampling (100,000 parameter sets) resembles a Monte Carlo exploration of the model rather than a biologically calibrated representation of tumour heterogeneity.

      Furthermore, the adequacy of the sampling strategy in such a high-dimensional space (94 free parameters) remains open to question. In the absence of biologically informed constraints, the combinatorial space of possible parameter configurations is vast, and it is unclear to what extent the sampled ensembles can be considered representative. This issue is particularly relevant because the manuscript interprets the frequency of resistance-associated behaviours as indicative of their likelihood.
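
      A back-of-the-envelope calculation makes the scale of this concern concrete. Assuming, purely for illustration, that each of the 94 parameters is coarsened to just two levels (low or high), the number of distinct combinations already dwarfs an ensemble of 100,000 samples. The two-level grid is an assumption of this sketch and not part of the manuscript.

```python
n_parameters = 94      # free kinetic parameters reported for the model
n_samples = 100_000    # ensemble size reported in the manuscript
levels = 2             # assumption: coarsest possible grid, low/high per parameter

n_corners = levels ** n_parameters
# Upper bound on the share of corners a sample of this size could visit.
fraction_covered = n_samples / n_corners

print(f"corners of the parameter box: {n_corners:.3e}")
print(f"largest share of corners visited: {fraction_covered:.3e}")
```

      Random sampling does not need to visit every corner in order to estimate how often a behaviour occurs under the chosen sampling distribution, so this figure bounds coverage of the space rather than the accuracy of the frequency estimates themselves.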

      The validation presented in Figure 7 does not fully resolve these concerns. The comparison with experimental data is qualitative, and the simulations are performed in arbitrary time units, which complicates direct interpretation alongside time-resolved experimental measurements. Moreover, certain qualitative discrepancies between simulated and experimental trends (e.g., persistent versus decreasing CDK4/6 activity) are not thoroughly discussed. As this figure represents the primary empirical reference point in the manuscript, the extent to which the model captures experimentally observed dynamics remains uncertain.

      Finally, aspects of presentation continue to limit transparency. Parameter ranges are described at different points in the manuscript but are not consolidated clearly in the Methods, and the definition of initial conditions remains ambiguous - particularly whether these correspond to conserved quantities or to the dynamic variables used to initialise simulations. In addition, the exact number of model instances underlying specific analyses and figures is not always explicit. Greater clarity on these issues is essential for assessing reproducibility and for interpreting the quantitative claims of the study.

      (2) A central conclusion of the manuscript is that heterogeneity in protein-protein interaction kinetics is a stronger driver of adaptive resistance than heterogeneity in protein expression levels. To assess the latter, the authors fix a nominal set of kinetic parameters and generate 100,000 random initial concentrations for the 50 model species. However, according to the simulation protocol described in the manuscript, each trajectory includes three phases: (i) simulation under starvation conditions to equilibrium, (ii) mitogenic stimulation to a second ("fed") equilibrium, and (iii) application of drug treatment. The equilibrium concentrations reached in phases (i) and (ii) are determined by the kinetic parameters of the model and are independent of the initial concentrations, provided the system converges to a stable steady state. In dynamical systems terms, stable equilibria are defined by the parameter set and attract all initial conditions within their basin of attraction. Since the kinetic parameters are fixed in this experiment, the pre-treatment equilibrium that serves as the starting point for drug application should likewise be fixed. Under these conditions, it is therefore not unexpected that sampling a large number of initial concentrations has limited influence on the treated dynamics.

      This raises conceptual questions about the interpretation of the comparison between kinetic and expression heterogeneity. If the system converges to a unique stable steady state prior to treatment, then variability in initial concentrations does not propagate into variability in drug response, and the observed dominance of kinetic heterogeneity may partly reflect this structural property of the model rather than a biological principle. Clarification is needed regarding whether multiple steady states exist under the nominal parameter set, and if so, how basins of attraction are explored.
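
      The dynamical-systems point can be seen in a deliberately minimal toy model. The sketch below assumes a single species with constant synthesis and first-order degradation, which is not the ECC model; the rate constants and function name are hypothetical. Because this system has one stable steady state, every starting concentration ends up at the same pre-treatment value.

```python
import math

def concentration(x0, t, k_syn=2.0, k_deg=0.5):
    """Exact solution of dx/dt = k_syn - k_deg * x, starting from x0."""
    x_ss = k_syn / k_deg                       # steady state depends on the rates only
    return x_ss + (x0 - x_ss) * math.exp(-k_deg * t)

initial_values = [0.0, 1.0, 50.0, 1000.0]      # widely different starting points
pre_treatment = [concentration(x0, t=100.0) for x0 in initial_values]

for x0, x in zip(initial_values, pre_treatment):
    print(f"x0 = {x0:7.1f}  ->  x(100) = {x:.6f}")
```

      One caveat applies: in models with conservation laws, the conserved totals are themselves fixed by the initial state and do shift the steady state, which is why it matters whether "initial conditions" refers to conserved totals or to the dynamic variables.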

      More broadly, it remains unclear why initial protein concentrations can be sampled independently of the kinetic parameters. In biological systems, steady-state expression levels are typically determined by the underlying kinetic rates. A more consistent approach might require constraining initial concentrations to correspond to equilibrium states of the chosen parameter set, thereby introducing relationships between at least some of the 50 initial conditions and the 94 kinetic parameters. Finally, the manuscript employs a non-standard terminology regarding "initial conditions," which may further obscure interpretation of these results and would benefit from clarification.

      (3) The technical implementation of the modelling and simulation framework remains difficult to evaluate due to insufficient methodological detail. Although the authors state that kinetic parameters are randomly sampled, the manuscript does not specify the distributions from which parameters are drawn, nor whether potential correlations between parameters are considered or explicitly ignored. Without this information, it is not possible to assess how implicit modelling assumptions shape the ensemble of simulated behaviours. Given that the conclusions rely on frequency-based interpretations across sampled parameter sets, greater transparency regarding the sampling procedure is essential.

      A further concern relates to the parameter filtering step. The authors report that the "vast majority" of sampled parameter sets produced systems that were "too stiff," and that these were excluded on the grounds that stiff dynamics are not biologically plausible. However, the manuscript does not clearly define how stiffness is assessed, nor why stiffness is interpreted as biologically unrealistic rather than as a numerical property of the formulation. In standard practice, stiff systems are typically handled using appropriate implicit solvers rather than being discarded. Similarly, parameter sets that produce negative state values are excluded, yet such behaviour may arise from numerical artefacts rather than from intrinsic model inconsistency. The rationale for excluding these parameter sets, rather than adapting the numerical scheme, is not sufficiently justified.
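
      The distinction between a numerical property and a biological one can be illustrated with a textbook case. The sketch below assumes a single fast first-order decay, dy/dt = -rate * y, with a hypothetical rate and step size; it is not the authors' solver or model. With a step larger than the explicit stability limit, forward Euler produces negative and exploding values for a system whose true solution stays positive, while backward (implicit) Euler with the same step remains stable.

```python
def forward_euler(y0, rate, h, steps):
    """Explicit Euler for dy/dt = -rate * y."""
    y = y0
    for _ in range(steps):
        y = y + h * (-rate * y)
    return y

def backward_euler(y0, rate, h, steps):
    """Implicit Euler for the same equation; the update is solved in closed form."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * rate)
    return y

rate, h = 1000.0, 0.01    # hypothetical fast reaction; h exceeds the explicit limit 2 / rate

explicit_first = forward_euler(1.0, rate, h, steps=1)    # already negative
explicit_late = forward_euler(1.0, rate, h, steps=20)    # oscillates and grows
implicit_late = backward_euler(1.0, rate, h, steps=20)   # decays towards zero

print(explicit_first, explicit_late, implicit_late)
```

      In this toy case the negative values come entirely from the integration scheme, which is the kind of artefact the review suggests separating from genuine model inconsistency.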

      The reported rejection rate - approximately 90% of sampled parameter sets - is substantial and raises questions regarding the interplay between model structure, parameter ranges, and numerical methods. As currently described, the filtering step appears to select parameter sets based primarily on computational tractability rather than on experimentally motivated biological criteria. The manuscript would be strengthened by clarifying whether the retained parameter sets are representative of biologically meaningful regimes, and by distinguishing clearly between exclusions based on biological plausibility and those arising from numerical considerations.

      Finally, important aspects of the simulation protocol require clarification. The model is simulated under "fasted" and "fed" conditions until equilibrium is reached, yet the criterion used to determine convergence is not specified. It would be important to describe how equilibrium is assessed (e.g., based on the norm of the time derivatives). Additionally, it remains unclear whether the mitogenic stimulus applied in the "fed" phase is assumed to be constant over time and, if so, how this assumption relates to biological experimental conditions. Greater detail on these implementation choices is necessary to ensure interpretability and reproducibility.
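
      One way such a convergence criterion could be stated is sketched below. It assumes a hypothetical two-state toy (inactive and active forms of one protein under a constant stimulus), a simple explicit integrator and an arbitrary tolerance; none of these are taken from the manuscript. The run stops once the largest absolute time derivative falls below the tolerance.

```python
def derivatives(state, k_on=1.0, k_off=0.5, stimulus=1.0):
    """Hypothetical toy: inactive <-> active, driven by a constant stimulus."""
    inactive, active = state
    flux = k_on * stimulus * inactive - k_off * active
    return (-flux, flux)

def run_to_equilibrium(state, h=0.01, tol=1e-8, max_steps=1_000_000):
    """Step forward until the largest |dx/dt| falls below tol."""
    for step in range(max_steps):
        rates = derivatives(state)
        if max(abs(r) for r in rates) < tol:
            return state, step
        state = tuple(x + h * r for x, r in zip(state, rates))
    raise RuntimeError("no convergence within max_steps")

equilibrium, steps_taken = run_to_equilibrium((1.0, 0.0))
print(equilibrium, steps_taken)
```

      Reporting the norm used, the tolerance and the maximum integration time would let readers judge whether slowly relaxing model instances were treated as converged.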

      (4) The manuscript states that the modelling conclusions are strongly supported by existing literature; however, the validation presented does not fully substantiate this claim. As noted above, the comparison with CDK2 and CDK4/6 experimental data remains qualitative, and the use of arbitrary simulation time units complicates interpretation of temporal agreement. The extent to which the model quantitatively or mechanistically recapitulates experimentally observed dynamics therefore remains uncertain.

      The claim that the model reproduces known resistance mechanisms is also difficult to assess in light of Figure S10, where a large fraction of network nodes (~80%) appear implicated in resistance under some conditions. If most components of the network can, in at least some parameter regimes, be associated with resistance phenotypes, the resulting lack of selectivity weakens the strength of model-based validation. It becomes challenging to distinguish specific mechanistic insights from generic consequences of network connectivity.

      In addition, the Supplementary Information notes that certain components of the mitogenic and cell-cycle pathways were abstracted or excluded in order to maintain computational tractability. While such abstraction is understandable in a large ODE framework, it raises interpretative questions. Proteins identified as potential resistance drivers within the model may, in some cases, represent aggregated or simplified pathway effects. Clarifying in the main text how such abstractions may influence the attribution of resistance mechanisms would strengthen the biological interpretation of the results.

      Drug inhibition is central to the manuscript's conclusions. The revised version clarifies that inhibition is implemented as a fixed fractional modification of specific kinetic rate laws. This abstraction is appropriate for exploring network-level responses, but it represents a stylised perturbation rather than a pharmacologically calibrated model of drug action. For full interpretability and reproducibility, the mathematical form of the modified rate laws, as well as the timing of inhibition relative to network equilibration, should be specified unambiguously. The biological implications of the findings depend critically on understanding this modelling choice.
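
      To show the level of detail that would remove the ambiguity, the sketch below writes out one possible form of a fixed fractional inhibition. It assumes a mass-action rate law scaled by (1 - inhibition) from a drug-application time onward; the function name, arguments and numbers are hypothetical, and the manuscript's actual rate laws may differ.

```python
def inhibited_rate(k, substrate, t, t_drug, inhibition):
    """Mass-action rate k * substrate, scaled by (1 - inhibition) once t >= t_drug."""
    if not 0.0 <= inhibition <= 1.0:
        raise ValueError("inhibition must be a fraction between 0 and 1")
    factor = 1.0 - inhibition if t >= t_drug else 1.0
    return factor * k * substrate

# Same reaction evaluated before and after a hypothetical drug application at t = 10.
before = inhibited_rate(k=2.0, substrate=3.0, t=5.0, t_drug=10.0, inhibition=0.9)
after = inhibited_rate(k=2.0, substrate=3.0, t=12.0, t_drug=10.0, inhibition=0.9)
print(before, after)
```

      Stating which rate laws carry the factor, the value of the fraction and the application time relative to equilibration is what would make the perturbation reproducible.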

      The one-at-a-time perturbation analysis presented in Figure 5 provides an interpretable ranking of first-order control points across the ensemble and offers mechanistic insight into primary sensitivities of the network. However, many targeted therapies act on multiple components, and resistance frequently arises through combinatorial mechanisms. The reported rankings should therefore be interpreted as identifying primary influences under isolated perturbations, rather than as a comprehensive account of multi-target drug behaviour.

      Overall, the manuscript succeeds in presenting a conceptual and exploratory framework for analysing how signalling network topology can shape the qualitative landscape of adaptive responses under heterogeneous kinetic conditions. Its principal contribution lies in establishing a systematic platform for large-scale in silico exploration. At the same time, the current limitations in biological calibration, parameter grounding, and validation constrain the extent to which the conclusions can be interpreted as predictive or quantitatively representative of specific tumour contexts. Addressing these issues would further strengthen the connection between the theoretical landscape described here and experimentally observed resistance dynamics.

    3. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      This manuscript presents a useful computational framework for systematically characterising how heterogeneity in initial conditions or biophysical parameters shapes the dynamic behaviour of protein signalling networks, with potential relevance to understanding adaptive drug resistance. While the approach represents a significant methodological contribution, the extent to which its conclusions are biologically informative remains debated, as the model is not qualitatively or quantitatively validated against experimental data. As a result, the strength of evidence supporting the mechanistic claims is viewed as incomplete.

      We thank the editors and reviewers for their further assessment of the manuscript. The revised public review raises several issues that overlap with points addressed in our previous response, particularly around the intended scope of MDN modelling, the interpretation of parameter sampling, and the qualitative nature of the experimental comparison. In this final revision, we have made targeted clarifications in the main text, Methods, figure legends, and Supplementary Information to make these points more explicit for readers. We emphasise that the present work is intended as a theoretical and exploratory framework for mapping the qualitative dynamic behaviours accessible to a fixed network topology, rather than as a quantitatively calibrated model of a specific tumour or cell line.

      Joint Public Review:

      In this manuscript, the authors proposed an approach to systematically characterise how heterogeneity in a protein signalling network affects its emergent dynamics, with particular emphasis on drug-response signalling dynamics in cancer treatments. They named this approach Meta Dynamic Network (MDN) modelling, as it aims to consider the potential dynamic responses globally, varying both initial conditions (i.e., expression levels) and biophysical parameters (i.e., protein interaction parameters). By characterising the "meta" response of the network, the authors propose that the method can provide insights not only into the possible dynamic behaviours of the system of interest but also into the likelihood and frequency of observing these dynamic behaviours in the natural system.

      The authors study the Early Cell Cycle (ECC) network as a proof of concept, focusing on pathways involving PI3K, EGFR, and CDK4/6 with the aim of identifying mechanisms that may underlie resistance to CDK4/6 inhibition in cancer. The biochemical reaction model comprises 50 state variables and 94 kinetic parameters, implemented in SBML and simulated in Matlab. A central component of the study is the generation of large ensembles of model instances, including 100,000 randomly sampled parameter sets intended to represent intra-tumour heterogeneity. On the basis of these simulations, the authors conclude that heterogeneity in kinetic rate parameters plays a stronger role in driving adaptive resistance than variation in baseline protein expression levels, and that resistance emerges as a network-level property rather than from individual components alone. The revised manuscript provides additional clarification regarding aspects of the simulation and filtering procedures and frames the comparison with experimental data as qualitative. Nonetheless, the study is best interpreted as a theoretical and exploratory analysis of the model's behaviour under heterogeneous conditions. Consequently, questions remain regarding the biological grounding of the sampled parameter regimes and the extent to which the reported frequencies of resistance-associated behaviours can be directly interpreted in physiological terms.

      While the authors propose a potentially useful computational framework to explore how heterogeneity shapes dynamic responses to drug perturbation, a number of important conceptual and methodological concerns remain to be addressed:

      (1) The sampling of kinetic parameters constitutes the backbone of the manuscript, yet important concerns remain regarding its biological grounding and transparency. Although the revised version provides additional clarification on the exploration of "model instances", it is still not sufficiently clear how parameter values and initial conditions are generated, nor how the chosen ranges relate to biological measurements. The kinetic rates are sampled over broad intervals without explicit justification in terms of experimentally measured bounds or inferred distributions. As a consequence, it remains uncertain whether the ensemble of simulated behaviours reflects physiologically plausible cellular regimes or primarily the properties of the assumed parameter space. In this context, the large-scale sampling (100,000 parameter sets) resembles a Monte Carlo exploration of the model rather than a biologically calibrated representation of tumour heterogeneity.

      Parameters were sampled from a uniform distribution spanning values 10<sup>-5</sup> to 10<sup>4</sup>. Conserved totals were sampled from the range 10<sup>0</sup> to 10<sup>4</sup>. Each of these is roughly in line with measured spans of orders of magnitude for parameter values and protein expression (REF). Again, we would like to point out that we intentionally kept our ranges broad, and sampled from uniform distributions, to assess upper bounds of heterogeneity, not biologically informed heterogeneity. We also comment on the likely effects of expanding these ranges in our response to (26) in our original rebuttal.

      Main text has been updated to include this information. LINES: 175-179

      Furthermore, the adequacy of the sampling strategy in such a high-dimensional space (94 free parameters) remains open to question. In the absence of biologically informed constraints, the combinatorial space of possible parameter configurations is vast, and it is unclear to what extent the sampled ensembles can be considered representative. This issue is particularly relevant because the manuscript interprets the frequency of resistance-associated behaviours as indicative of their likelihood.

      This was addressed extensively in our original rebuttal, response to point (3). A new section was added to the supplementary text, along with new figures demonstrating the validity of the claims.

      The validation presented in Figure 7 does not fully resolve these concerns. The comparison with experimental data is qualitative, and the simulations are performed in arbitrary time units, which complicates direct interpretation alongside time-resolved experimental measurements. Moreover, certain qualitative discrepancies between simulated and experimental trends (e.g., persistent versus decreasing CDK4/6 activity) are not thoroughly discussed. As this figure represents the primary empirical reference point in the manuscript, the extent to which the model captures experimentally observed dynamics remains uncertain.

      This was addressed in the original rebuttal, response to point (12). The actual time units are arbitrary in the sense that they are determined by the units of the parameters in our model. It is important to understand that the meta-dynamic analysis is not calibrated to data and so the meaning of time units is far less important than the distribution of behaviours. We have updated the figure to reflect the arbitrary units of time in our simulations.

      Finally, aspects of presentation continue to limit transparency. Parameter ranges are described at different points in the manuscript but are not consolidated clearly in the Methods, and the definition of initial conditions remains ambiguous - particularly whether these correspond to conserved quantities or to the dynamic variables used to initialise simulations. In addition, the exact number of model instances underlying specific analyses and figures is not always explicit. Greater clarity on these issues is essential for assessing reproducibility and for interpreting the quantitative claims of the study.

      (2) A central conclusion of the manuscript is that heterogeneity in protein-protein interaction kinetics is a stronger driver of adaptive resistance than heterogeneity in protein expression levels. To assess the latter, the authors fix a nominal set of kinetic parameters and generate 100,000 random initial concentrations for the 50 model species. However, according to the simulation protocol described in the manuscript, each trajectory includes three phases: (i) simulation under starvation conditions to equilibrium, (ii) mitogenic stimulation to a second ("fed") equilibrium, and (iii) application of drug treatment. The equilibrium concentrations reached in phases (i) and (ii) are determined by the kinetic parameters of the model and are independent of the initial concentrations, provided the system converges to a stable steady state. In dynamical systems terms, stable equilibria are defined by the parameter set and attract all initial conditions within their basin of attraction. Since the kinetic parameters are fixed in this experiment, the pre-treatment equilibrium that serves as the starting point for drug application should likewise be fixed. Under these conditions, it is therefore not unexpected that sampling a large number of initial concentrations has limited influence on the treated dynamics.

      This raises conceptual questions about the interpretation of the comparison between kinetic and expression heterogeneity. If the system converges to a unique stable steady state prior to treatment, then variability in initial concentrations does not propagate into variability in drug response, and the observed dominance of kinetic heterogeneity may partly reflect this structural property of the model rather than a biological principle. Clarification is needed regarding whether multiple steady states exist under the nominal parameter set, and if so, how basins of attraction are explored.

      More broadly, it remains unclear why initial protein concentrations can be sampled independently of the kinetic parameters. In biological systems, steady-state expression levels are typically determined by the underlying kinetic rates. A more consistent approach might require constraining initial concentrations to correspond to equilibrium states of the chosen parameter set, thereby introducing relationships between at least some of the 50 initial conditions and the 94 kinetic parameters. Finally, the manuscript employs a non-standard terminology regarding "initial conditions," which may further obscure interpretation of these results and would benefit from clarification.

      This was addressed in the original rebuttal, response to point (4). The text was modified to clarify that "initial conditions" refers to the conserved total for each protein species. A supplementary figure (Supp. Fig. 4) was added to demonstrate that changes to the conserved totals of protein species do, in fact, shift the dynamics and steady-state equilibria of those species. The text was also updated so that our definition of ‘initial conditions’ is consistent throughout the paper.

      (3) The technical implementation of the modelling and simulation framework remains difficult to evaluate due to insufficient methodological detail. Although the authors state that kinetic parameters are randomly sampled, the manuscript does not specify the distributions from which parameters are drawn, nor whether potential correlations between parameters are considered or explicitly ignored. Without this information, it is not possible to assess how implicit modelling assumptions shape the ensemble of simulated behaviours. Given that the conclusions rely on frequency-based interpretations across sampled parameter sets, greater transparency regarding the sampling procedure is essential.

      The main text has been updated to clarify that parameters are randomly sampled from a log-transformed uniform distribution. LINES: 175-179
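
      For illustration, one possible reading of "random sampling from a log-transformed uniform distribution" is that the base-10 exponent of each parameter is drawn uniformly. The sketch below assumes that reading; the function names, the seed, and the number of draws used for the comparison are hypothetical and are not taken from the authors' implementation. It also contrasts this with a uniform draw on the linear scale over the same range, since the two place probability mass very differently.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)  # illustrative seed only

# One hypothetical model instance: 94 kinetic parameters in [1e-5, 1e4].
parameter_set = [sample_log_uniform(1e-5, 1e4, rng) for _ in range(94)]

# Contrast the two readings of "uniform" over the same range.
log_draws = [sample_log_uniform(1e-5, 1e4, rng) for _ in range(10_000)]
lin_draws = [rng.uniform(1e-5, 1e4) for _ in range(10_000)]

# Log-uniform: five of the nine decades lie below 1, so roughly 5/9 of draws do too.
print(fraction_below(log_draws, 1.0))
# Linear-uniform: only about 1 in 10,000 draws is expected to fall below 1.
print(fraction_below(lin_draws, 1.0))
```

      Under the linear-scale reading, almost every sampled rate would sit in the top decade of the range, which is why stating the sampling scale explicitly matters for interpreting the ensemble.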

      A further concern relates to the parameter filtering step. The authors report that the "vast majority" of sampled parameter sets produced systems that were "too stiff," and that these were excluded on the grounds that stiff dynamics are not biologically plausible. However, the manuscript does not clearly define how stiffness is assessed, nor why stiffness is interpreted as biologically unrealistic rather than as a numerical property of the formulation. In standard practice, stiff systems are typically handled using appropriate implicit solvers rather than being discarded. Similarly, parameter sets that produce negative state values are excluded, yet such behaviour may arise from numerical artefacts rather than from intrinsic model inconsistency. The rationale for excluding these parameter sets, rather than adapting the numerical scheme, is not sufficiently justified.

      The reported rejection rate - approximately 90% of sampled parameter sets - is substantial and raises questions regarding the interplay between model structure, parameter ranges, and numerical methods. As currently described, the filtering step appears to select parameter sets based primarily on computational tractability rather than on experimentally motivated biological criteria. The manuscript would be strengthened by clarifying whether the retained parameter sets are representative of biologically meaningful regimes, and by distinguishing clearly between exclusions based on biological plausibility and those arising from numerical considerations.

      This was extensively addressed in the original rebuttal, response to points (6) and (7). The main text was updated to clarify that a solver specific for stiff systems was used. Furthermore, once this issue was addressed, subsequent analysis revealed that a lack of drug response and failure to reach steady state within the simulated time period now accounted for the majority of filtering. This had no effect on the distributions of behaviours identified in our analyses. The main text was updated to reflect these changes, and the rejection rate is now explicitly discussed there.
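
      To make the stiffness point concrete, the following toy sketch compares an explicit and an implicit (backward) Euler step on a single fast decay reaction. It is not the solver used in the manuscript; the rate constant, step size, and function names are assumptions chosen only to show why an explicit scheme can produce divergent or negative values on a stiff system while an implicit scheme remains stable at the same step size.

```python
def explicit_euler(k, y0, h, steps):
    """Forward Euler for dy/dt = -k * y."""
    y = y0
    for _ in range(steps):
        y = y + h * (-k * y)
    return y


def backward_euler(k, y0, h, steps):
    """Backward Euler for dy/dt = -k * y; the implicit update has a closed form here."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * k)
    return y


k = 1e4   # hypothetical fast rate constant
h = 1e-2  # step size far larger than the 1/k timescale
steps = 5

# Each explicit step multiplies y by (1 - h*k) = -99, so the result
# alternates in sign and grows in magnitude.
print(explicit_euler(k, 1.0, h, steps))

# Each implicit step divides y by (1 + h*k) = 101, so the result
# stays positive and decays towards zero.
print(backward_euler(k, 1.0, h, steps))
```

      In this toy case the negative values come entirely from the numerical scheme, not from the model, which is the distinction the reviewers ask the authors to draw when parameter sets are excluded.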

      Finally, important aspects of the simulation protocol require clarification. The model is simulated under "fasted" and "fed" conditions until equilibrium is reached, yet the criterion used to determine convergence is not specified. It would be important to describe how equilibrium is assessed (e.g., based on the norm of the time derivatives). Additionally, it remains unclear whether the mitogenic stimulus applied in the "fed" phase is assumed to be constant over time and, if so, how this assumption relates to biological experimental conditions. Greater detail on these implementation choices is necessary to ensure interpretability and reproducibility.

      This was addressed in the original rebuttal, response to point (8). Clarifications about the simulations were added to the main text, including an explicit statement that mitogenic and drug inputs were continuous stepwise functions and a description of how steady-state equilibrium was defined and calculated.

      (4) The manuscript states that the modelling conclusions are strongly supported by existing literature; however, the validation presented does not fully substantiate this claim. As noted above, the comparison with CDK2 and CDK4/6 experimental data remains qualitative, and the use of arbitrary simulation time units complicates interpretation of temporal agreement. The extent to which the model quantitatively or mechanistically recapitulates experimentally observed dynamics therefore remains uncertain.

      This was addressed in the original rebuttal, response to points (13) and (14). Wording was changed to remove the suggestion of strong evidence and the tone was shifted to reflect reasonable qualitative support for our analysis, not strong evidence.

      The claim that the model reproduces known resistance mechanisms is also difficult to assess in light of Figure S10, where a large fraction of network nodes (~80%) appear implicated in resistance under some conditions. If most components of the network can, in at least some parameter regimes, be associated with resistance phenotypes, the resulting lack of selectivity weakens the strength of model-based validation. It becomes challenging to distinguish specific mechanistic insights from generic consequences of network connectivity.

      In addition, the Supplementary Information notes that certain components of the mitogenic and cell-cycle pathways were abstracted or excluded in order to maintain computational tractability. While such abstraction is understandable in a large ODE framework, it raises interpretative questions. Proteins identified as potential resistance drivers within the model may, in some cases, represent aggregated or simplified pathway effects. Clarifying in the main text how such abstractions may influence the attribution of resistance mechanisms would strengthen the biological interpretation of the results.

      This was addressed in the original rebuttal, response to point (15). The discussion was significantly revised to reflect our reasoning with respect to our conclusions. We fully understand that more work could be done to verify our claims; however, our intention is to demonstrate the generalised relationship between network heterogeneity and drug resistance, not to predict patient-specific resistance mechanisms.

      Drug inhibition is central to the manuscript's conclusions. The revised version clarifies that inhibition is implemented as a fixed fractional modification of specific kinetic rate laws. This abstraction is appropriate for exploring network-level responses, but it represents a stylised perturbation rather than a pharmacologically calibrated model of drug action. For full interpretability and reproducibility, the mathematical form of the modified rate laws, as well as the timing of inhibition relative to network equilibration, should be specified unambiguously. The biological implications of the findings depend critically on understanding this modelling choice.

      All equations were included in the supplementary model files, including typeset ODEs, as requested by the reviewers. R15 and R27 contain the relevant equations, which specify the exact implementation of the drug inhibition. Number of time units per simulation phase now included in main text. LINES: 166 – 168
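
      As a reading aid, the sketch below shows one way a "fixed fractional modification" of a rate law and a three-phase protocol (starvation, mitogenic stimulation, drug) could be written for a single toy activation cycle. It is not a reproduction of reactions R15 or R27 from the supplementary model files; the rate law, all parameter values, the 90% inhibition, and the derivative-based steady-state tolerance are assumptions made for illustration.

```python
def activation_rate(active, total, k_act, k_deact, stimulus, inhibition):
    """dA/dt for a toy activation cycle.

    The drug scales the activation term by (1 - inhibition); the stimulus is a
    constant step input that is either off (0.0) or on (1.0) within a phase.
    """
    effective = k_act * stimulus * (1.0 - inhibition)
    return effective * (total - active) - k_deact * active


def run_phase(active, total, k_act, k_deact, stimulus, inhibition,
              h=0.1, tol=1e-9, max_steps=100_000):
    """Integrate one phase until |dA/dt| < tol, then return the steady state."""
    effective = k_act * stimulus * (1.0 - inhibition)
    for _ in range(max_steps):
        if abs(activation_rate(active, total, k_act, k_deact,
                               stimulus, inhibition)) < tol:
            return active
        # Backward Euler step, solved in closed form because the rate law
        # is linear in `active`.
        active = (active + h * effective * total) / (1.0 + h * (effective + k_deact))
    raise RuntimeError("steady state not reached within max_steps")


total, k_act, k_deact = 1.0, 2.0, 1.0  # hypothetical values

# Phase (i): starvation, no stimulus and no drug.
starved = run_phase(0.5, total, k_act, k_deact, stimulus=0.0, inhibition=0.0)
# Phase (ii): mitogen switched on, started from the starved steady state.
fed = run_phase(starved, total, k_act, k_deact, stimulus=1.0, inhibition=0.0)
# Phase (iii): drug applied to the equilibrated "fed" state (90% inhibition).
treated = run_phase(fed, total, k_act, k_deact, stimulus=1.0, inhibition=0.9)

print(starved, fed, treated)
```

      Because each phase starts from the steady state of the previous one, the timing of inhibition relative to equilibration is explicit in the code, which is the property the reviewers ask to see stated unambiguously.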

      The one-at-a-time perturbation analysis presented in Figure 5 provides an interpretable ranking of first-order control points across the ensemble and offers mechanistic insight into primary sensitivities of the network. However, many targeted therapies act on multiple components, and resistance frequently arises through combinatorial mechanisms. The reported rankings should therefore be interpreted as identifying primary influences under isolated perturbations, rather than as a comprehensive account of multi-target drug behaviour.

      Overall, the manuscript succeeds in presenting a conceptual and exploratory framework for analysing how signalling network topology can shape the qualitative landscape of adaptive responses under heterogeneous kinetic conditions. Its principal contribution lies in establishing a systematic platform for large-scale in silico exploration. At the same time, the current limitations in biological calibration, parameter grounding, and validation constrain the extent to which the conclusions can be interpreted as predictive or quantitatively representative of specific tumour contexts. Addressing these issues would further strengthen the connection between the theoretical landscape described here and experimentally observed resistance dynamics.

      Joint Recommendations for the authors:

      (1) Supplementary Figure S4 is not sufficiently explained in its current form. The structure of the figure, the meaning of its colour coding, and the intended interpretation are not clearly described, making it difficult for readers to extract the key message without substantial inference. Given that the manuscript relies heavily on large-scale ensemble analyses, clear visual communication is essential. A more detailed legend, explicit definition of axes and colour scales, and improved visual labelling would substantially enhance clarity, accessibility, and reproducibility.

      Supp. Fig. 4 legend updated with additional detail. LINES: Supp. Text. 256 - 263

      (2) The approximately 90% rejection rate of sampled parameter sets should be reported explicitly in the main text of the manuscript rather than only in the Supplementary Information. Given the central role of large-scale parameter sampling in the study, this level of exclusion is a critical aspect of the modelling workflow and directly affects the interpretation of robustness and representativeness. Clear disclosure in the main text would allow readers to properly evaluate the effective size of the analysed ensemble and the implications of the filtering procedure for the generality of the conclusions.

      This was explicitly addressed in the original rebuttal.

      (3) The model would benefit from quantitative validation against experimental data. In Figure 7C, the authors note in the response letter that the simulations are performed in arbitrary time units. However, the figure itself labels the time axis in hours, which may lead readers to infer a direct quantitative correspondence between simulated and experimental time courses. If the simulations are not calibrated to real time, this labelling is potentially misleading and should be corrected. Either the model should be explicitly time-calibrated and quantitatively compared to experimental measurements, or the figure should clearly indicate that the time axis is dimensionless. Clarifying this point is essential to avoid overinterpretation of the agreement between model and data.

      Label updated.


      The following is the authors’ response to the original reviews.

      Joint Public Reviews:

      In this manuscript, the authors proposed an approach to systematically characterise how heterogeneity in a protein signalling network affects its emergent dynamics, with particular emphasis on drug-response signalling dynamics in cancer treatments. They named this approach Meta Dynamic Network (MDN) modelling, as it aims to consider the potential dynamic responses globally, varying both initial conditions (i.e., expression levels) and biophysical parameters (i.e., protein interaction parameters). By characterising the "meta" response of the network, the authors propose that the method can provide insights not only into the possible dynamic behaviours of the system of interest but also into the likelihood and frequency of observing these dynamic behaviours in the natural system.

      The authors studied the Early Cell Cycle (ECC) network as a proof of concept, specifically focusing on PI3K, EGFR, and CDK4/6, with particular interest in identifying the mechanisms that cancer could potentially exploit to display drug resistance. The biochemical reaction model consists of 50 equations (state variables) with 94 kinetic parameters, described using SBML and computed in Matlab. Based on the simulations, the authors concluded the following main points: a large number of network states can facilitate resistance, the individual biophysical parameters alone are insufficient to predict resistance, and adaptive resistance is an emergent property of the network. Finally, the authors attempt to validate the model's prediction that differential core sub-networks can drive drug resistance by comparing their observations with the knock-out information available in the literature. The authors identified subnetworks potentially responsible for drug resistance through the inhibition of individual pathways. Importantly, some concerns regarding the methodology are discussed below, putting in doubt the validity of the main claims of this work.

      While the authors proposed a potentially useful computational approach to better understand the effect of heterogeneity in a system's dynamic response to a drug treatment (i.e., a perturbation), there are important weaknesses in the manuscript in its current form:

      (1) It is unclear how the random parameter sets (i.e., model instances) and initial conditions are generated, and how this choice biases or limits the general conclusions for the case studied. Particularly, it is not evident how the kinetic rates are related to any biological data, nor if the parameter distributions used in this study have any biological relevance.

      (2) Related to this problem, it is not clear whether the considered 100,000 random parameter samples sufficiently explore parameter space due to the combinatorial explosion that arises from having 94 free parameters, nor 100,000 random initial conditions for a system with 50 species (variables).

      (3) Moreover, the authors filter out all the cases with stiff behaviour. This filtering step appears to select model parameters based on computational convenience, rather than biological plausibility.

      (4) Also, it is not clear how exactly the drug effect is incorporated into the model (e.g., molecular inhibition?), nor how it is evaluated in the dynamic simulations (e.g., at the beginning of the simulation?). Moreover, in a complex network, the results may differ depending on whether the inhibition is applied from the start or after the network has reached a stable state.

      (5) On the same line, the conclusions need to be discussed in the context of stability, particularly when evaluating the role of initial conditions. As stable steady states are determined by the model parameters, once again, the details of how the perturbation effect is evaluated on the simulation dynamics are critical to interpret the results.

      (6) The presented validation of the model results (Fig. 7) is only qualitative, and the interpretation is not carefully discussed in the manuscript, particularly considering the comparison between fold-change responses without specifying the baseline states.

      We thank the reviewers for their thoughtful and constructive comments. In response, we have substantially revised the manuscript to address every comment and to improve clarity, transparency, and robustness while preserving the paper’s core contribution: a principled, scalable framework (MDN) for mapping how molecular heterogeneity and network architecture shape adaptive drug-response dynamics. At a high level, we clarified the study design and analysis goals, tightened definitions, and added methodological detail where it most advances interpretability. Importantly, these updates leave the analytical pipelines and major conclusions unchanged.

      Conceptually, we now make explicit that our objective is coverage of the output space of qualitative dynamics supported by the network topology, not exhaustive enumeration of parameter space. To support this, we added a convergence analysis and clarified that “triplicates” refers to independent ensembles used to demonstrate reproducibility. We also refined how we describe and implement initial conditions (as conserved total abundances that encode expression heterogeneity) and reframed filtering as minimal numerical/feasibility checks, using rejection sampling to obtain the prespecified ensemble size. Solver choices and input modelling (constant step mitogen/drug) are now spelled out succinctly.

      We expanded the model specification and rationale (complete reaction list with rate laws and brief biological justifications in the Supplement) and unified terminology throughout. Figures and legends have been overhauled for readability and accuracy, with missing labels added and ordering corrected. For validation, we clarified the nature of the single-cell reporter readout, improved Figure 7’s presentation, and emphasised - consistent with our aims - that comparisons are qualitative.

      Finally, we have rewritten the Discussion to centre on interpretation and implications and to connect our findings to the literature. It now: (i) frames MDN as a systems-level framework that links molecular heterogeneity to qualitative signalling “meta-dynamics” and adaptive escape under constant drug pressure; (ii) highlights two key findings: an asymmetry in control (interaction kinetics exert stronger, more consistent influence than protein abundance) and a topology-driven convergence whereby a vast parameter space funnels into a finite set of recurrent behaviours; (iii) shows that resistance is a network-level property, with many possible routes but a small set of recurrent hubs/modules dominating; and (iv) provides a qualitative alignment with single-cell reporter data while clarifying the intent and limits of that comparison. Moreover, we now explicitly discuss limitations (rate-law simplifications, broad priors, determinism, and modular abstractions) and outline next steps for future research, including data-constrained priors and stochastic extensions.

      We believe these revisions materially strengthen the manuscript and fully address all the reviewers’ comments. A detailed, point-by-point response follows.

      Joint Recommendations for the Authors:

      (1) It is unclear exactly which sets are evaluated in each case, e.g. "generated 100,000 model instances, each with the same set of ICs but a unique set of randomly generated parameter values" (lines 299-300), "generated 100,000 model instances (in triplicate), each with the same set of 'nominal' parameter values (see supplementary Table S1), and a unique set of ICs, and repeated the analysis as performed previously" (lines 366-368), "combined the 1000 IC sets with each parameter set to create 1000 model instances" (lines 382-383), "repeated for 1000 parameter sets, allowing us to observe how frequently IC variation induced adaptive resistance independent of the chosen parameter set" (lines 386-387). A small table or simply a clearer explanation is needed.

      In response to these comments, we have revised the main text to clarify the process of model instance generation. Specifically, we have made changes at page 7: line 297 - page 8: line 302, page 8: lines 305 - 310, page 9: lines 372-378, and page 9: line 384 – page 10: line 399 in the revised main text.

      We have also added a new Figure (Figure S1) to the supplementary file to allow readers to visualise the model generation process for each relevant set of experiments. Supplementary figures are referenced in the main text where appropriate.

      (2) The authors mentioned performing each simulation in triplicate, which is puzzling as the model is based on deterministic ODEs with fixed parameters for each simulation. Under such conditions, one would anticipate identical results from multiple simulations with the same initial conditions and fixed parameters. Perhaps the authors expect the model to exhibit chaos or aim to assess the precision of the parameter estimates through triplicate simulations. Further clarification from the authors would be valuable to comprehend the rationale behind conducting triplicate simulations in a deterministic setting.

      We agree that repeating deterministic ODE simulations with identical inputs would be redundant. In our study, “triplicate” referred instead to generating three independent ensembles of 100,000 unique model instances each, where model parameters (or initial conditions) were randomly resampled. These ensembles were analysed separately to assess whether the inferred meta-dynamic distributions converged robustly. Indeed, the distributions from the three replicates were nearly indistinguishable, confirming that the results are reproducible and not artefacts of a particular random draw.

      We have revised the main text to clarify this distinction (page 8: lines 305 - 310) and added an extended explanation for meta-dynamic behaviour convergence in the new section Error Convergence in the supplementary text (page 6: lines 184 - 210).

      (3) While the lack of a connection between model parameters and biological data (mentioned in the public review) may not be a fatal flaw in the manuscript, the concern about the 100,000 random samples being insufficient to explore the parameter space is valid. In a thought experiment, considering the high and low rate for each parameter and the combinatorial explosion of possibilities (2^94), the number of simulations performed (100,000) represents only an extremely small fraction of the entire parameter space (~1/10^(23)). This limitation might not accurately capture the true heterogeneity present inside a solid tumour. One potential solution is to determine biological bounds on model parameters through data fitting, which can provide more meaningful constraints for the simulations. Alternatively, increasing the number of simulations and adopting more efficient sampling techniques can enhance the coverage of possible parameter sets.
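
      The arithmetic behind this thought experiment can be checked in a few lines. The sketch below assumes exactly what the comment states (each of the 94 parameters takes one of two values, "high" or "low") and nothing more; the function name is hypothetical.

```python
import math


def log10_coverage(n_binary_dimensions, n_samples):
    """log10 of the fraction of a 2**n corner grid visited by n_samples distinct samples."""
    return math.log10(n_samples) - n_binary_dimensions * math.log10(2)


n_corners = 2 ** 94
print(n_corners)                    # exact integer count, on the order of 2e28
print(log10_coverage(94, 100_000))  # close to -23.3
```

      A value of about -23.3 corresponds to a sampled fraction of roughly 5 in 10^24, which is consistent with the order of magnitude (~1/10^(23)) quoted in the comment.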

      We thank the reviewer for this insightful comment. We agree that the 94-dimensional parameter space is vast, and that 100,000 simulations represent only a fraction of the total combinatorial possibilities. However, the objective of our study is not to exhaustively sample the entire parameter space, but rather to sufficiently sample the ‘output space’ - that is, the complete spectrum of qualitative dynamic behaviours the network topology can generate. The key question is whether 100,000 model instances are sufficient for the distribution of these output dynamics to converge.

      To formally address this, we have performed a convergence analysis, which is now detailed in the new supplementary section "Error Convergence" (Supplementary text page 6: lines 184 - 210) and illustrated in Supplementary Figure S12. This analysis demonstrates that the mean squared error (MSE) between dynamic distributions from N and 2N simulations exponentially decreases as N increases, and the distribution of protein dynamics changes negligibly well before reaching 100,000 instances. Furthermore, performing the entire analysis in triplicate with independent random seeds yielded nearly identical meta-dynamic maps (average standard deviation < 0.04%), giving us high confidence that we have robustly captured the network's behavioural repertoire.
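
      A minimal sketch of this kind of N versus 2N comparison is given here for orientation. It does not use the ECC model: each "simulation" is replaced by a draw from a fixed categorical distribution over behaviour labels. The label set (beyond 'rebound' and 'decreasing', which appear in the text), the weights, the seed, and the choice to nest the N-sample ensemble inside the 2N-sample ensemble are all assumptions.

```python
import random
from collections import Counter

# Hypothetical behaviour classes and their (unknown in practice) true frequencies.
BEHAVIOURS = ["rebound", "decreasing", "increasing", "no_change"]
WEIGHTS = [0.1, 0.4, 0.2, 0.3]


def frequencies(labels):
    """Empirical frequency of each behaviour class, in BEHAVIOURS order."""
    counts = Counter(labels)
    return [counts[b] / len(labels) for b in BEHAVIOURS]


def mse(p, q):
    """Mean squared error between two frequency vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def convergence_curve(sizes, rng):
    """MSE between the distributions from the first N and the first 2N draws."""
    labels = rng.choices(BEHAVIOURS, weights=WEIGHTS, k=2 * max(sizes))
    return [(n, mse(frequencies(labels[:n]), frequencies(labels[:2 * n])))
            for n in sizes]


rng = random.Random(1)  # illustrative seed only
curve = convergence_curve([100, 1_000, 10_000, 100_000], rng)
for n, err in curve:
    print(n, err)
```

      In this toy setting the expected MSE shrinks roughly in proportion to 1/N, as is typical for Monte Carlo estimates of frequencies. Note that such a curve measures how stable the estimated frequencies are under the chosen sampling distribution; it does not by itself show that the sampling distribution is biologically representative.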

      We believe this convergence occurs because the system is degenerate: many distinct parameter sets within the high-dimensional space map to the same qualitative outcome (e.g., 'rebound' or 'decreasing'). Our goal was to capture the set of possible outcomes, not every unique parameter combination that leads to them.

      Regarding the parameter range, we intentionally chose a broad, unbiased range (10<sup>-5</sup> to 10<sup>4</sup>) as a proof-of-concept to delineate the theoretical upper limit of heterogeneity the network can support, thereby capturing even rare but potentially critical resistance dynamics. We agree with the reviewer that a future direction is to constrain these ranges using biological data. Such an approach would transition from defining what is possible (the focus of this manuscript) to predicting what is probable in a specific biological context. We have added this important point to the Discussion (page 16: lines 663-679) to highlight this avenue for future work.

      (4) One of the manuscript's main results indicates that protein interactions play a more significant role in driving adaptive resistance than protein expression. To explore the impact of protein expression, the authors fixed a nominal parameter set and generated 100,000 initial concentrations of the 50 proteins in the ODE model. However, the simulations' equilibrium concentrations in the "starvation" and "fed" phases, which form the initial condition for the treated phase, are uniquely determined by the nominal model's kinetic parameters and not the initial conditions, which remain identical for each simulation. From a dynamical systems perspective, stable steady states are determined by the model parameters and attract all initial conditions within their basin of attraction. As a result, a random sampling of the initial conditions has a limited impact on the model dynamics. The authors' conclusion that "the ability of expression to induce resistance also seems to be dependent on the master parameter set" can be explained by this dynamical systems perspective, where the resistance state corresponds to a stable steady state determined by the master parameter set. Considering this, the evidence presented in the manuscript may not fully support the authors' conclusion regarding the importance of protein expressions relative to protein dynamics. The discrepancy might be attributed to a possible misunderstanding of this point, and further clarification from the authors could be helpful.

      We thank the reviewer for the thoughtful perspective. We agree that, in a monostable system with fixed kinetic parameters and fixed conserved totals, varying only the initial split among moieties (e.g., X vs pX) will not change the final steady state; trajectories converge to the same attractor. In our analysis, however, “initial conditions” predominantly refer to total protein abundances (e.g., X_tot = X + pX + complexes), used as a proxy for expression heterogeneity. These totals are invariants on the simulated timescale (no synthesis/degradation in the pre-equilibration phases), and therefore alter the value of the steady state under a given parameter set. In other words, our IC sampling mostly varies conserved totals rather than merely redistributing a fixed total; hence the equilibrium reached after the starvation/fed pre-equilibrations depends on the sampled totals and the kinetics. This can be seen in the new Supplementary Figure S4, showing that changing the ICs does shift the eventual steady state even when kinetic parameters are fixed.
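
      The distinction between the initial split and the conserved total can be seen in a toy two-state cycle (X and pX interconverting with first-order rates). This is a minimal sketch under assumed rate constants `kf` and `kr`, integrated with a simple explicit Euler scheme; it is not the 50-species model from the manuscript.

```python
def simulate_cycle(x0, px0, kf=2.0, kr=1.0, dt=0.001, steps=20000):
    """Integrate X <-> pX with explicit Euler; X + pX is conserved."""
    x, px = x0, px0
    for _ in range(steps):
        flux = kf * x - kr * px  # net phosphorylation flux
        x -= flux * dt
        px += flux * dt
    return x, px

# Same total (1.0), different initial split: same steady state.
_, px_a = simulate_cycle(1.0, 0.0)
_, px_b = simulate_cycle(0.2, 0.8)
# Different total (3.0): the steady state shifts.
_, px_c = simulate_cycle(3.0, 0.0)
print(px_a, px_b, px_c)
```

      For this linear cycle the steady state is pX = kf / (kf + kr) × X_tot, so it scales with the conserved total and is insensitive to how that total is initially divided.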

      We have revised the text to: (1) define ICs explicitly as total abundances for multi-state species, (2) distinguish “initial split” from “conserved totals,” and (3) clarify that expression effects are context-dependent rather than universally dominant (page 4: lines 139-141 and page 10: lines 413-416).

      (5) Additionally, it is important to note that the random sampling of 100,000 initial concentrations might not sufficiently explore the vast space of possible initial conditions. In the thought experiment mentioned earlier, where each protein can have high or low expression concentrations, there are approximately 2^(50) = ~10^(15) possible combinations of initial concentrations. Thus, the 100,000 random simulations only represent around ~1/10^(10) of the possible initial conditions in this simplistic scenario. Consequently, this limited sampling of initial conditions may not provide enough information to draw meaningful conclusions, even if the initial conditions were more directly linked to kinetic rates.

      Please see our response to Comment (3). Briefly, our ICs are continuous total abundances (conserved moieties), not binary high/low states; many IC configurations converge to the same qualitative attractors, so we estimate distributional properties rather than enumerate all combinations. Our convergence diagnostics (independent replicates and sample-size doubling) show that the meta-dynamic distributions stabilise well before N=100,000 (see Supplementary Figure S12). We have clarified this in the Supplementary Information (Error Convergence section) with the new convergence results.
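
      A sample-size-doubling check of this kind can be sketched as follows: the frequency of each dynamic class is computed for the first N instances and for 2N instances, and the largest change is reported. The class names, weights, and the helper `max_frequency_shift` are hypothetical and serve only to illustrate the idea; the actual diagnostics are those reported in the Supplementary Information.

```python
import random
from collections import Counter

def class_frequencies(labels):
    """Fraction of instances falling in each dynamic class."""
    counts = Counter(labels)
    n = len(labels)
    return {k: v / n for k, v in counts.items()}

def max_frequency_shift(labels_small, labels_large):
    """Largest absolute change in any class frequency between two samples."""
    f1 = class_frequencies(labels_small)
    f2 = class_frequencies(labels_large)
    return max(abs(f1.get(k, 0.0) - f2.get(k, 0.0)) for k in set(f1) | set(f2))

# Hypothetical classified instances standing in for simulated dynamics.
rng = random.Random(0)
classes = ["rebound", "sustained_decrease", "no_response"]
labels = rng.choices(classes, weights=[0.3, 0.5, 0.2], k=40000)
shift = max_frequency_shift(labels[:20000], labels)
print(shift)
```

      A small shift on doubling suggests that the estimated distribution has stabilised at the smaller sample size.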

      (6) The authors implement a parameter selection step in the manuscript, where they filter out parameter sets that lead to what they term non-biological simulations. However, the rationale for determining if a given parameter set results in a stiff system of ODEs remains unclear. The authors cite references [38,39] to support the claim that stiff equations are not biologically plausible. Still, upon review, it is evident that [38] does not include the term "stiff," and [39] discusses using implicit methods to simulate stiff ODE models without specifically commenting on the biological plausibility of stiff systems. The manuscript lacks direct evidence to justify the conclusion that filtering out parameter sets that result in stiff ODE systems is reasonable. Since the filtering step accounts for the majority of discarded parameter sets, a stronger foundation is required to support the statement that stiff equations are non-biological.

      We thank the reviewer for pointing out the issue in our original justification. The reviewer is correct: stiff systems are a common feature of biological models, and our claim that they are likely ‘biologically implausible’ was not well substantiated. The filtering of these model instances was, in fact, due to a computational limitation rather than a biological principle. The issue was that these parameter sets produced systems of ODEs that were so numerically stiff they were unsolvable within a reasonable timeframe by the SUNDIALS ODE solver suite, which is specifically designed for such systems.

      Following the reviewer's comment, we investigated the source of this prohibitive stiffness. We discovered it was not an intrinsic property of the parameter sets themselves, but rather an artefact of our simulation setup. The extreme stiffness occurred almost exclusively during the initial integration timesteps and was caused by the large initial discrepancy between the concentrations of active and inactive protein forms. This discrepancy produced excessively stiff systems, i.e. systems that could not be solved with the ODE solver settings we had implemented. To overcome this problem, we set a large maximum number of steps in the ODE solver for the first couple of time points, enabling the solver to get past the excessively stiff portion of the solve. We found that the vast majority of the previously 'unsolvable' model instances could now be successfully simulated. Consequently, the number of parameter sets discarded due to solver failure is now negligible (< 1%), and this filtering step no longer accounts for the majority of discarded parameter sets. Most importantly, the distributions of dynamics were not significantly altered by this adaptation.

      We have revised the "Sampling and filtering of model instances (page 5: lines 174 – 189)" part in the Methods section to reflect this more accurate understanding. We have corrected our original claim regarding the biological plausibility of stiff systems and corrected our use of the references. Ref [38] was included to demonstrate that models of biological systems are stiff, which was a major conclusion of that paper, and [39] was originally included to demonstrate that solving ODEs is reliant on solvers that can integrate stiff systems. Upon review, ref [39] has been removed.

      Overall, this investigation has made our analysis more robust by allowing us to include a wider, more representative range of parameter sets, and has tangibly improved the quality of our study.

      (7) Additionally, it is important to consider the standard method for accounting for stiff systems, as presented in [39], which involves using implicit numerical methods for ODE simulation. The authors mention using numerical methods from the SUNDIALS suite, which includes implicit methods, but the specific numerical method used remains unclear. Furthermore, it would be valuable for the authors to disclose the number of parameter sets that were filtered to obtain the final set of 100,000 accepted parameter sets. This information would provide insights into the extent of filtering and the proportion of parameter sets that were excluded during the selection process.

      We apologise for the lack of specific detail and have now updated the text. To clarify, all ODE simulations were performed using the CVODE solver from the SUNDIALS suite. This solver employs an implicit, variable-order, variable-step Backward Differentiation Formula (BDF) method, which is robust and specifically designed for handling the stiff systems common in biological network modelling. We have now explicitly stated this in the "ODE model construction, modelling, and simulations (page 4: lines 162 – 164)" section of the Methods.
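
      To illustrate why an implicit method matters for stiff problems, the sketch below compares explicit (forward) Euler with backward Euler, which is the first-order member of the BDF family, on the stiff test equation y' = -λy. It is a toy example in plain Python with an assumed λ and step size; it does not call CVODE and does not reproduce its variable-order, variable-step behaviour.

```python
def forward_euler(lam, y0, dt, steps):
    """Explicit Euler for y' = -lam * y."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-lam * y)
    return y

def backward_euler(lam, y0, dt, steps):
    """Implicit (backward) Euler for y' = -lam * y.

    The implicit update y_new = y + dt * (-lam * y_new) is solved in
    closed form for this linear problem.
    """
    y = y0
    for _ in range(steps):
        y = y / (1.0 + lam * dt)
    return y

lam, dt, steps = 1000.0, 0.01, 50  # lam * dt = 10: far outside explicit stability
y_explicit = forward_euler(lam, 1.0, dt, steps)
y_implicit = backward_euler(lam, 1.0, dt, steps)
print(y_explicit, y_implicit)
```

      With this step size the explicit scheme grows without bound even though the true solution decays, whereas the implicit scheme decays for any positive step size.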

      Regarding the filtered parameters, we have included a revised and detailed discussion of this in the "Sampling and filtering of model instances (page 5: lines 174 – 189)" part in the Methods section (see our response to comment (6) above). Briefly, after applying the filters, ~40–45% of instances did not reach steady state within the simulation timeframe, and ~50–55% did not meet the minimum drug-response criterion. Approximately 10% satisfied all criteria and were retained for analysis. Importantly, we employed ‘rejection sampling’ and continued drawing until we had N = 100,000 accepted instances that satisfied all the criteria.
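
      The rejection-sampling loop can be sketched as below. The functions `draw` and `accept` are hypothetical placeholders: in this toy version a uniform random number stands in for a sampled model instance, and a threshold of 0.1 stands in for the combined steady-state and drug-response filters (chosen to mimic the roughly 10% acceptance rate quoted above).

```python
import random

def rejection_sample(draw, accept, n_accept):
    """Keep drawing candidates until n_accept of them pass the filter."""
    accepted, n_drawn = [], 0
    while len(accepted) < n_accept:
        candidate = draw()
        n_drawn += 1
        if accept(candidate):
            accepted.append(candidate)
    return accepted, n_drawn

rng = random.Random(0)
accepted, n_drawn = rejection_sample(rng.random, lambda x: x < 0.1, 1000)
print(len(accepted), n_drawn, len(accepted) / n_drawn)
```

      Returning the number of draws alongside the accepted set makes the overall acceptance rate easy to report.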

      (8) An important step in the simulation process described by the authors is the simulation of the "fasted" and "fed" states until an equilibrium is reached. However, it is not clear how the authors determine if the system has reached an equilibrium. It would be helpful if the authors could provide more information regarding the criteria used to assess equilibrium in the simulations. Regarding the "fed" state, it is not explicitly stated whether the mitogen stimulus is assumed to be constant throughout the "fed" experiment. Considering the dynamic nature of mitogen stimulation in biological systems, it would be beneficial if the authors could clarify this assumption and discuss its biological relevance.

      We apologise for not specifying this in the original text. A simulation was considered to have reached equilibrium when the concentration of every protein species changed by < 1% over the final 100 time steps of the simulation phase. We have now added this criterion to the "Sampling and filtering of model instances (page 5: lines 177 – 179)" part of the Methods section.
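
      One way to express this criterion in code is shown below. The sketch assumes that "changed by < 1% over the final 100 time steps" means the relative difference between each species' value 100 steps before the end and its final value; the function name, the trajectory layout (a list of time points, each a list of concentrations), and the small `eps` guard for species near zero are assumptions of this example.

```python
def reached_equilibrium(trajectory, window=100, rel_tol=0.01, eps=1e-12):
    """Return True if every species changed by < rel_tol over the last `window` steps."""
    if len(trajectory) <= window:
        return False  # not enough time points to evaluate the window
    start, end = trajectory[-window - 1], trajectory[-1]
    return all(
        abs(e - s) < rel_tol * max(abs(s), eps)
        for s, e in zip(start, end)
    )

flat = [[1.0, 5.0] for _ in range(200)]        # both species constant
rising = [[float(t), 5.0] for t in range(1, 201)]  # first species still changing
print(reached_equilibrium(flat), reached_equilibrium(rising))
```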

      Regarding the second part of the comment, in our simulations, both the mitogenic and the drug inputs were modelled as constant, stepwise functions that, once turned on, remained at a fixed concentration for the remainder of the simulation. The biological rationale for this choice was to rigorously test for bona fide adaptive resistance. By maintaining a constant mitogenic and drug pressure, we can ensure that any observed recovery in the activity of downstream proteins is due to the internal rewiring and adaptation of the signalling network itself, rather than an artefact of the removal or decay of the external stimulus/drugs. We have now clarified this rationale in the "ODE model construction, modelling, and simulations (page 4: lines 168 – 171)" part of the Methods section.

      (9) The "Description of Model Scope and Construction" section in the Supplementary Information should include explicitly the model reactions and some discussion about their specific form (e.g., why is '(((kc2f1*pIR*PI3K) / (1 + (pS6K/Ki2))) + (kc2f2*pFGFR*PI3K))' representing the phosphorylation rate of PI3K, with pS6K in the denominator?).

      The reviewer is right to ask for model justification. We have expanded the Supplementary “Description of Model Scope and Construction” section (page 2: line 63 – page 5: line 185) to include a complete reaction list with rate laws and a brief rationale for each. We also explain the specific PI3K phosphorylation term: activation by pIR and pFGFR is attenuated by pS6K via a denominator, which captures the well-described S6K-mediated negative feedback that reduces activation (e.g., via IRS1 phosphorylation).
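
      For concreteness, the rate expression quoted in comment (9) can be written as a small function. The function name and the numerical values are illustrative assumptions; the symbols follow the quoted expression, in which the pS6K-dependent denominator divides the pIR-driven term only.

```python
def pi3k_phosphorylation_rate(pIR, pFGFR, PI3K, pS6K, kc2f1, kc2f2, Ki2):
    """Rate law (kc2f1*pIR*PI3K)/(1 + pS6K/Ki2) + kc2f2*pFGFR*PI3K."""
    # pIR-driven activation, attenuated by pS6K negative feedback
    ir_term = (kc2f1 * pIR * PI3K) / (1.0 + pS6K / Ki2)
    # pFGFR-driven activation (no feedback term in the quoted expression)
    fgfr_term = kc2f2 * pFGFR * PI3K
    return ir_term + fgfr_term

# Illustrative values only: all concentrations and constants set to 1.
no_feedback = pi3k_phosphorylation_rate(1, 1, 1, pS6K=0.0, kc2f1=1, kc2f2=1, Ki2=1)
with_feedback = pi3k_phosphorylation_rate(1, 1, 1, pS6K=1.0, kc2f1=1, kc2f2=1, Ki2=1)
print(no_feedback, with_feedback)
```

      When pS6K equals Ki2, the pIR-driven term is halved, which is the usual reading of an inhibition constant in this functional form.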

      (10) In line 349, the statement "Given that CDK46cycD is only strongly suppressed in just under 60% of the model instances (Figure 3C)" lacks clarity regarding where to look to interpret the 60% value. If this means that 4 out of the 7 model instances are resistant, and the other 2 proteins also have the same percentage of resistance, then there is no apparent reason to focus solely on CDK46cycD.

      The reviewer is correct; the figure reference was an error, which has been rectified in the main text (page 9: line 355). The actual figure reference was to Supplementary Figure 2A, which shows the heatmap of all the frequencies for each protein dynamics for all the active protein forms. CDK4/6cycD shows a sustained decreasing dynamic for 59.93% of model instances, which is where this number was derived. We have also now explicitly referenced this number in the supplementary Figure 2A legend.

      We focus on CDK4/6cycD because it is the direct pharmacological target of CDK4/6 inhibitors. Our point was to suggest that even when the target is suppressed in the majority of instances (~60%), this does not reliably propagate to uniform downstream inhibition across the network, thus highlighting emergent, network-driven adaptive responses.

      (11) We observed that in Fig. 5A, the authors show that multiple pathways are blocked. However, it is unclear whether they reduced the value of one parameter in the experiment or simulated multiple combinations of parameter inhibition. Considering the large number of parameters (94) in the model, if the authors simulated all possible combinations of parameter inhibition, the number of combinations would be significantly more than 94. An actual inhibitor typically has an inhibitory effect on multiple molecules. Therefore, it would be necessary to identify the parameters that lead to drug resistance when multiple molecules are inhibited. However, examining the inhibition patterns for all 94 parameters would be practically impossible. As a potential approach, we suggest using ensemble learning techniques, such as random forests, to handle this problem efficiently. With a dataset of binary outputs indicating the presence or absence of resistance for a sufficient number of inhibition patterns, ensemble learning can be applied to find the parameters that contribute to drug resistance. Popular feature selection algorithms like Boruta could be utilised to identify the most relevant parameters. The results obtained by ensemble learning are similar to the ranking in Fig. 5C, potentially providing a more robust validation of the authors' findings. By incorporating these additional analyses, the authors could strengthen the reliability and significance of their results related to parameter inhibition and drug resistance.

      We appreciate the suggestion and the opportunity to clarify. Figure 5A shows that multiple pathways were interrogated, but in the analysis parameters were inhibited one at a time (OAT), not in combination. We have revised the figure legend and added a section named “Protein knockdown perturbation analyses (page 6: lines 228 – 233)” in the Methods section to make this explicit. Moreover, some additional text in the main text has been slightly modified to make this clearer (page 11: lines 462-463, page 24: lines 856-857).

      We chose the OAT design intentionally to obtain causal, first-order attribution of control points across a broad parameter ensemble without confounding from simultaneous co-inhibition. This provides an interpretable ranking of primary drivers (Figure 5C) that is consistent with the paper’s mechanistic focus. We agree that a multi-target inhibition approach could be a useful next step; however, an exhaustive combinatorial screen is beyond the scope of this proof-of-concept. In such future studies, the ensemble learning, as suggested by the reviewer, could be layered onto our MDN framework to assess robustness of the ranking under co-inhibition.
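
      The OAT design can be sketched generically as follows. The callable `simulate`, the parameter names, and the inhibition factor of 0.1 are hypothetical stand-ins: `simulate` is assumed to map a parameter dictionary to a single scalar readout, which is a simplification of the dynamic readouts analysed in the manuscript.

```python
def one_at_a_time(params, simulate, inhibition=0.1):
    """Scale each parameter in turn and record the change in the readout."""
    baseline = simulate(params)
    effects = {}
    for name in params:
        perturbed = dict(params)  # copy so only one parameter changes
        perturbed[name] = params[name] * inhibition
        effects[name] = simulate(perturbed) - baseline
    return effects

# Toy readout standing in for a model simulation.
toy_params = {"a": 1.0, "b": 1.0}
effects = one_at_a_time(toy_params, lambda p: 2.0 * p["a"] + p["b"])
ranking = sorted(effects, key=lambda k: abs(effects[k]), reverse=True)
print(effects, ranking)
```

      With 94 parameters this requires 94 perturbed runs per model instance, whereas an exhaustive pairwise screen alone would already require 4,371.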

      (12) In explaining the parameterization of the model, we find an implication of a quantitative model. However, upon examining the results in Fig. 7D, we observe that they are only qualitatively correct. When comparing Figs. 7A and 7C, we note that many model instances are immediately suppressed, and the time scale remains unknown. We believe it would be essential for the authors to explain how the model of this study maintains its quantitative nature despite the results in Fig. 7. If such an explanation cannot be provided, it raises concerns regarding the biological reliability of several findings within this study.

      While our framework is built on quantitative ODEs, the validation we present in Figure 7 is indeed qualitative. This is an intentional and key feature of our study's design. Our goal was not to build a calibrated, quantitative model of a specific cell line (e.g., MCF10A), but rather to establish a proof-of-concept theoretical framework that systematically explores the full spectrum of dynamic behaviours a given network topology can possibly generate. To achieve this, we intentionally sampled parameters from a very broad, unbiased range to delineate the theoretical upper limit of heterogeneity. This in silico population is therefore designed to be far more heterogeneous than any single isogenic cell line.

      The striking qualitative agreement seen between our meta-dynamic distributions and the single-cell data in Figure 7D is thus not a failure of quantitative prediction, but rather a strong validation of our core premise: that a significant degree of signalling heterogeneity exists in cell populations and that our framework can effectively capture its emergent properties.

      Regarding the specific comment on Figure 7C, we apologise for the lack of clarity. Nominally, we chose to simulate for 24 hours; however, the x-axis in our simulations represents arbitrary time units, as the timescale depends on the units of the parameter values. The goal is to compare the qualitative shape of the response (e.g., rebound, sustained decrease), not the absolute time in hours. Moreover, the rapid initial suppression seen in many of our model instances (Fig 7C) is a direct parallel to the rapid suppression seen in the experimental data (Fig 7A). This initial phase is followed by a wide variety of adaptive behaviours (or lack thereof) in both our simulations and the real cells, which is the key phenomenon we are studying.

      We have revised the text (page 14: lines 598-601) and Figure 7’s legend to state more explicitly that our validation is qualitative and to clarify the purpose of our broad, uncalibrated approach. We have also added a note in the Discussion (page 18: lines 744-747) that calibrating this framework with cell-line-specific data is a natural next step for generating quantitative, context-specific predictions.

      (13) Related to the previous point, the experimental data is presented as fold-change during CDK4/6 inhibition, and we notice that the initial fold-change at time 0 varies between 1 and 1.8. The difference in initial fold-change is unclear to us, as our understanding of fold-change typically corresponds to the change from baseline, typically represented by the protein concentration at time 0.

      Furthermore, while the experimental data exhibits uniformly decreasing CDK4/6 activity, a substantial number of simulations indicate constant CDK4/6cycD, showing a significant qualitative discrepancy between the simulations and experimental findings. This disparity makes it difficult for us to interpret the comparison between the two datasets effectively, given the complexities in comprehending the experimental fold-change figure.

      As Figure 7 serves as the primary validation of model simulations in the manuscript, we believe that the current presentation may not provide a compelling reason to believe that the model accurately captures experimental data. To enhance clarity and validation, we suggest overlaying the experimental data over the simulations or considering the median and 10/90% percentile of the experimental data, which may potentially offer improved readability and facilitate a more robust interpretation of the comparison.

      The experimental data from Yang et al. (ref 55, main text) measures kinase activity using a nucleus-to-cytoplasm translocation reporter system, wherein a bait protein is phosphorylated by the target kinase causing it to translocate from the nucleus to the cytoplasm. Hence, the y-axis represents the ratio of nuclear vs. cytoplasmic fluorescence, not a fold-change from a t=0 baseline. The variation in the starting value (between 1 and 1.8) reflects the inherent heterogeneity in the reporter's localization across individual cells even before the drug is added. We have updated the y-axis label and revised Fig. 7’s legend to state this explicitly.

      The most likely explanation for the discrepancy between experimental dynamics and our simulation dynamics is that the experimental data comes from an isogenic cell line that is largely sensitive to CDK4/6 inhibition. Our simulations are derived from a very wide parameter sweep, where the intent is to represent all possible cell states. It is quite striking that there is such a high correlation between the experimental data and simulations, indicating that perhaps the heterogeneity of even isogenic cell lines is significantly greater than might be intuited; a point we now mention in the revised Discussion (page 17: lines 716-727).

      It is worth noting again, that our analysis is intentionally constructed to be as heterogeneous as possible, and is not trained on any biological data that might otherwise constrain the output-behaviour space. The isogenic cell line almost certainly represents a much more constrained output-behaviour space than our analysis.

      The y-axis label has also been updated accordingly. As mentioned in (12) this result is intended as a qualitative validation, showing that cell lines indeed have highly variable signalling dynamics. Given the range of parameters tested, we think it is surprising that the degree of agreement between the experiment and our analysis is as high as it is. Again, we believe this suggests that heterogeneity may be more prevalent than is intuited. We do not believe we have made any strong quantitative claims in the main text, and we certainly aim to work towards biological, quantitative validation in the future. Finally, we altered the wording of the results heading (page 14: line 562) to make it clear that we are only making qualitative claims and removed the claim that the evidence was strong.

      With these clarifications and corrections, we believe the validation is now much more compelling. The key point is not a perfect quantitative match, but the strong similarity in the distribution of heterogeneous behaviours.

      (14) The authors mention simulating treatment with 10nM of CDK4/6i or Ei, but specific details on how this treatment is included in the model simulations are not provided. This lack of information makes it challenging to fully evaluate the comparison between model simulations and experimental evidence in Figure 7. It would be highly appreciated if the authors could clarify how the treatment with CDK4/6i or Ei is incorporated into the simulations to facilitate a better understanding and interpretation of the results.

      To clarify, the effects of the inhibitors were incorporated directly into the kinetic rate laws of their respective target reactions.

      CDK4/6 inhibitor (CDK4/6i): This was modelled as an inhibitor of the formation of the active CDK4/6-cyclin D complex. We have now explicitly detailed this in the description for reaction R27 in the "Description of Model Scope and Construction" section of the Supplementary Information.

      Estrogen Receptor inhibitor (Ei): This was modelled as an inhibitor of the estrogen-dependent activation of the Estrogen Receptor. This is now explicitly detailed in the description for reaction R15 in the same supplementary section.

      It is however important to reiterate that our goal in Figure 7 is qualitative, shape-based comparison; therefore, we used a fixed fractional inhibition (reported in Methods) rather than a calibrated IC50/Hill model.

      (15) The authors state strong support for their modelling conclusions based on the literature. However, we still have concerns regarding the validation of the model against CDK2 or CDK4/6 data in Figure 7, as it appears less convincing to us. Furthermore, the authors list known resistance mechanisms that are replicated in their modelling. Nevertheless, we find the conclusion somewhat weakened by Figure S10, where approximately 80% of the nodes are implicated in some form of resistance pathway. This raises questions about the model's selectivity, as many proteins included in the model seem to drive resistance in some manner. In the Supplementary Information, the authors mention excluding or abstracting some protein species from the mitogenic and cell cycle pathways to manage computational resources effectively. This abstraction makes it difficult to determine if the proteins identified as potential drivers of resistance genuinely drive resistance or might represent abstractions of other potential drivers. To enhance the manuscript's clarity and address potential concerns about the model's selectivity and abstraction, we suggest providing more details and discussion in the main text.

      The reviewer's observation that a large number of nodes are implicated in resistance pathways in Figure S10 is correct. However, we argue this is not a weakness of the model's selectivity, but rather a key finding that reflects the biological reality of adaptive resistance. The literature is replete with a wide and growing number of distinct mechanisms of resistance even to a single class of drugs (1,2), which supports the idea that cancer can co-opt a wide variety of network nodes to survive.

      Figure S10 is not a binary map where every implicated node is equal; instead, it is a likelihood map, where the colour and weight of the connections represent how often a particular interaction participates in driving resistance across the theoretical full range of possible network dynamics. The figure shows that while many nodes can contribute to resistance, they do so in a hub-like manner, i.e. small subsets of nodes coordinate to drive resistance. This provides a rationalised, data-driven prioritisation of the most dominant and recurrent resistance strategies. We draw two important conclusions from this work: (1) resistance likely occurs due to resistance hubs, not individual proteins, and (2) the frequency of a resistance hub in an MDN analysis is likely proportional to the frequency of that hub emerging as a resistance mechanism in a population of cells and patients.

      Regarding the issue of abstraction, the reviewer is correct that this is an inherent feature of any tractable systems model. In our case, several species in the mitogenic/cell-cycle pathways are module-level proxies to control model size. The highly implicated "hub" nodes in our model likely represent critical cellular processes that are themselves composed of several individual protein interactions.

      To address these concerns, we have significantly revised the Discussion (page 16: lines 681 – 694) to: (1) frame resistance as a network-level phenomenon; (2) show that our frequency-based ranking is selective, prioritising the most probable, recurrent mechanisms; and (3) clarify that, given model abstraction, our findings implicate critical processes (modules), not just single proteins, as the drivers.

      Overall, these changes do not alter our main conclusions: adaptive resistance is an emergent, network-level property; many routes exist, but a smaller set of nodes/modules consistently carry the largest influence across heterogeneous contexts.

      (16) We consider that the figures and legends, including the supplementary information, are inadequately explained. The information provided is insufficient for us to comprehend the figures fully, leading to the need for interpretation on our part as readers. This could potentially introduce biases when trying to understand the claims made by the authors. To improve our understanding, it would be essential for the authors to assign appropriate labels to the figures and provide comprehensive explanations in the legends. For example, in Fig 3, we suggest labelling the tree diagrams in panels A and B, as well as the colour bars. We also recommend applying the same approach to other figures, adding accurate axis labels and descriptions of colour gradients to enhance clarity.

      We thank the reviewer for this critical feedback. To address this comment, the figure legends have been revised where appropriate and greatly expanded to improve their comprehension. Moreover, we have added explicit labels to all previously unlabelled components, such as the cluster dendrograms and colour code bars in Figure 3A, B.

      (17) To enhance readability, we recommend interchanging the order of Figures 1 and 2 in the sequence they appear in the main text. Alternatively, the text can be adjusted to refer to the figures in the correct order. Additionally, attention should be given to the bottom of Fig 1, which appears to be cropped or cut off. Furthermore, the incorrect word spacing in some figure elements, such as Fig. 3A title, Fig. 5B title, and Fig. 6B y-label, should be corrected for improved visual presentation.

      Following the reviewer’s comment, the order of Figures 1 and 2 has been switched to reflect the order in which they are referred to in the main text. These Figures have been re-exported to fix unintentional word spacing errors.

      (18) We recommend that the language used to refer to the initial conditions in the manuscript is clarified and homogenised. Currently, the authors use different terms such as "basal expression," "protein expression," "state variable values," or "initial conditions" to refer to them. This variation in terminology can be confusing for readers. In particular, the use of "basal expression" is problematic, as it typically refers to the leaky value of a reaction in the absence of an inducer, making it another biophysical parameter of the system rather than an initial condition. To enhance clarity and consistency, we suggest the authors decide on a single term to refer to the initial conditions throughout the manuscript and provide a clear explanation of its meaning to avoid any confusion. This will help readers better understand the concept being discussed and prevent any potential misinterpretations.

      We thank the reviewer for this very helpful suggestion. To resolve this and improve clarity, we have homogenized the language throughout the manuscript. We now use the following three terms, each in its specific context:

      We use “protein abundances” exclusively for the conserved total abundances of multi-state species (e.g., Xtot = X + pX + complexes) that are sampled across instances to represent expression heterogeneity.

      We use ‘initial conditions’ to refer to initial values of the state variables in a model simulation. This term is related to protein abundance as the setting of initial conditions for conserved species sets the protein abundance. This is explicitly stated in the text (page 3: lines 87 - 91).

      We use “state variables” to refer to the time-dependent model species.

      We avoid the term “basal expression” in technical descriptions. Where a biology-facing phrase is helpful, we use “protein expression level”. This is used when referring to the biological concept that the initial conditions are intended to represent, i.e. the heterogeneity in protein amounts across a cell population.

      We have performed a thorough search-and-replace to ensure this new convention is applied consistently and have removed the potentially confusing term "basal expression" from the revised manuscript.

      (19) Why are saturable functions (e.g., Michaelis-Menten functions) ignored in the model? What are the potential consequences?

      The main objective of this work was to perform a large-scale, systematic exploration of a high-dimensional parameter space (94 parameters) to map the full repertoire of qualitative dynamic behaviours a network topology can support. Using saturable functions like Michaelis-Menten kinetics would have roughly doubled the number of parameters to be explored (from k to Vmax and Km for each enzymatic reaction), making a parameter sweep of this scale computationally intractable. We therefore prioritised the breadth of the parameter search over the depth of kinetic detail, which we believe is the appropriate choice for a proof-of-concept study focused on heterogeneity.

      This simplification has potential consequences. A major one is that our model cannot capture phenomena that arise specifically from enzyme saturation, such as zero-order kinetics or certain forms of ultrasensitivity (switch-like responses). However, we argue that this is an acceptable trade-off for two main reasons: (1) Our analysis is based on classifying broad, qualitative response shapes (increasing, decreasing, rebound, etc.). Mass-action kinetics are fully capable of generating this rich spectrum of behaviours; and (2) by varying the mass-action rate constants over nine orders of magnitude (from 10<sup>-5</sup> to 10<sup>4</sup>), our parameter sweep effectively samples a vast range of reaction efficiencies. A very low rate-constant can approximate the behaviour of a saturated, low-efficiency enzyme, while a high rate-constant can approximate a highly efficient, non-saturated one. In this way, the broad sweep of the rate parameter partially reflects the effects that would be captured by varying Vmax and Km.
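
      The trade-off can be made concrete by comparing the two rate laws numerically. The sketch below uses illustrative values (Vmax = Km = 1) and an effective mass-action constant k = Vmax / Km, which is the standard low-substrate limit of the Michaelis-Menten expression; none of these numbers come from the model.

```python
def michaelis_menten(s, vmax, km):
    """Saturable rate: approaches vmax as s grows."""
    return vmax * s / (km + s)

def mass_action(s, k):
    """First-order mass-action rate: linear in s, never saturates."""
    return k * s

vmax, km = 1.0, 1.0
k_eff = vmax / km  # low-substrate limit of Michaelis-Menten

for s in (1e-3, 1.0, 1e3):
    print(s, michaelis_menten(s, vmax, km), mass_action(s, k_eff))
```

      Well below Km the two rates nearly coincide, while far above Km the saturable rate plateaus at Vmax and the mass-action rate keeps increasing, which is the regime a single rate constant cannot reproduce.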

      For transparency, we have added a brief rationale to the “ODE model construction, modelling, and simulations” part of the Methods (revised main text, page 4: lines 153-155) and the "Description of Model Scope and Construction" section in the Supplementary file (Supplementary text page 2: lines 63-73).
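
      As a rough illustration of the trade-off discussed in this response, the sketch below compares a Michaelis-Menten rate with a mass-action rate whose constant is set to Vmax/Km, and draws rate constants log-uniformly over the stated range of 10<sup>-5</sup> to 10<sup>4</sup>. The function names, the values of `VMAX` and `KM`, and the log-uniform sampling scheme are assumptions made for illustration; none of them is taken from the authors' model files.

```python
import math
import random


def michaelis_menten_rate(s, vmax, km):
    """Saturable rate law: v = Vmax * S / (Km + S)."""
    return vmax * s / (km + s)


def mass_action_rate(s, k):
    """First-order mass-action rate law: v = k * S."""
    return k * s


def sample_rate_constant(rng, low=1e-5, high=1e4):
    """Draw one rate constant log-uniformly between low and high."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


def relative_error(s, vmax, km):
    """Relative gap between the two rate laws when k = Vmax / Km."""
    exact = michaelis_menten_rate(s, vmax, km)
    approx = mass_action_rate(s, vmax / km)
    return abs(approx - exact) / exact


# Hypothetical enzyme parameters, chosen only for illustration.
VMAX, KM = 2.0, 50.0
err_unsaturated = relative_error(0.5, VMAX, KM)     # substrate far below Km
err_saturated = relative_error(5000.0, VMAX, KM)    # substrate far above Km

# Hypothetical sweep: 1000 log-uniform draws with a fixed seed.
rng = random.Random(0)
samples = [sample_rate_constant(rng) for _ in range(1000)]
```

      Under these assumptions the relative gap equals S/Km, so the two rate laws agree closely when the substrate is far below Km and diverge once the enzyme saturates, which is the regime the response acknowledges the model cannot capture.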

      (20) Given the relevance of the concept of "heterogeneity" in this work, a short discussion about biochemical noise and its implications on the analysis (e.g., why it is not included, and if it will be a next step) would be appreciated.

      Our MDN modelling framework represents heterogeneity by creating an ensemble of deterministic models, where each model instance has a unique set of kinetic parameters and/or initial protein abundances. We propose that this is a powerful way to mechanistically represent the functional consequences of all sources of cellular variation. Over time, the effects of genetic mutations, epigenetic states, and even the time-averaged impact of intrinsic biochemical noise will manifest as changes in the effective interaction strengths and protein concentrations within a cell. Our large-scale parameter/IC sweep is designed to systematically explore the full range of dynamic behaviours that can emerge from this underlying biological variation. Therefore, our approach does not compete with stochastic modelling but is complementary to it. While stochastic simulations can capture the dynamic trajectories of single cells, our framework provides a panoramic view of the entire spectrum of possible stable phenotypes that can emerge at the population level. We agree that modelling intrinsic biochemical noise (stochasticity arising from finite copy numbers), for example with the chemical Langevin equation or the stochastic simulation algorithm (SSA), is a possible extension for future work, although we expect it to be very computationally expensive. We have added a brief discussion of this as a future direction in the revised Discussion.

      (21) We have noticed that the first four paragraphs of the Discussion section overlap with the Introduction, as they mainly reiterate the significance of the study itself rather than focusing on the specific results obtained. To avoid redundancy and provide a more cohesive and informative discussion, we recommend that the authors shift the focus of the Discussion section towards presenting potential interpretations, even if they are not definitive, of the results obtained. By doing so, the Discussion will serve as a valuable platform for deeper analysis and insightful observations, allowing readers to better comprehend the implications and significance of the research findings.

      We thank the reviewer for this structural feedback. Following the reviewer's feedback, we have significantly rewritten and restructured the Discussion section. The redundant introductory material has been removed.

      The rewritten Discussion centres on interpretation and implications, and connects our findings to the literature. It now: (i) frames MDN as a systems-level framework that links molecular heterogeneity to qualitative signalling “meta-dynamics” and adaptive escape under constant drug pressure; (ii) highlights two key findings: an asymmetry in control (interaction kinetics exert stronger, more consistent influence than protein abundance) and a topology-driven convergence whereby a vast parameter space funnels into a finite set of recurrent behaviours; (iii) shows that resistance is a network-level property, with many possible routes but a small set of recurrent hubs/modules dominating; and (iv) provides a qualitative alignment with single-cell reporter data while clarifying the intent and limits of that comparison. Moreover, we now explicitly discuss limitations (rate-law simplifications, broad priors, determinism, and modular abstractions) and outline next steps for future research, including data-constrained priors and stochastic extensions.

      We believe this substantial revision has transformed the Discussion into a much more insightful and valuable part of the manuscript that directly addresses the reviewer's concerns.

      (22) The supplemental text file containing the model equations can be a bit challenging to read and understand. It would be greatly beneficial if the authors could consider generating a file using a typesetting program.

      We have now included a typeset list of state variable equations and ODEs, along with the original model files.

      (23) The authors mentioned that some model parameterizations result in negative solutions, which is surprising. Access to the model equations would help understand why this happens and is crucial for researchers who may want to use this approach. Clarifying the model equations' presentation would enhance transparency and aid other researchers in applying this method for similar research questions.

      The reviewer is correct to be surprised by the mention of negative solutions, as negative concentrations are physically impossible. We clarify that these are not a result of any structural flaw in our model's equations but are a well-known, although rare, numerical artifact of floating-point arithmetic in computational solvers.

      Our model is constructed using standard mass-action and first-order kinetics, which structurally guarantee non-negativity. However, when a species' concentration approaches the limits of machine precision (i.e., becomes a very small number extremely close to zero), the ODE solver can, in rare instances, numerically undershoot zero, resulting in a small negative value. If this occurs, it can lead to instability in subsequent integration steps.

      This is not a biological phenomenon but a computational one. Therefore, the standard and appropriate procedure, which we follow, is to implement a filter that discards any simulation trajectory where such a numerical instability occurs.
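
      The kind of filter described in this response can be sketched as follows. This is a toy example, not the authors' solver or their actual filter: it integrates first-order decay, whose exact solution never becomes negative, with a forward Euler scheme, and uses a deliberately oversized step as a stand-in for numerical undershoot. The function names and the tolerance argument are hypothetical.

```python
def euler_decay(x0, k, dt, n_steps):
    """Integrate dx/dt = -k * x with the explicit (forward) Euler method."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - dt * k * xs[-1])
    return xs


def has_negative(trajectory, tol=0.0):
    """Return True if any value falls below -tol (numerical undershoot)."""
    return any(x < -tol for x in trajectory)


def discard_unphysical(trajectories, tol=0.0):
    """Keep only trajectories that never go below -tol."""
    return [t for t in trajectories if not has_negative(t, tol)]


# Same model, two step sizes: the second is too coarse and undershoots zero.
well_resolved = euler_decay(1.0, k=1.0, dt=0.1, n_steps=50)
under_resolved = euler_decay(1.0, k=1.0, dt=1.5, n_steps=50)
kept = discard_unphysical([well_resolved, under_resolved])
```

      Because the exact solution stays positive, any negative value in this toy case is purely numerical, which mirrors the distinction the response draws between a structural flaw and a solver artifact.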

      (24) The reference listed for the CDK4/6 and CDK2 measurements is Yang et al. [55] in the figure caption, but as Xe et al. in lines 559-561 of the manuscript.

      The text has been updated to match the citation.

      (25) We suggest that the authors revise and cite a previous study conducted by Yamada et al. (Scientific Reports, 2018), which presents an approach to expressing cell heterogeneity as a probability distribution of model parameters.

      Following this suggestion, we have revised the Discussion (see response to comment (21)) to include and discuss Yamada et al. (Scientific Reports, 2018), which models cell heterogeneity as a probability distribution over parameter values.

      (26) In the manuscript, on line 677, the authors state, "This indicates that there is an upper limit to the degree to which parameter sets can influence the qualitative shape of a protein's dynamic within a given network topology." We wish to highlight that this finding may not be particularly surprising. Given that the parameters were randomly determined within a specific range, it is understandable that altering the number of parameter samples would not substantially impact the distribution of model instances.

      We thank the reviewer for this insightful comment, which allows us to clarify the significance of this finding. While it is true that any sampling from a fixed distribution will eventually converge statistically, our conclusion is not about statistics but about the intrinsic, constraining properties of the network's topology. The novelty is not that the distribution converges, but that it converges to a surprisingly limited and finite repertoire of qualitative dynamic behaviours. A complex, non-linear network with nearly 100 free parameters could theoretically generate an almost endless variety of complex dynamics. Our finding is that this specific biological topology acts as a powerful filter, robustly channelling the vast majority of the near-infinite parameter combinations into a small, recurring set of functional outputs (increasing, decreasing, rebound, etc.).

      The reason for this finite limit is mechanistic, as the reviewer's comment prompted us to investigate further. Our parameter sweep already covers an extremely wide, 9-order-of-magnitude range. As we pushed parameter values to even greater extremes in exploratory simulations, we found they do not generate novel, complex dynamic shapes. Instead, they tend to drive network nodes into saturated states, either permanently "on" (maximally activated) or permanently "off" (minimally activated). In both cases, the node becomes unresponsive to upstream perturbations.

      Therefore, further expanding the parameter range would be unlikely to uncover new behavioural categories; it would simply increase the proportion of model instances classified as "no-response." This demonstrates a fundamental principle: the network topology itself enforces an upper limit on its dynamic complexity. We think this inherent robustness is what allows for reliable cellular signalling in the face of constant biological variation. We believe this is a non-trivial finding, and we have revised the Discussion (page 16: lines 664 - 680) to state this conclusion and its implications more clearly.

    1. eLife Assessment

      This study represents an important contribution to the study of decision-making under risk, bringing an interdisciplinary approach spanning economic theory, behavioral neuroscience, and computational modeling to test how choice preference is influenced by rare and extreme events. The authors aim to test whether rats are indeed sensitive to these rare and extreme events despite their infrequent occurrence, and to isolate behavioral evidence for avoidance of "Black Swans" - rare and extreme losses. The evidence for specific sensitivity to rare and extreme events, however, remains incomplete, owing in part to the difficulty of isolating the effect of these events beyond that arising from risk preferences more generally, in both the task design and the computational modeling of the choice behavior. Despite this, and given that the approach brings a relatively novel and highly interdisciplinary perspective, this paper will be of broad interest to those seeking to understand animal behavior through the lens of economic choice and decision theory.

    2. Reviewer #2 (Public review):

      Summary:

      This paper attempts to examine how rare, extreme events impact decision-making in rats. The paper used an extensive behavioural study with rats to evaluate how the probability and magnitude of outcomes impact preference. The paper, however, provides limited evidence for the conclusions because the design did not allow for the isolation of the rare, extreme events in choice. There are many confounding factors, including the outcome variance and the presence of less-rare and less-extreme outcomes in the same conditions.

      Strengths:

      (1) The major strength of the paper is the significant volume of behavioural data with a reasonable sample size of 20 rats.

      (2) The paper attempts to examine losses with rats (a notoriously tricky problem with non-human animals) by substituting time-outs as a proxy for losses. This allows for mixed gambles that have both gain and loss possible outcomes.

      (3) The paper integrates both a behavioural and a modelling approach to get at the factors that drive decision-making.

      (4) The paper takes seriously the question of what it means for an event to be rare, pushing to less frequent outcomes than usually used with non-human animals.

      Weaknesses:

      (1) The primary issue with this work is that the main experimental manipulation fails to isolate the rare, extreme events in choice. As I understand the task, in all the conditions with a rare extreme event (e.g., 80 pellets with probability epsilon), there is also a less-rare, less-extreme event (e.g., 12 pellets with probability 5). In addition, the variance differs between the two conditions. So, any impact attributable to the rare, extreme event could be due to the less rare event or due to the difference in the variance (or other statistical moments, like skew or kurtosis). That the distributions can be shown to differ for value-maximizing agents under specific assumptions (e.g., with Jensen Gaps and Table 2) is not really relevant to what rats are sensitive to and what drives their behaviour. The design here does not support the conclusions. Finally, by deliberately confounding rarity and extremity, the design does not allow for assessing the impact of either aspect on rat behaviour.

      (2) The RL modelling work also fails to show a specific impact of the rare extreme event. As best as I can understand Eq 2, the model provides a free parameter that adds a bonus to the value of either the two options with high-variance gains (A and V in the paper) or the two options with high-variance losses (F and V in the paper). Or, equivalently, to the ones with "Jackpots" vs the ones with "Black Swans" (see Point 1 above as to how these different aspects are all confounded in this design). This parameter seems to depend only on whether the option could possibly have yielded the rare, extreme outcome (i.e., based on the generative probability) and was not connected to its actual appearance. [This point is unclear as the text says this, but the rebuttal states otherwise; plus some options never received the REE, see Table S11]. That makes it a free parameter that just bumps up (or down) the probability of selecting a pair of options. That may be due to the presence of the REE or the other rare event or just the variance difference. Moreover, in the case of the "black swan" or high-variance loss conditions, this seems very much like a loss aversion parameter, but an additive one instead of a multiplicative one. Is there a theoretical claim here that "extreme losses" need an additive loss-aversion parameter?

      (3) The paper presented the methods and results with lots of neologisms and fairly obscure jargon (e.g., fragility, total REE sensitivity). That makes it very hard to decipher exactly what was done and what was found. For example, on p. 4, the use of concave and convex was very hard to decipher; the text even has to repeat itself 3 times (i.e., "to repeat" and "in other words") and is still not clear. It would be much clearer (and probably more accurate) to say that the options varied along the variance dimension, separately for gains and losses. Option A was low-variance gains and losses. Option B was low-variance losses and high-variance gains. Option C was high-variance losses and low-variance gains, and Option D was high-variance losses and gains. That tells much more clearly what the animals experienced without the reader having to master a set of new terminologies around fragility and robustness, which brings a set of theoretical assumptions unnecessarily into the description of the experimental design. Alternatively, if the authors are wary of using the term "variance" because other moments of the distribution also differ, they could use "high-value gains" or "high-value losses" or something else which does not obscure the experimental design with jargon. Again, this goes back to point 1 above, whereby the different options differ on so many dimensions (as is made even more apparent in the rebuttal) that the design cannot isolate the impact of the variables of interest.

      (4) Were the probabilities shuffled or truly random (they seem to be fixed sequences, so neither)? What were the experienced probabilities? Given the fixed sequences, these experienced ("ex-post") probabilities could differ tremendously from the scheduled ("ex ante") probabilities. It's quite possible that an animal never experienced the rare, extreme event for a specific option. From Table S11, that is guaranteed to have happened in that 4 animals only ever experienced the "black swan" outcome once. It's even possible (if they only picked a specific option on the 10th/60th choices by chance) that they only ever experienced that rare extreme event. This point still cannot be known given the information provided, which does not break down outcomes by options. The Supplemental in Table S11 only gives overall numbers but does not indicate what the rats experienced for each choice/option, which is what matters here. A simple table that indicates, for each of the 4 options, how often they were selected and how often the animals experienced each of the 6-8 possible outcomes would make it much clearer how closely the experience matched the planned outcomes. In addition, by restricting the rare outcome to either the 10th or 60th activations in a session, these are not random. Did the animals learn this association? The text states that they did not, but no evidence is provided.

      (5) The choice data are generally presented in an overprocessed fashion with a sum and a difference (in both figures and tables). The basic datum (probability/frequency of selecting each of the 4 options) is not provided directly in the main text, even if it can theoretically be inferred from the sum and the difference. The new right side of Table S4 is probably the most valuable piece in terms of explaining what rats did and should be highlighted a lot more. Inspection of that table reveals some interesting (and potentially worrying) results. Most notably, the vast majority of responding happens on the "anti-fragile" and "robust" options, often totalling around 90% of all selections, especially amongst the most common blue rats. Alas, those were the two options that were deliberately assigned to the two most preferred holes in the training phase (see p. 26). Does this reflect genuine preference for reward distributions or does this reflect a spatial hole bias? The assignment strategy makes this impossible to tell apart.

      (6) There is insufficient detail provided on the inferential statistical tests (e.g., no degrees of freedom or effect sizes), and only limited information on exactly what tests were run and how (bootstrapping, but little detail). Without code or data (only summary information is provided in the supplement), this is difficult to evaluate. In addition, the studies seem not to be pre-registered in any way, leaving many research degrees of freedom. Not all studies need to be pre-registered and sometimes discovery of new things requires exploratory work, but preregistration does provide additional safeguards against overemphasizing post-hoc detected patterns, a serious issue in behavioural science. Moreover, this promotes transparency in reporting results and analyses, allowing for a better assessment of the strength of evidence for a claim. For example, here, were any alternative analysis pipelines attempted? Also, there were many sub-groupings of the animals and subsequent comparisons between them which all seemed post-hoc. On what grounds were these divisions made, and were other divisions examined as well?

      (7) On p. 12 (Fig 4), there is an attempt to look at the impact of a rare, extreme event by plotting a measure of preference for the 10 trials before/after the rare, extreme event. In the human literature, the main impact of experiencing a rare, extreme event is what is known as the wavy recency effect (see Plonsky et al. 2015 in Psych Review for example, now cited). What this means is that there tends to be some immediate negative recency (e.g., avoiding a rare gain) followed by positive recency (e.g., chasing the rare gain). Typically, this refers to the specific option that yielded that outcome. First, as the other analyses do, the current analysis combines choice of the option that yielded the rare outcome with choice of other options, so it cannot directly assess the impact of the rare, extreme event on choice. Also, using a 10-trial window would thus obscure any impact of this rare, extreme event. There is mention of the very next trial, but an analysis that looks at the 10-trial time course trial-by-trial could reveal any impact that might be predicted from the human literature.

      (8) As I understood the method (p. 31), the assignment of options to physical locations was not random or counterbalanced, but deliberately biased to have one of the options in the preferred location. This would seem to create a bias towards a particular option and a bias away from the other options, which confounds the preference data in subsequent analyses. Table S4 reinforces this concern, as the vast majority of responses are clustered in the two most preferred options from training.

      (9) Are delays really losses? This is a big assumption. Magnitude and delay are different aspects of experience, which are not necessarily commensurable and can be manipulated independently. And, for the model, how were these delays transformed into outcomes? Eq 1 skips over that. Is there an assumption of linearity? In addition, I was not wholly clear if the delays meant fewer trials in a session or if the delays merely extended the session and meant longer delays until the next choice period.
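
      The per-option breakdown requested in Weakness (4) above could be produced with a tally along the following lines. This is a minimal sketch under stated assumptions: the trial log format (pairs of chosen option and outcome), the option labels, and the example values are hypothetical and do not come from the paper's data.

```python
from collections import Counter, defaultdict


def tally_experienced(trials):
    """Count outcomes per option from (option, outcome) pairs."""
    counts = defaultdict(Counter)
    for option, outcome in trials:
        counts[option][outcome] += 1
    return counts


def experienced_probabilities(counts):
    """Convert per-option outcome counts to ex-post relative frequencies."""
    table = {}
    for option, outcome_counts in counts.items():
        n = sum(outcome_counts.values())
        table[option] = {o: c / n for o, c in outcome_counts.items()}
    return table


# Hypothetical trial log for one animal: (chosen option, outcome in pellets).
trials = [
    ("A", 1), ("A", 1), ("A", 12), ("B", 1),
    ("A", 1), ("B", 80), ("B", 1), ("B", 1),
]
counts = tally_experienced(trials)
ex_post = experienced_probabilities(counts)
```

      Comparing such ex-post frequencies with the scheduled probabilities, option by option and animal by animal, would show directly whether a given rat ever encountered the rare, extreme outcome on the option in question.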

      Other points:

      (1) I think the authors still misunderstand the concept of "hot-stove effects". The idea is that the experience of a very bad outcome can lead to avoiding the situation again (i.e., not sampling that option) and can provide the appearance of oversensitivity to that bad outcome. Here, that might be better thought of as "black-swan avoidance". Imagine that, to the rat, all options are equal in value; then some initial bad luck in encountering the black swan might make the animal avoid that option, even though, with enough experience, it would have turned out to be equal in value.

      (2) I am still not convinced that the Jensen inequalities add to this paper in terms of understanding the rat behaviour. That may be more suited for a different paper about the statistical and mathematical properties of certain generative distributions, but not here given what rats actually choose and experience.

      (3) Providing the data open access is very good. The code, however, should be equally available and not just upon request. Code needs to be available for assessment during peer review and for reproducibility checks. There are substantial enough problems with reproducibility in the field that code availability should be a minimum criterion for publication (see Miske et al., 2026 in Nature for the most recent large-scale evaluation of this problem).

      (4) The paper still somewhat mischaracterizes the literature on rare events, posing it as a series of "exceptions", rather than recognizing that a huge chunk of the literature uses rare events rarer than 10%. Also, there is even existing terminology in that literature for exactly the situation that is being created here: rare treasures (aka jackpots here) and rare disasters (aka Black Swans here).

      (5) Defining the observed behaviour in terms of convexity, instead of stating choices more plainly, obscures what is done/found. This is especially the case here because convex and concave mean different things when applied to gains/losses in terms of whether or not that option can lead to the REE. The use of the terms obscures rather than clarifies and probably is best left for the discussion (and maybe the intro) when mapping from theoretical distributions to the experiment at hand. In the paper, even the bottom of p.5 seems to incorrectly define "Total Sensitivity" as the combined proportion of selecting convex options in either domain, which does not map onto how convex is defined in Fig 1B or elsewhere in the text.

      (6) Fig 1C is baffling. Why are probabilities drawn moving away from the origin? The standard scientific plotting convention is for numbers to grow when moving away from the origin. That would be vastly clearer. Also, the color coding is confusing. Green-red maps onto convex-concave, but that would naturally seem to indicate gains vs losses, not convex vs concave. And why are probabilities growing larger in both directions from the origin? A standard plot of magnitude vs probability would likely communicate the procedure much more sensibly.

      (7) Discussion: I think the main difference between the human situations discussed and this experiment is that humans have not experienced those rare "black swan" outcomes. Rather, they hear about the disasters that are possible and do not incorporate that information, as discussed in the description-experience literature already cited in this paper (though not in that context).

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors investigate the impact of rare and extreme events on rodents' decision-making under risk, in gain and loss contexts. They describe the behavior of 20 rats performing a four-armed bandit task, where probabilistic gains (sugar pellets) and losses (time-out punishments) can - in some arms - incorporate extremely large - but rare - outcomes. They report that most rats are sensitive to rare and extreme outcomes despite their infrequent occurrence, and that this sensitivity is primarily driven by extreme loss events which they try to avoid, rather than extreme gains that they seek to obtain.

      They finally propose a modification of standard reinforcement-learning, which features a specific sensitivity to rare and extreme outcomes and can account for the observed behavior.

      Strengths:

      The manuscript really taps into a surprisingly neglected but very relevant aspect of decision-making: the effect of rare and extreme events (REE). The authors have developed an experimental setup that seemingly allows investigation of this aspect, which is not trivial given the idiosyncratic properties of rare and extreme events.

      The parameters of the experimental setup also seem to be well thought out: basically, in the absence of REE, some options are objectively better than others (because, in expectation, they overall deliver more food, or minimize time-out punishments), but this ordering reverses if REE are taken into account. This allows for a clean test of the integration of REE in the rodent's decision-making model.

      The data is presented and analyzed in a very descriptive but exhaustive and transparent way, down to the description of individual rodent's behavior.

      Weaknesses:

      While the description and analyses of the behavioral patterns are rigorously done under the economic lens of risky decision-making, the authors' interpretation heavily relies on the assumption that rodents have built the correct model of the task during the training. Extensive details are provided about the training procedure, and the observed behavior at the end of the training, but it remains virtually impossible to disambiguate choices due to imperfect learning from choices made due to intrinsic preferences for risk or REE.

      As detailed in Material and Methods, the animals were progressively overtrained following standard behavioral procedures. During this process, they experienced all available options, including both positive and negative REE. We assume that repeated exposure to these REE supported learning, as would be expected for any event occurring throughout such an extended training phase. The rats ultimately displayed an asymmetric pattern of choices: they consistently avoided the Black Swan, indicating that they had learned its negative consequences, yet they did not systematically seek the Jackpot. If their behavior were driven solely by incomplete learning or by an inherent preference for risk or REE, we would expect to see the opposite pattern: systematic Jackpot seeking or inconsistent avoidance of the Black Swan.

      By nature, gains (food pellets) and losses (time-out punishments) are somewhat incommensurable, so attributing the asymmetry to outcome valence is itself open to interpretation. There might be some additional subtleties due to, e.g., satiety that could come from gaining REE (i.e. the delivery of 80 pellets from the Jackpot).

      As described in Material and Methods, we used mouse pellets (20 mg) instead of rat pellets (45 mg) to prevent satiety during Jackpot delivery (80 pellets). We also selected gains (sweet pellets) and losses (delays) that we have successfully used in previous rat decision-making paradigms, such as the rat gambling task (Adams et al., 2017; doi: 10.1523/ENEURO.0094-17) and the loss-chasing task (Breysse et al., 2021; doi: 10.1111/ejn.14895). Notably, if the Jackpot induced satiety, one would expect animals to stop seeking it, yet this was not systematically observed. Nonetheless, we added a sentence to the Discussion on page 18 of the manuscript to acknowledge that we cannot fully exclude the possibility that satiety contributed to the lack of systematic Jackpot Seeking.

      In its current form, the paper is quite hard to digest. This is naturally the case with interdisciplinary work (here mixing economists and neurobiologists). But I am afraid that with the current frame, the paper is going to miss its target, in terms of audience.

      We have rewritten the manuscript entirely, and the English was corrected with the help of ChatGPT. We hope that the paper is now easier to digest.

      The proposed model seems somewhat disconnected from the behavioral patterns: while the model suggests an effect of REE at the decision stage (i.e. with specific decision weights for those rare events), this formalism seems at odds with the observation that REE (notably in the loss domain) has an impact on subsequent behavior (Black Swans tend to reinforce Total Sensitivity to REE), which rather suggests an effect at the learning stage.

      We agree with the referee that this may appear surprising at first glance. However, we would first like to emphasize that the general model allows REE to influence learning—that is, to contribute to the updating of the Q-subvalues. Moreover, even when REE are incorporated only as decision weights, as is the case for most rats, this does not imply that REE are unimportant during learning. In fact, the model assumes that REE are learned once and for all when they first occur during a trial of the corresponding option. Unreported simulation exercises indicate that a more gradual learning of maximal and minimal values would likely yield similar results.

      Second, the Before/After analysis shows that the behavioral response to Black Swans is locally small in terms of both total and one-sided sensitivities. This suggests that such effects are likely too subtle to be captured by this class of models for most rats. We have added this clarification to the revised version (page 17).
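
      To make the distinction between a decision-stage and a learning-stage effect concrete, the sketch below applies an additive bonus to the values of flagged options just before a softmax choice rule, leaving the learned values untouched. This is a generic illustration and not the paper's Eq. 2: the softmax rule, the inverse temperature `beta`, the flag vector, and all numerical values are assumptions.

```python
import math


def softmax(values, beta=1.0):
    """Numerically stable softmax over option values."""
    scaled = [beta * v for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probabilities(q_values, ree_flags, bonus, beta=1.0):
    """Softmax over Q-values plus an additive bonus for flagged options."""
    adjusted = [q + bonus * flag for q, flag in zip(q_values, ree_flags)]
    return softmax(adjusted, beta)


# Hypothetical setup: four options with identical learned Q-values.
q = [1.0, 1.0, 1.0, 1.0]
black_swan_flags = [0, 0, 1, 1]  # options that can yield the extreme loss
p_neutral = choice_probabilities(q, black_swan_flags, bonus=0.0)
p_averse = choice_probabilities(q, black_swan_flags, bonus=-1.0)
```

      In this toy setting a negative bonus lowers the choice probability of both flagged options even though their learned values are unchanged, which is what a purely decision-stage effect would look like.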

      Discussion:

      This study convincingly demonstrates that REEs are processed rather uniquely, which makes sense given their evolutionary relevance. REE has indeed been somewhat neglected in previous research, and this study therefore opens an interesting new front on the fundamental aspects of decision under risk. The authors have devised an original theoretical and empirical framework that will be useful for the community, and the combination of economics analysis and rodent behavior constitutes a thought-provoking ground to think about the nature of risk preferences. The interpretation and mechanistic account of these aspects, as well as their generalizability outside the specific context of this study, remain to be strengthened.

      We have modified the discussion to further insist on the translational aspect of the study and its interest for various populations (page 22). We hope that the generalizability is now strengthened.

      Reviewer #2 (Public Review):

      Summary:

      This paper attempts to examine how rare, extreme events impact decision-making in rats. The paper used an extensive behavioural study with rats to evaluate how the probability and magnitude of outcomes impact preference. The paper, however, provides limited evidence for the conclusions because the design did not allow for the isolation of the rare, extreme events in choice. There are many confounding factors, including the outcome variance and presence of less-rare, and less-extreme outcomes in the same conditions.

      Strengths:

      (1) The major strength of the paper is the significant volume of behavioural data with a reasonable sample size of 20 rats.

      (2) The paper attempts to examine losses with rats (a notoriously tricky problem with non-human animals) by substituting time-outs as a proxy for losses. This allows for mixed gambles that have both gain and loss possible outcomes.

      (3) The paper integrates both a behavioural and a modelling approach to get at the factors that drive decision-making.

      (4) The paper takes seriously the question of what it means for an event to be rare, pushing to less frequent outcomes than usually used with non-human animals.

      Weaknesses:

      (1) The primary issue with this work is that the primary experimental manipulation fails to isolate the rare, extreme events in choice. As I understand the task, in all the conditions with a rare extreme event (e.g., 80 pellets with probability epsilon), there is also a less-rare, less-extreme event (e.g., 12 pellets with probability 5). In addition, the variance differs between the two conditions. So, any impact attributable to the rare, extreme event could be due to the less rare event or due to the difference in the variance. The design does not support the conclusions. Finally, by deliberately confounding rarity and extremity, the design does not allow for assessing the impact of either aspect.

      We agree with the referee that both the REE and the rare (≈10% frequency) but non-extreme outcomes are present in the relevant options. However, the rare but non-extreme reward is not large enough to make the convex option attractive and to shift choice away from the concave option. In other words, unlike REE, these outcomes do not reverse stochastic dominance in our design (as noted in Material and Methods). We have explored modified designs for human subjects in which the rare but non-extreme outcomes are removed. Preliminary results indicate that the behavioral phenotypes observed in rats also emerge in humans under these modified conditions, suggesting that REE are the primary drivers. We have added a statement to the Discussion (page 22) to clarify this point.

      We elaborate further in our response to point (3) below on why analyses based solely on variance are insufficient when dealing with REE. To clarify the role of rare and extreme outcomes in distinguishing convex from concave options, we provide two new columns to Table 2 in the Materials and Methods, in our reply to point (3).

      Finally, although a detailed analysis of rare but non-extreme outcomes lies outside the scope of this paper, the symmetric treatment of extreme and frequent outcomes can be addressed straightforwardly using strong First-Order Stochastic Dominance. Classical decision-theoretic approaches indeed satisfy this property.

      (2) The RL-modelling work also fails to show a specific impact of the rare extreme event. As best as I can understand Eq 2, the model provides a free parameter that adds a bonus to the value of either the two options with high-variance gains (A and V in the paper) or to the two options with high-variance losses (F and V in the paper). This parameter only depends on whether this option could have possibly yielded the rare, extreme outcome (i.e., based on the generative probability) and was not connected to its actual appearance. That makes it a free parameter that just bumps up (or down) the probability of selecting a pair of options. In the case of the "black swan" or high-variance loss conditions, this seems very much like a loss aversion parameter, but an additive one instead of a multiplicative one.

      We agree with the referee that the additional parameters, compared to more standard Q-learning models, specifically capture the fact that some options deliver REE while others do not. In our estimation procedure, these parameters become nonzero as soon as REE are observed for the first time for a given option. Therefore, the first step is to estimate a baseline nested model in which REEs contribute only at the learning stage (i.e., they affect the updating of Q-subvalues), while the additional parameters are constrained to zero. The next step is to compare alternative models against this baseline, allowing REEs to enter through the additional parameters. In this respect, our specification is parsimonious, especially given that very little is known about REEs in computational neuroscience. More structural modeling is certainly a promising direction for future research, and this paper constitutes a first step toward that goal.

      We provide the BIC, in addition to the AIC, to account for the presence of additional parameters in model selection and to ensure that the observed improvement in fit is not merely driven by their inclusion.
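
      The penalty that each criterion places on extra parameters can be made concrete with a minimal sketch. It uses the textbook definitions of AIC and BIC; the log-likelihoods, parameter counts, and number of choices below are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical numbers, for illustration only (not the paper's estimates).
n_choices = 4800
baseline = {"loglik": -1500.0, "k": 4}   # nested model, REE weights fixed at zero
augmented = {"loglik": -1480.0, "k": 6}  # adds two REE decision weights

for name, model in (("baseline", baseline), ("augmented", augmented)):
    print(name,
          aic(model["loglik"], model["k"]),
          round(bic(model["loglik"], model["k"], n_choices), 1))
```

      Because the BIC penalty grows with the logarithm of the number of observations, it is the stricter of the two criteria for any data set with more than about eight observations, so an augmented model preferred under BIC is not preferred merely because it has more parameters.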

      Unlike most of the existing literature, our results extend the notion of loss aversion to extreme losses. The negative decision weight on options yielding the Black Swan can be interpreted as a differential treatment of negative REE, an issue we discuss extensively in the Discussion (page 20).

      (3) The paper presented the methods and results with lots of neologisms and fairly obscure jargon (e.g., fragility, total REE sensitivity). That made it very hard to decipher exactly what was done and what was found. For example, on p. 4, the use of concave and convex was very hard to decipher; the text even has to repeat itself 3 times (i.e., "to repeat" and "in other words") and is still not clear. It would be much clearer (and probably accurate) to say that the options varied along the variance dimension, separately for gains and losses. Option A was low-variance gains and losses. Option B was low-variance losses and high-variance gains. Option C was high-variance losses and low-variance gains, and Option D was high-variance losses and gains. That tells much more clearly what the animals experienced without the reader having to master a set of new terminologies around fragility and robustness, which brings a set of theoretical assumptions unnecessarily into the description of the experimental design. In terms of results, "Black Swan" avoidance is more simply known as risk aversion for losses.

      Because our experimental design focuses on REE, outcomes cannot be summarized only by their variance. This is well known from the large literature on so-called fat-tailed statistical distributions. Unlike the Normal distribution, which is entirely characterized by its expected value and variance, fat-tailed distributions have nonzero excess kurtosis. This implies that a fat-tailed distribution (e.g. exponential) with the same expected value and variance as the Normal differs importantly by possessing extreme values that are much more likely in terms of frequency. To illustrate, if the distribution of pellets was assumed to be Normal with expected value set at 3.89 and variance set at 9.37 as for the convex option, the probability of getting 80 pellets would be about 2 × 10⁻¹⁶, practically zero. In contrast, this probability is smaller than, but close to, 1% in our design.
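
      The effect of a single rare and extreme outcome on the higher moments can be sketched as follows. The lottery below is hypothetical (its outcomes and probabilities are assumptions chosen for illustration, not the options used in the task), and the function names are ours.

```python
import math


def first_four_moments(outcomes, probs):
    """Mean, standard deviation, skewness and (non-excess) kurtosis
    of a discrete distribution given as outcomes and probabilities."""
    mean = sum(p * x for x, p in zip(outcomes, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probs))
    sd = math.sqrt(var)
    skew = sum(p * ((x - mean) / sd) ** 3 for x, p in zip(outcomes, probs))
    kurt = sum(p * ((x - mean) / sd) ** 4 for x, p in zip(outcomes, probs))
    return mean, sd, skew, kurt


def drop_outcome(outcomes, probs, value):
    """Remove one outcome and renormalise the remaining probabilities."""
    kept = [(x, p) for x, p in zip(outcomes, probs) if x != value]
    total = sum(p for _, p in kept)
    return [x for x, _ in kept], [p / total for _, p in kept]


# Hypothetical lottery: frequent small gains, a rare gain, and a rare extreme gain.
outcomes = [1, 3, 12, 80]
probs = [0.445, 0.445, 0.10, 0.01]

with_ree = first_four_moments(outcomes, probs)
without_ree = first_four_moments(*drop_outcome(outcomes, probs, 80))
print("with REE:   ", [round(m, 2) for m in with_ree])
print("without REE:", [round(m, 2) for m in without_ree])
```

      In this illustrative lottery, removing the extreme outcome lowers the mean only moderately but shrinks the standard deviation, skewness and kurtosis sharply, which is the kind of comparison that a variance-only summary cannot capture.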

      In Material and Methods, we clearly explain how our novel approach in terms of convexity relates to the moments of the reward distributions, including but not limited to the variance. To clarify further, we provide two new tables (Author response table 2 and Author response table 3) to be compared to Table 2 of the manuscript, in which we report the first four moments (mean, standard deviation, skewness and kurtosis) of the full concave and convex gain distributions, reproduced for convenience as Author response table 1.

      Author response table 1.

      In Author response table 2 we report the first four moments when REE are truncated. Comparing convex and concave gains shows that the convex option has a smaller but still close mean compared to the concave option. In contrast, the former has larger variance, skewness and kurtosis compared to the latter. Therefore, interpreting choosing the convex option as reflecting “preference” for variance is at best incomplete.

      Author response table 2.

      First four moments of concave and convex gains when REE are removed

      Author response table 2 further shows that REE alone goes a long way towards explaining the differences between convex and concave options in terms of the first four moments: removing the rare and extreme value results in the concave option having now a larger mean, while the convex option still has larger variance, skewness, and kurtosis but by a smaller margin.

      In Author response table 3 we report the first four moments when both RE and REE are truncated, which shows that the convex and concave options differ only with respect to their mean (which is here also larger for concave).

      Author response table 3.

      First four moments of concave and convex gains when both RE and REE are removed

      In addition, our focus on REE implies that we go beyond mean-variance preferences that apply mostly to Gaussian distributions. It is not clear theoretically what type of utility functions would reflect preferences that combine a taste for variance, skewness and kurtosis, even though all those moments affect expected utility. See for example Phelps, C.E. “A user’s guide to economic utility functions”. J Risk Uncertain 69, 235–280 (2024) for a recent overview (on page 242, Phelps states that “In situations where risk is not normally distributed, it is ill-advised to ignore statistical parameters beyond variance, unless the deviations from normality are relatively small”).

      More importantly, our proposed measure of the convexity of the reward distributions, the Jensen gap, further reveals how even restricting the analysis to the first four moments is incomplete in the sense that it fails to characterize the difference between options: the fifth moment of the concave option contributes more to the Jensen gap than even kurtosis, while one needs to look at much higher moments to find significant contributions to the Jensen gap for the convex option. In that sense, there is no reason to restrict the analysis to variance, and even to skewness and kurtosis, to compare options, in general and in our particular setup as well. Note that introducing REE would result in convex distributions even in simplified designs, e.g. with 3-value support. Studying REE implies the need to look beyond variance, and our proposal is to use the Jensen gap as a measure of convexity. In the Material and Methods section of the paper, we did not develop an in-depth analysis of the Jensen gap so as to spare the reader confronted with an already rather technical paper.
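
      For readers unfamiliar with the notion, the following minimal sketch uses the generic textbook definition of the Jensen gap, E[f(X)] − f(E[X]) for a convex function f, and one possible way of reading moment "contributions", namely the Taylor terms of the gap. The convex function exp(a·x), the value of a, and the lottery are assumptions made for illustration; the exact definition and decomposition used in the paper are not reproduced here.

```python
import math


def expectation(outcomes, probs, f=lambda x: x):
    """Expected value of f(X) for a discrete distribution."""
    return sum(p * f(x) for x, p in zip(outcomes, probs))


def jensen_gap(outcomes, probs, f):
    """E[f(X)] - f(E[X]); non-negative whenever f is convex."""
    return expectation(outcomes, probs, f) - f(expectation(outcomes, probs))


def exp_gap_terms(outcomes, probs, a, max_order):
    """Taylor terms of the gap for f(x) = exp(a x):
    term_k = a^k exp(a mu) / k! * E[(X - mu)^k], for k = 2..max_order."""
    mu = expectation(outcomes, probs)
    terms = {}
    for k in range(2, max_order + 1):
        m_k = expectation(outcomes, probs, lambda x: (x - mu) ** k)
        terms[k] = a ** k * math.exp(a * mu) / math.factorial(k) * m_k
    return terms


# Hypothetical lottery and convex function, for illustration only.
outcomes = [1, 3, 12, 80]
probs = [0.445, 0.445, 0.10, 0.01]
a = 0.1

gap = jensen_gap(outcomes, probs, lambda x: math.exp(a * x))
terms = exp_gap_terms(outcomes, probs, a, max_order=60)
share_first_four = sum(terms[k] for k in (2, 3, 4)) / gap
print(round(gap, 3), round(share_first_four, 3))
```

      With f(x) = x², the gap reduces to the variance, which is why a variance-based description is a special case. With a function that has non-vanishing higher derivatives, the terms of order five and above can carry a large share of the gap when the distribution contains a rare and extreme outcome.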

      We thank the referee for raising the issue of whether variance is a simpler explanation of our results. To keep the main text as short as possible, we chose to refrain from adding technical complexity. We hope we made clear in our reply that the analysis cannot be restricted to variance when studying REE. We believe that the Jensen gap is a useful notion in this regard. As our replies will be made publicly available, we chose not to integrate the above discussion in the main text.

      (4) Were the probabilities shuffled or truly random (seem to be fixed sequences, so neither)? What were the experienced probabilities? Given the fixed sequences, these experienced ("ex-post") probabilities could differ tremendously from the scheduled ("ex ante") probabilities. It's quite possible that an animal never experienced the rare, extreme event for a specific option. It's even possible (if they only picked it on the 10th/60th choices by chance), that they only ever experienced that rare extreme event. This cannot be known given the information provided. The Supplemental info on p.55 only gives gross overall numbers but does not indicate what the rats experienced for each choice/option, which is what matters here. A simple table that indicates for each of the 4 options, how often they were selected, and how often the animals experienced each of the 6-8 possible outcomes would make it much clearer how closely the experience matched the planned outcomes. In addition, by restricting the rare outcome to either the 10th or 60th activations in a session, these are not random. Did the animals learn this association?

      Probabilities are not random and a limited number of fixed sequences has been used, as stated in Material and Methods. We have chosen sequences that satisfy our assumptions about ex-post stochastic dominance reversal of convex over concave options when REE are added. We have added in Table S4 the choice frequencies for all four options. If the animals had learnt the 10th and 60th activation, they would exhibit a strategy in their choice that would tend to be more optimized than what is observed. For example, the options offering the possibility to obtain the Jackpot are not optimal in terms of gains for the frequent events, therefore the animals should tend to select these options only around the 10th and 60th choice. Most of their other choices should favor the options delivering the larger gains in the frequent domain. This is not what is observed. We have added this important point in the discussion (page 18).
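
      The distinction between scheduled and experienced probabilities can be illustrated with a minimal sketch that tabulates, for each option, how often it was chosen and how often each outcome was actually received. The trial log format (a list of option and outcome pairs) and the option labels are hypothetical and do not correspond to the structure of the published data set.

```python
from collections import Counter, defaultdict


def experienced_frequencies(trials):
    """For each option, return (number of choices, {outcome: relative frequency}).

    `trials` is a list of (option, outcome) pairs, one per choice made.
    """
    counts = defaultdict(Counter)
    for option, outcome in trials:
        counts[option][outcome] += 1
    table = {}
    for option, outcome_counts in counts.items():
        n = sum(outcome_counts.values())
        table[option] = (n, {o: c / n for o, c in outcome_counts.items()})
    return table


# Hypothetical log for one animal (illustration only).
log = [("A", 1), ("A", 1), ("A", 80), ("B", 3), ("A", 1), ("B", 3)]
for option, (n, freqs) in sorted(experienced_frequencies(log).items()):
    print(option, n, freqs)
```

      Comparing such a table with the scheduled probabilities of each option shows directly whether a given animal ever encountered the rare and extreme outcome of an option, and how far its experience departed from the design.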

      (5) The choice data are only presented in an overprocessed fashion with a sum and a difference (in both figures and tables). The basic datum (probability/frequency of selecting each of the 4 options) is not provided directly, even if it can theoretically be inferred from the sum and the difference. To understand what the rats actually do, we first need to see how often they select each option, without these transformations.

      As described in Material and Methods, the 4 options are combinations of 2 convex and concave sub-options for gains and losses, which is why our analysis of the behavioral data focuses on convexity-related total and one-sided sensitivities to REE. The third dimension needed to fully characterize rats’ behavior is simply 1 − f_F, the fraction of non-Fragile choices. In addition, we also provide in Table S4 of the Supplementary Material an alternative interpretation in terms of Black Swan Avoidance and Jackpot Seeking. We have added in Table S4 the choice frequencies for all four options. Finally, all the raw data will be made available with open access and no access codes.

      (6) There is insufficient detail provided on the inferential statistical tests (e.g., no degrees of freedom or effect sizes), and only limited information on exactly what tests were run and how (bootstrapping, but little detail). Without code or data (only summary information is provided in the supplement), this is difficult to evaluate. In addition, the studies seem not to be pre-registered in any way, leaving many researcher degrees of freedom. Were any alternative analysis pipelines attempted? Similarly, there were many sub-groupings of the animals, and then comparisons between them - were these post-hoc?

      We understand the referee's concern about pre-registration, which is intended as an epistemic safeguard that makes empirical claims more falsifiable, more transparent, and less dependent on post hoc rationalization. However, although the contemporary push for preregistration is often presented as an “epistemic improvement,” in practice it functions largely as a norm of moral regulation, not a scientific necessity. The rhetoric is moralistic: preregistered research is “clean,” “transparent,” “credible,” while non-preregistered work is viewed with suspicion, even when the methodology is sound. This language is not epistemologically neutral; it enforces what ought to be done, irrespective of the diversity of legitimate scientific practices.

      From a philosophy of science perspective, this is historically and conceptually problematic. Scientific progress has never followed a uniform, rule-based method. As e.g. Feyerabend has argued, major discoveries have emerged precisely because researchers were not bound by predetermined plans: they followed anomalies, improvised, reinterpreted data, and revised methods and hypotheses in light of new evidence — practices that a rigid preregistration ethos can suppress and that are not aligned with how genuine discovery often occurs.

      Even from a statistical standpoint, preregistration is far from a panacea. It reduces some degrees of freedom (mainly in confirmatory statistics), but it does not eliminate flexibility; researchers can still choose models, transformations, exclusion rules, stopping rules, etc. And more importantly: reducing flexibility is not inherently epistemically virtuous. Flexibility is often necessary to understand data properly—especially in new paradigms or first-of-their-kind experiments, which is the case for this study. Science needs exploration, opportunism, and theoretical plasticity. Preregistration is compatible with these only if it is treated as one optional tool among many—not as a universal evaluative standard.

      As the referee pointed out, this study “taps into a surprisingly neglected but very relevant aspect of decision-making.” Our work is therefore mainly exploratory: the experimental paradigm reveals new behavioral patterns in how rats cope with rare and extreme events, and much of our analysis is necessarily descriptive. We conduct formal inference only where it is methodologically appropriate: the short-term behavioral response to rare events (for which we now provide more details in the Material and Methods section, p. 35) and the estimation of augmented Q-learning models, which follow a standard econometric approach (documented in the Material and Methods section; see also our response to recommendation 4). These inferential results support the descriptive patterns that motivate this new line of research.

      (7) On p. 17, there is an attempt to look at the impact of a rare, extreme event by plotting a measure of preference for the 10 trials before/after the rare, extreme event. In the human literature, the main impact of experiencing a rare, extreme event is what is known as the wavy recency effect (See Plonsky et al. 2015 in Psych Review for example). What this means is that there tends to be some immediate negative recency (e.g., avoiding a rare gain) followed by positive recency (e.g., chasing the rare gain). Using a 10-trial window would thus obscure any impact of this rare, extreme event. An analysis that looks at a time course trial-by-trial could reveal any impact.

      We thank the referee for drawing our attention to the wavy recency effect documented in human experiments. We have added the corresponding reference in the Discussion (page 20). Regarding rats, the Before/After analysis reported in the paper suggests that there is no sizeable immediate recency effect for Jackpots. Even for Black Swans, the immediate recency effect we report remains modest when using a 10-trial window, and the analysis of the choice immediately following a REE does not show evidence of immediate negative recency. This casts doubt on the presence of such an effect in rats.
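
      A trial-by-trial time course of the kind suggested by the referee can be sketched as an event-locked average: for each lag before and after a REE, compute the fraction of trials on which the option that delivered the REE was chosen. The data layout below (a 0/1 choice indicator per trial and a list of REE trial indices) is an assumption made for illustration, not the format of the authors' data or their Before/After analysis.

```python
def event_locked_choice_rate(chose_option, event_trials, window):
    """Fraction of trials on which the option was chosen, at each lag
    from -window to +window around the event trials (lag 0 excluded).

    chose_option: list of 0/1 indicators, one per trial.
    event_trials: indices of trials on which the REE was delivered.
    Lags that fall outside the session are ignored; a lag with no
    valid trial maps to None.
    """
    rates = {}
    for lag in range(-window, window + 1):
        if lag == 0:
            continue
        values = [chose_option[t + lag] for t in event_trials
                  if 0 <= t + lag < len(chose_option)]
        rates[lag] = sum(values) / len(values) if values else None
    return rates


# Hypothetical session (illustration only): REE delivered on trial index 3.
choices = [1, 0, 1, 1, 0, 0, 1]
print(event_locked_choice_rate(choices, event_trials=[3], window=2))
```

      A wavy recency pattern would appear as a dip at the first positive lags followed by a rise at later lags, a shape that an average over a fixed 10-trial window could flatten.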

      (8) As I understood the method (p. 31), the assignment of options to physical locations was not random or counterbalanced, but deliberately biased to have one of the options in the preferred location. This would seem to create a bias towards a particular option and a bias away from the other options, which confounds the preference data in subsequent analyses.

      We agree that the design incorporated an intentional bias toward the anti-fragile option as a proof of concept. Nevertheless, Figure 8 demonstrates that animals substantially altered their choices between training and final testing, with a median change of approximately 35% across sessions. This indicates that behavior was driven by the structure of possible outcomes rather than by a stereotyped location-based preference.

      (9) Are delays really losses? This is a big assumption. Magnitude and delay are different aspects of experience, which are not necessarily commensurable and can be manipulated independently. And, for the model, how were these delays transformed into outcomes for the model? Eq 1 skips over that. Is there an assumption of linearity? In addition, I was not wholly clear if the delays meant fewer trials in a session or if the delays merely extended the session and meant longer delays until the next choice period.

      Consistent with established rodent decision-making paradigms (Adams et al., 2017 doi: 10.1523/ENEURO.0094-17; Breysse et al., 2021 doi: 10.1111/ejn.14895), we employed sweet pellets as gains and imposed delays as losses. Delays are operationalized as losses because they preclude the animal from engaging in reward-generating behavior; thus, increasing the delay duration proportionally increases the magnitude of the opportunity cost.

      (10) The paper does not sufficiently accurately represent the existing literature on human risky decision-making (with and without rare events). Here are a few examples of misrepresented and/or missing literature:

      Most studies on decision-making do not only rely on p > 10% (as per p. 2). Maybe that is true with animals, but not a fair statement generally. Some do, and some don't. There is substantial literature looking at rarer events in both descriptions (most famously with Kahneman & Tversky's work), but also in experience (which is alluded to in reference 19). That reference is not only about the situation when choices are not repeated (e.g. the sampling paradigm), but also partial feedback and full-feedback situations.

      We have corrected that statement in the main text (page 3) and we thank the referee for pointing this out.

      The literature on learning from rewarding experiences in humans is obliquely referenced but not really incorporated. In short, there are two main findings - firstly people underweight rare events in experience; second, people overweight extreme outcomes in experience (both contrary to description). Some related papers are cited, but their content is not used or incorporated into the logic of the manuscript.

      One recent study systematically examined rarity and extremity in human risky decision-making, which seems very relevant here: Mason et al. (2024). Rare and extreme outcomes in risky choice. Psychonomic Bulletin & Review, 31, 1301-1308.

      There is a fair bit of research on the human perception of the risk of rare events (including from experience) and important events like climate. One notable paper is Newell et al (2015) in Nature Climate Change.

      We agree with the referee that the related literature on REE in animal decision-making is scant and that it is more developed in humans. We thank the referee for pointing to Mason et al. (2024), who clarify where the literature on humans stands and why combining rarity and extremity, as we also do, is important and highly relevant. We have added a new statement and references in the Introduction and Discussion (pages 3, 20, 22).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) As said above, I think the manuscript would really benefit from a rewriting, to replace some technical terms with more readable ones, and maybe rebalance the focus from the current focus on the framework (heavily loaded with economics concepts, which will be hard to digest for the eLife readership) to a higher weight on information that is critical to understand and interpret the behavior (e.g. information about training & training behavior, etc.).

      We have revised the entire manuscript to improve readability and have clarified in the main text: (1) why convexity of exposures to REE could, beyond variance, be useful for experiments in other settings than our own; (2) why the associated notion of antifragility may be applicable to other settings and therefore of broader interest; (3) what was done in the training sessions compared to the final sessions.

      (2) From Figure 8, it seems that rodent behavior is more clustered after the training (i.e. before the sessions) than after the sessions. Could that be a sign of imperfect learning?

      Figure 8 mostly suggests that there is some flexibility in the choices made and that the intended initial bias towards the antifragile choice in the design of the task could be overridden by the rats.

      (3) The modelling section seems incomplete. I think the authors want to tease apart where REE enters the model and should propose an alternative where REE affects the learning rather than the decision.

      In fact, the general model allows REE to have an effect at the learning stage only (i.e. to contribute to the updating of the Q-subvalues), when the specific decision weights attached to options delivering REE are both zero. However, our analysis shows that such a model is rejected by the behavioral data for all rats. We have clarified this point in the revised version.

      (4) Also, parameter and model recovery exercises seem mandatory (Wilson & Collins, 2019).

      We thank the referee for highlighting this valuable reference in computational modeling, particularly in the context of model identification and estimation in computational biology. In the present research, we adopted an econometric perspective on model identification—especially with regard to the integration of Q-values for gains and losses. The softmax choice function is formally equivalent to a multinomial logit model, and as is well known in econometrics, identification in such models presents non-trivial challenges. The standard approach in classical Q-learning is to multiply the Q-value by an inverse temperature parameter (also known as a precision parameter in random utility models). When extending the model to include separate Q-values for gains and losses, specifying the model in an identifiable way becomes more complex.

      To address this issue, we considered several alternative model specifications and conducted grid-based estimation of starting parameter values. This approach allowed us to examine the shape of the log-likelihood function and assess whether the parameters are globally identified, rather than only identifiable up to a linear combination. We found that the most parsimonious and empirically identified specification in our experimental paradigm is one in which Q-values for gains and losses are summed, each weighted by distinct decision weights (see our Equation 2 in the paper).

      The inclusion of decision weights for REE for each option (Equation 2) is then structurally equivalent to introducing constant terms in a logit model. The identification of these parameters follows standard econometric results on discrete choice models (e.g., Davidson & MacKinnon, 2003): since we model choices among four options, three free parameters can be estimated, leaving one degree of freedom in the specification. As mentioned in the "Modelling and Statistical Analysis" section, we further guarded against the presence of local maxima by applying a two-step estimation procedure, combining two optimization algorithms with multiple sets of starting values for the baseline model (i.e., the model without decision weights for REE). We also tested the addition of a global optimization method (simulated annealing) but found that it did not significantly improve upon our two-step procedure. This is not surprising, as our preliminary investigation of model identification, based on grid searches over starting parameter values, confirmed that all parameters were identified in our simple specification. Our intuition is that simulated annealing may yield different estimates than gradient-based methods primarily in cases where the model is not theoretically identified, which suggests that the need for such global optimization techniques can be indicative of underlying identification issues in Q-learning models.
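
      The identification argument can be illustrated with a minimal sketch of a softmax (multinomial logit) choice rule in which each option's value is a weighted sum of a gain Q-value and a loss Q-value plus an option-specific constant. The functional form, parameter names and numerical values are assumptions made for illustration; Equation 2 of the paper is not reproduced here and may differ.

```python
import math


def choice_probabilities(q_gain, q_loss, w_gain, w_loss, constants):
    """Softmax over utilities u_i = w_gain*Qg_i + w_loss*Ql_i + c_i."""
    utilities = [w_gain * g + w_loss * l + c
                 for g, l, c in zip(q_gain, q_loss, constants)]
    top = max(utilities)  # subtract the maximum for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


# Hypothetical values for four options (illustration only).
q_gain = [2.0, 3.5, 2.0, 3.5]
q_loss = [-1.0, -1.0, -2.5, -2.5]
constants = [0.0, 0.4, -0.6, -0.2]

p = choice_probabilities(q_gain, q_loss, 1.2, 0.8, constants)

# Adding the same amount to every constant leaves all probabilities unchanged,
# so only differences between constants are identified: with four options,
# at most three constants are free once one is normalised.
shifted = [c + 5.0 for c in constants]
p_shifted = choice_probabilities(q_gain, q_loss, 1.2, 0.8, shifted)
print([round(x, 4) for x in p])
print(max(abs(a - b) for a, b in zip(p, p_shifted)))
```

      The invariance to a common shift is the standard reason why one alternative-specific constant must be normalised in a logit model with alternative-specific constants.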

      Regarding model comparison, we have used penalized information criteria to account for additional parameters. Although we do not report confusion or inversion matrices for our nested models, we verified that the estimated models replicate observed behaviors across all phenotypes, as shown in the main text (see bottom left panel of Figure 5 for the Total and One-Sided sensitivities). Most importantly, we conducted 100 additional simulations of 40 artificial sessions for each phenotype using the “winning” models and the median fitted parameters. These simulated rats—playing the task 100 times over 40 sessions—offer strong evidence that the selected models are valid: they quantitatively capture the behavior of all phenotypes in terms of our key metrics, Total and One-Sided sensitivities (see bottom right panel of Figure 5).

      Taken together, this methodical econometric approach to model specification and estimation gives us strong confidence in the identification and robustness of our model. Overall, while Wilson & Collins (2019) provide an interesting framework for model estimation in computational biology, we believe that a more formal theoretical analysis of model identification in Q-learning models would be a valuable addition to the field—though it lies beyond the scope of the present work. In our view, computational biologists should complement simulation-based validation and empirical fit with formal methods for assessing theoretical identifiability, particularly when estimating complex choice models.

      Davidson, R. and J.G. MacKinnon (2003) Econometric Theory and Methods. Oxford University Press (New York).

      Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. https://doi.org/10.7554/eLife.49547

      Reviewer #2 (Recommendations For The Authors):

      (1) The paper confuses risk sensitivity and exploration in the opening lines. These are not the same.

      What we have in mind here is that uncertainty about outcomes is one of the main drivers of exploration, in the sense that there would be no need to explore in a counterfactual world with deterministic gains and losses. We have modified the opening lines of the paper to better reflect this dimension (page 2).

      (2) p. 9. "awfully long" is an unnecessary descriptor. Descriptions of methods should be more factual.

      The manuscript has been entirely rewritten.

      (3) p. 13. Most points lie on the left of the square (not right?).

      We thank the referee for pointing out this typo, which is now corrected in the text (page 8).

      (4) p. 13. Last line. "obviously" is patronizing to the readers.

      The manuscript has been entirely modified to address related points.

      (5) p. 23. The avoidance of black swans by not choosing that option sounds like a hot-stove effect (see Denrell & March, 2001). Is this evidenced here?

      To the best of our knowledge, the statement that “people tend to avoid activities they have had a negative experience of, resulting in a negativity bias” (from Jerker Denrell’s website) does not explicitly concern REE. Instead, it appears to refer broadly to reinforcement learning mechanisms driven by negative outcomes, irrespective of their magnitude or frequency. In our task, animals encounter both negative rare events (RE) and negative rare and extreme events (REE; Black Swans). Notably, the task design does not allow rats to completely avoid negative RE unless they cease performing the task altogether—a pattern typically seen in paradigms involving aversive stimuli such as electric foot shocks. The fact that all 20 rats maintained stable performance across the 41 sessions provides evidence against a pronounced hot-stove effect. This point has been incorporated into the revised discussion (page 20).

      (6) "menus" is an odd term. Better described as reward schedules?

      “Menu” has been replaced by “option” in the main text.

      (7) Why are they 20-minute sessions? I thought it was 120 trials per session? And 41 sessions? Or was this only in training?

      Each session ended after 20 minutes had elapsed, which led to approximately 120 trials (but not systematically). The choice of 20 minutes was made in order to limit the number of trials and so prevent satiety. The total number of sessions run with all 20 animals for the final testing was 41, an odd number, but there was no justification for removing one session from the analysis. The training was much longer and is not included in the 41 sessions.

      (8) Really not clear why these Jensen inequalities were relevant or even calculated for these options? How is it relevant to what animals chose or experienced? They seem to be based on the generative probabilities for different options, which is not what happened in reality.

      We propose the Jensen gap as a general measure of convexity that relates to all moments of the probability distribution, as described in more detail in our answer to point (3) above. As such, we think it is a characterization of options with stochastic outcomes that could prove useful to other experimenters in alternative settings beyond our own.
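      As a rough illustration of the quantity discussed in this response, the sketch below computes the textbook Jensen gap, E[f(X)] - f(E[X]), for a discrete option. The authors' exact definition is not given in this excerpt, so the function name `jensen_gap`, the squared-outcome transform, and the example outcomes and probabilities are all assumptions chosen for illustration only.

```python
from typing import Callable, Sequence


def jensen_gap(outcomes: Sequence[float],
               probs: Sequence[float],
               f: Callable[[float], float]) -> float:
    """Textbook Jensen gap E[f(X)] - f(E[X]) for a discrete distribution.

    Hypothetical helper: not taken from the manuscript under review.
    """
    if len(outcomes) != len(probs):
        raise ValueError("outcomes and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Expected value of the outcomes, E[X].
    mean = sum(p * x for x, p in zip(outcomes, probs))
    # Expected value of the transformed outcomes, E[f(X)].
    mean_f = sum(p * f(x) for x, p in zip(outcomes, probs))
    return mean_f - f(mean)


# Illustrative two-outcome option (made-up numbers, not the task's schedule).
gap_convex = jensen_gap([0.0, 10.0], [0.5, 0.5], lambda x: x * x)
gap_linear = jensen_gap([0.0, 10.0], [0.5, 0.5], lambda x: 2.0 * x)
```

      For a convex transform the gap is non-negative, and for a linear transform it is zero, which is why it can serve as a summary of how much the spread of outcomes matters under a given transform.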

      (9) Only some summary data in supplemental materials. No open data or code for recreating the experiment or analyzing the data.

      The data are available on GitHub (see page 38), and the code will be available upon request.

    1. eLife Assessment

      Insects can act as vectors of plant diseases, hence the study of insect-pathogen interactions is relevant for agriculture. This important study identifies in Diaphorina citri a dopamine receptor responsive to 'Candidatus Liberibacter asiaticus' infection, demonstrates direct regulation of this receptor by a microRNA, and integrates dopamine signaling into an established insect reproductive hormone framework. Multiple complementary experimental approaches convincingly support the findings, although key conclusions rely on correlative data and the mechanistic evidence for the proposed linear signaling cascade is limited. This work will be of interest for insect physiology and vector-pathogen biology, and more broadly for citrus agriculture.

    2. Reviewer #1 (Public review):

      I read this paper with great interest based on my experience in insect sciences. Previous concerns:

      (1) The paper has an original biological question that is overly broad and mechanistically ambitious. The central biological question, namely how CLas infection enhances fecundity of Diaphorina citri via dopamine signaling, is clearly stated and well motivated by previous literature. However, my advice to the authors is that, while the general question is clear, the manuscript attempts to answer multiple mechanistic layers simultaneously. As a result, I feel that the biological narrative becomes diffuse, especially in later sections where DA, miRNA regulation, AKH signaling, and JH signaling are all proposed as parts of a single linear cascade. In summary, my key concern is that the paper often moves from correlation to causal hierarchy without fully disentangling whether these pathways act sequentially, in parallel, or redundantly. A more explicitly framed primary hypothesis (e.g., "DA-DcDop2 is necessary and sufficient for CLas-induced fecundity") may improve conceptual clarity.

      (2) On the novelty of the data, I feel they are moderately novel, with substantial confirmatory components. If I am correct, the novel contributions include the identification of DcDop2 as the DA receptor responsive to CLas infection in D. citri, the discovery that miR-31a directly targets DcDop2, which is supported by luciferase assays and RIP, and thirdly, the integration of dopamine signaling into the already-described CLas-AKH-JH-fecundity framework. My advice to the authors is to focus more on the manuscript's novelty, which lies more in pathway integration than in discovering fundamentally new biological phenomena. This is appropriate for a mechanistic paper, but should be framed as an extension of existing models rather than a paradigm shift.

      (3) On the conclusions, I recommend that the authors modify their statements a little. I feel that there are some overstated or insufficiently supported claims. For instance, the assertion that CLas "hijacks" the DA-DcDop2-miR-31a-AKH-JH cascade implies direct pathogen manipulation, but no CLas-derived effector or mechanism is identified. Also, the model suggests a linear signaling hierarchy, but the data largely show correlation and partial dependency rather than strict epistasis. Third, the term "mutualistic interaction" may be too strong, as host fitness costs outside fecundity (e.g., longevity, immunity) are not evaluated. In conclusion, I confirm that the data support a functional association, but mechanistic causality and evolutionary interpretation are somewhat overstated.

      Comments on revised version:

      The authors provided a satisfactory revision.

    3. Reviewer #2 (Public review):

      Summary:

      Nian and colleagues comprehensively apply metabolomics, molecular, and genetic approaches to demonstrate that CLas hijacks the DA/DcDop2-miR-31a-AKH-JH signaling cascade to enhance lipid metabolism and fecundity in D. citri, while concurrently promoting its own replication.

      Strengths:

      These findings provide solid evidence of a mutualistic interaction between CLas proliferation and ovarian development in the insect host. This insight significantly advances our understanding of the molecular interplay between plant pathogens and vector insects and offers novel targets and strategies for HLB field management.

      Weaknesses:

      While the article investigates the involvement of dopamine signaling and specific microRNAs in enhancing fecundity and pathogen proliferation, it still needs to provide a detailed mechanistic understanding of these interactions. The precise molecular pathways and feedback mechanisms by which CLas manipulates dopamine signaling in Diaphorina citri remain unclear.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I read this paper with great interest based on my experience in insect sciences. I have some minor comments (and recommendations) that I believe the authors should address.

      (1) The paper has an original biological question that is overly broad and mechanistically ambitious. The central biological question, namely how CLas infection enhances fecundity of Diaphorina citri via dopamine signaling, is clearly stated and well motivated by previous literature. However, my advice to the authors is that, while the general question is clear, the manuscript attempts to answer multiple mechanistic layers simultaneously. As a result, I feel that the biological narrative becomes diffuse, especially in later sections where DA, miRNA regulation, AKH signaling, and JH signaling are all proposed as parts of a single linear cascade. In summary, my key concern is that the paper often moves from correlation to causal hierarchy without fully disentangling whether these pathways act sequentially, in parallel, or redundantly. A more explicitly framed primary hypothesis (e.g., "DA-DcDop2 is necessary and sufficient for CLas-induced fecundity") may improve conceptual clarity.

      We sincerely thank the reviewer for these constructive comments and agree that the initial version of our manuscript attempted to integrate multiple signaling layers, which may have blurred the logical distinction between sequential, parallel, or redundant pathways. To address this concern, we have restructured the narrative to center on a clearly defined hypothesis by changing “DA/DcDop2-miR-31a-AKH-JH signaling cascade” to “DA-DcDop2 signaling axis” in the Abstract (Line 33) of the revised manuscript.

      (2) On the novelty of the data, I feel they are moderately novel, with substantial confirmatory components. If I am correct, the novel contributions include the identification of DcDop2 as the DA receptor responsive to CLas infection in D. citri, the discovery that miR-31a directly targets DcDop2, which is supported by luciferase assays and RIP, and thirdly, the integration of dopamine signaling into the already-described CLas-AKH-JH-fecundity framework. My advice to the authors is to focus more on the manuscript's novelty, which lies more in pathway integration than in discovering fundamentally new biological phenomena. This is appropriate for a mechanistic paper, but should be framed as an extension of existing models rather than a paradigm shift.

      We sincerely thank the reviewer for this thoughtful and highly constructive assessment. We greatly appreciate the clear articulation of what constitutes the novel contributions of our work, and we fully agree with the characterization that the primary novelty lies in pathway integration rather than the discovery of entirely unprecedented biological phenomena. We also accept the valuable advice that our manuscript should be framed as an extension of existing models rather than a paradigm shift. In response to this insightful comment, we have carefully revised the Results part in Line 275-278 of the revised manuscript.

      (3) On the conclusions, I recommend that the authors modify their statements a little. I feel that there are some overstated or insufficiently supported claims. For instance, the assertion that CLas "hijacks" the DA-DcDop2-miR-31a-AKH-JH cascade implies direct pathogen manipulation, but no CLas-derived effector or mechanism is identified. Also, the model suggests a linear signaling hierarchy, but the data largely show correlation and partial dependency rather than strict epistasis. Third, the term "mutualistic interaction" may be too strong, as host fitness costs outside fecundity (e.g., longevity, immunity) are not evaluated. In conclusion, I confirm that the data support a functional association, but mechanistic causality and evolutionary interpretation are somewhat overstated.

      We sincerely thank the reviewer for these insightful comments and agree that there are some overstated or insufficiently supported claims. In response, we have changed "hijacks" to "regulates" (Line 32 and 124), and "mutualistic interaction" to “coevolution” (Line 2, 34, 127, 257, 763, 806, and 842) in our revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      Nian and colleagues comprehensively apply metabolomics, molecular, and genetic approaches to demonstrate that CLas hijacks the DA/DcDop2-miR-31a-AKH-JH signaling cascade to enhance lipid metabolism and fecundity in D. citri, while concurrently promoting its own replication.

      Strengths:

      These findings provide solid evidence of a mutualistic interaction between CLas proliferation and ovarian development in the insect host. This insight significantly advances our understanding of the molecular interplay between plant pathogens and vector insects, and offers novel targets and strategies for HLB field management.

      Weaknesses:

      While the article investigates the involvement of dopamine signaling and specific microRNAs in enhancing fecundity and pathogen proliferation, it still needs to provide a detailed mechanistic understanding of these interactions. The precise molecular pathways and feedback mechanisms by which CLas manipulates dopamine signaling in Diaphorina citri remain unclear.

      These comments are extremely helpful for revising and improving our manuscript.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) In Figures 1C and 1D, please maintain consistent gene nomenclature: change "henna" to "Henna", "TH" to "Th", and "DDC" to "Ddc".

      Thanks for your great suggestion. We have changed "henna" to "Henna", "TH" to "Th", and "DDC" to "Ddc" in Figure 1C and 1D of our revised manuscript.

      (2) In Figure 7, correct "Emergy metabolism" to "Energy metabolism".

      Thanks for your valuable suggestion. We have corrected "Emergy metabolism" to "Energy metabolism" in Figure 7 of our revised manuscript.

      (3) Please specify the number of biological replicates in the figure captions.

      Thanks for your perfect suggestion. We have specified the number of biological replicates in the figure captions of Figure 1 (Line 737-738), Figure 2 (Line 757-759), Figure 3 (Line 780-782), Figure 4 (Line 799-800), Figure 5 (Line 816-819), and Figure 6 (Line 833-836).

      (4) For Figure 2I, 3J, and 5H, clarify that CLas 16s rRNA was detected by FISH. The age of the dissected females should also be described in the captions.

      Thanks for your insightful suggestion. We have added the female age (at 7 DAE) in the captions for Figure 2I (Line 752), 3J (Line 773), and 5H (Line 813) of our revised manuscript.

      (5) A blot is shown in Figure 3B but not discussed in the text. Since the manuscript describes mRNA levels, please specify whether these blots are from Northern or Western blotting and provide relevant methodological details.

      Thanks for your great suggestion. The blot in Figure 3B is a Western blot result. We have added the related descriptions in the Results (Line 202), Materials and Methods (Line 521-536), and figure legend (Line 766) of our revised manuscript.

      (6) In Figure 3G-3K, an "inhibitor" was used, but its name and functional role are not described. Please give more details.

      Thanks for your valuable suggestion. We have added the detailed information for “Dop2 inhibitor” in the Figure 3G-3K legend (Line 772-776) of our revised manuscript.

      (7) In Lines 23-24 of the Abstract, consider revising "their neuroendocrine regulation remains unclear" to "their neuroendocrine regulation mechanisms remain unclear" for grammatical accuracy.

      Thanks for your perfect suggestion. We have revised "their neuroendocrine regulation remains unclear" to "their neuroendocrine regulation mechanisms remain unclear" for grammatical accuracy in Line 24 of our revised manuscript.

      (8) The last sentence of the Abstract is overly long. It is recommended to split it as follows: "These findings reveal a mutualistic interaction between CLas proliferation and ovarian development in the insect host. This discovery enhances our understanding of the molecular interplay between plant pathogens and vector insects and offers novel targets and strategies for HLB field management."

      Thanks for your excellent suggestion. We have split the last sentence of the Abstract as follows: "These findings reveal a coevolution between CLas proliferation and ovarian development in the insect host. This discovery enhances our understanding of the molecular interplay between plant pathogens and vector insects and offers novel targets and strategies for HLB field management." in Line 34-37 of our revised manuscript.

      (9) In Line 139, remove the comma between "female" and "adult".

      Thanks for your great suggestion. We have removed the comma between "female" and "adult" in Line 139 of our revised manuscript.

      (10) In Line 149, replace "d" with day.

      Thanks for your perfect suggestion. We have replaced "d" with "day" in Line 149 of our revised manuscript.

      (11) The JH determination method references a previous study but lacks a detailed description of the extraction procedure. Please include this information in the methodology section.

      Thanks for your valuable suggestion. We have added the detailed description of the JH extraction procedure in Line 511-514 of our revised manuscript.

      (12) In Figure S2, since the panel shows interference efficiencies for four genes, "treated with dsDcAKHR" should be revised to "treated with dsRNA" for accuracy.

      Thanks for your insightful suggestion. We have revised "treated with dsDcAKHR" to "treated with dsRNA" for accuracy in the Figure S2 legend.

      (13) In line 354-355, change "DcVg1-like, DcVgA1-like and DcVgR" to "DcVg1-like, DcVgA1-like, and DcVgR".

      Thanks for your great suggestion. We have changed "DcVg1-like, DcVgA1-like and DcVgR" to "DcVg1-like, DcVgA1-like, and DcVgR" in Line 350 of our revised manuscript.

      (14) The study primarily investigates the role of agomir-31a. Would antagomir-31a promote ovarian development in CLas- females? In addition, did the authors perform a rescue experiment using antagomir-31a in CLas+ females after dsDcDop2 treatment?

      Thanks for your valuable suggestion. The proposed experiments will be instrumental in further elucidating the functional role of miR-31a and represent a key direction for our future research. We will carefully consider and incorporate these approaches in our subsequent study.

      (15) The method used to determine CLas-negative and CLas-positive individuals should be described in more detail in the Materials and Methods section.

      Thanks for your great suggestion. We have added more details about CLas detection in the Materials and Methods section (Line 378) of our revised manuscript.

    1. eLife Assessment

      This fundamental manuscript presents a novel application of the SANDI (Soma and Neurite Density Imaging) model to study microstructural alterations in the basal ganglia of individuals with Huntington's disease (HD). The compelling methods are, to our understanding, the first application of SANDI to neurodegenerative diseases, provide strong evidence for HD-related neurodegeneration in the striatum, account significantly for striatal atrophy, and correlate with motor impairments. The integration of novel diffusion acquisition and modelling methods with multimodal behavioural data are both of high value in their own right, and create a framework for future studies.

    2. Reviewer #1 (Public review):

      (1) In this study, the authors aimed at characterizing Huntington's Disease (HD) - related microstructural abnormalities in the basal ganglia and thalami as revealed using Soma and Neurite Density Imaging (SANDI) indices (apparent soma density, apparent soma size, extracellular water signal fraction, extracellular diffusivity, apparent neurite density, fractional anisotropy and mean diffusivity).

      (2) The study implements a novel biophysical diffusion model that extends up-to-date methodologies and presents a significant potential for quantifying neurodegenerative processes of the grey matter of the human brain in vivo. The authors comment on the usefulness of this technique in other pathologies, but they exemplify only with multiple sclerosis. Further development of this, building evidence should be provided.

      (3) The study found that HD-related neurodegeneration in the striatum accounted significantly for striatal atrophy and correlated with motor impairments. HD was associated with reduced soma density, increased apparent soma size and extracellular signal fraction in the basal ganglia, but not in the thalami. Additionally, these effects were larger at the manifest stage.

      (4) The results of this work demonstrate the impact of HD on basal ganglia and thalami which can be further explored as a non-invasive biomarker of disease progression. Additionally, the study shows that SANDI can be used to explore grey matter microstructure in a variety of neurological conditions.

      Comments on revised version.

      I have no further comments. Thank you.

    3. Reviewer #3 (Public review):

      Summary:

      Ioakeimidis and colleagues studied microstructural abnormalities in N=56 Huntington's disease (HD) patients compared to N=57 normative controls. The authors used a powerful MRI Connectom scanner and applied the SANDI model to estimate the soma size, neurite size, soma density, and extracellular fraction in key subcortical nuclei related to HD. In the striatum, they found decreased soma density and increased soma size, which also seemed to become more pronounced in advanced HD individuals in the final exploratory analyses. The authors conducted important analyses to find whether the SANDI measures correlate with clinical scores (i.e., QMotor) and whether the variance of the striatal volume is explained by the SANDI measures. They found a relationship of SANDI measures to both.

      Strengths:

      The study is both innovative and of high interest for the HD community. The authors provide a rich pool of statistical analyses and results which anticipate the questions that may emerge in the HD research community. Statistics are carefully chosen and image processing is done with state-of-the-art methods and tools. The sample size gives sufficient credibility to the findings. Altogether, I think this study sets a milestone in the attempts of the HD community to understand neuropathological processes with non-invasive methods, and extends the current knowledge of microstructural anomalies identified in HD with diffusion MRI. More importantly, the newly identified anomalies in soma size and soma density open new avenues for studying these biological effects further, and perhaps develop these biomarkers for use in clinical trials.

      Weaknesses:

      (1) An important question is whether the SANDI measures, which require an expensive scanner and elaborate processing, are better biomarkers than the more traditional DTI measures. Can the authors compare the effect size of FA/MD with SANDI measures? In some of the plots and tables, FA/MD seem to have comparable, if not higher, correlations with QMotor or CAP scores. In the same vein, it is unclear whether DTI measures were included in hierarchical stepwise regression. I wonder if the stepwise models may have picked up FA/MD instead of SANDI measures if they are given a chance. Overall, I hope the authors can discuss their findings also in this light of cost vs. benefit of adopting SANDI in future studies, which is an important topic for clinical trials.

      (2) Similar to the above point, it is very important to consider how strong the biomarking signal is from SANDI measures compared to the good old striatal volume. Some plots seem to indicate that volumes still have the highest correlation with QMotor, and highest effect size in group comparisons. It would be helpful for the community to know where the new SANDI measures stand compared to the most typically used volumes in terms of effect size.

      (3) The diffusion measures are inevitably correlated to some degree. Please provide a correlation matrix in supplementary material including all DWI measures to enable readers to understand better how similar SANDI measures are between each other or vs. other DTI measures. Perhaps adding volumes to this correlation matrix may also be a good future reference.

      (4) ISS stages:

      (a) The online ISS calculator requires cut-offs derived from the longitudinal Freesurfer pipeline, while the authors do not have longitudinal data. Thus, the ISS classification might be inaccurate to some degree if the authors used the FS cross-sectional pipeline. Please review this issue and see if updated cut-offs should be used to classify participants.

      (b) Were there really no participants with ISS 0 among the 56 HD individuals? Please clarify in the manuscript.

      (c) A note on terminology that might be confusing to some readers. According to the creators of ISS, the ISS stages are created for research only, they are not used or applied in the clinic. On the other hand, the terms "premanifest" and "manifest" have a clinical meaning, typically based on the diagnostic confidence level. The assignment of ISS0-1 to premanifest and ISS2-3 to manifest may create some non-trivial confusion, if not opposition, in some segments of the HD community. The authors can keep their current terminology but will need to at least clarify to the reader that this assignment is speculative, does not fully match the clinically-based categories, and should not be confused with similarly named groups in the previous literature.

      Comments on revised version.

      The authors have moved to address many points from reviewers. The manuscript has indeed become more objective, transparent, and to the point. The amount of information and analyses is large, which perhaps is inevitable when new methods are being tested for the first time in a neurodegenerative disease.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) The biological and/or mathematical meaning of the Soma and Neurite Density Imaging (SANDI) indices (apparent soma density, apparent soma size, extracellular water signal fraction, extracellular diffusivity, apparent neurite density, fractional anisotropy, and mean diffusivity) should be briefly introduced for those less familiar with this novel technique.

      Further explanations about the biological and mathematical meaning of the SANDI indices were added to the introduction on page 6.

      (2) The study implements a novel biophysical diffusion model that extends up-to-date methodologies and presents a significant potential for quantifying neurodegenerative processes of the grey matter of the human brain in vivo. The authors comment on the usefulness of this technique in other pathologies, but they exemplify it only with multiple sclerosis. Further development of this, building evidence, should be provided.

      Clinical applications of SANDI have primarily focused on MS. However, since preparation of the manuscript, one study has been published reporting reductions in apparent soma density and white and grey matter specific differences in apparent soma size in amyotrophic lateral sclerosis (ALS) (Zeng et al., Eur J Radiol 2025, 10.1016/j.ejrad.2025.111981). These findings accord with the loss of motor neurons and glial responses in ALS. We have added this study to the introduction of SANDI on page 7.

      (3) Why are the basal ganglia compared against thalami? The rationale of this decision is missing.

      The thalami were selected as control regions based on the established trajectory of neurodegeneration in HD, which begins with early loss of medium spiny neurons in the striatum and later extends to surrounding structures, including the putamen and thalamus. Given that most participants in our study were at early disease stages, we assumed the thalami would remain relatively unaffected in this sample. This explanation has been added to the introduction on page 7.

      (4) The use of bullet points is unusual for a scientific paper format.

      Bullet points have been removed throughout the manuscript.

      (5) The authors mention that they eroded the boundaries of the subcortical masks. Providing the details and parameters of this erosion would be beneficial.

      Details of the default parameters of the FSL erode function that was used have been added to the method section on page 13.

      (6) In the conclusion, the authors state that their results will bridge the gap between histopathological findings and in vivo imaging, but it would be helpful if they could briefly explain how they imagine such a bridge (e.g., which kind of comparisons or correlations) and whether there exists any literature in this regard so far.

      We have added the following brief explanation to the conclusion on page 26: “Although conventional MRI lacks the resolution to directly capture histopathology, advanced biophysical models such as SANDI may help bridge this gap by providing biologically interpretable parameters that reflect tissue composition and capture histopathological changes in vivo.”

      (7) The scale is missing in Figure 3.

      The scale has been added to Figure 3.

      (8) In general, the work would benefit from a better organization and potentially a smaller number of figures and tables.

      The manuscript has been re-edited to improve the readability and organization throughout, and the number of figures and tables was reduced by moving some of them to the Supplementary Material (old Tables 2 and 5 are now Supplementary Tables 2 and 3, old Figure 3 is now Supplementary Figure 1).

      Reviewer #2:

      Certain aspects of the study would benefit from clarification:

      (1) Scanner and acquisition consistency: While HD data are from the WAND study, it is not clear whether controls were scanned on the same scanner or protocol. Given the use of model-derived metrics (especially SANDI), differences in scanner or acquisition could introduce confounds. From the text, the HD participants are explicitly said to come from the WAND study (a longitudinal HD cohort). On the other hand, while the HC participants are described as age-matched controls, the paper does not clearly state whether they were scanned in the same study (i.e., WAND), on the same scanner, or with the same acquisition protocol. This ambiguity is potentially problematic, especially since they use model-derived diffusion metrics that can be very sensitive to scanner hardware, gradient strengths, and protocol settings. If the WAND HD data were acquired on a specific scanner (e.g., 3T Connectom) and the HCs were not, then differences in SANDI/DTI metrics might reflect scanner bias, not disease pathology. This is particularly critical in SANDI, which is sensitive to high b-values and SNR. It would strengthen the manuscript to explicitly state whether the HD and control data were acquired using the same scanner model, sequence, and protocol, and ideally at the same site. If this were not the case, the authors should include this as a limitation and discuss any harmonization strategies applied (e.g., ComBat, covariate modeling, etc).

      For harmonization and comparison purposes, HD and control data were acquired using the same strong gradient (300mT/m) 3T Connectom MRI system at CUBRIC with the same acquisition protocols and sequences. It should also be noted that the Connectom scanner has not had any software upgrades that could introduce scanner biases in data acquired at different time points. This is now made explicit on page 8 by stating that all MRI data for all participants were acquired on the same MRI system using the same acquisition protocols, and on page 10 by stating that all HD and HC MRI data included in our analyses were acquired on the same 3T Siemens Connectom scanner at CUBRIC using the same acquisition protocols described in this section.

      Also, although it offers novel and biologically informative markers, widespread clinical translation still faces hurdles. For instance, the study used a 3T Connectom scanner (300mT/m gradients), which is not widely available. Reproduction of these results in standard 3T clinical scanners would be a great addition, in scenarios with lower resolution, less precise parameter recovery, and longer scans if SNR needs to be maintained.

      We agree that for clinical adoption it is important to demonstrate that HD-related SANDI differences can also be detected on clinical MRI systems and do not require ultra-strong gradient imaging. While we have not collected such data in people with HD, we have demonstrated the feasibility of modelling SANDI metrics from multi-shell diffusion-weighted imaging acquired on a clinical 3T MRI (maximum b-value of 6,000 s/mm²) in healthy adults and people with MS (Schiavi et al 2023, https://doi.org/10.1002/hbm.26416). Furthermore, Zeng et al 2025 reported significant differences in SANDI metrics acquired on a 3T MRI Prisma system between individuals with ALS and healthy controls (maximum b-value of 3,000 s/mm²).

      Two additional studies demonstrated that SANDI could be implemented and microstructural differences could be detected in MS using 3T scanners with standard gradient strength (Barakovic et al., 2024; Margoni et al., 2023). Collectively, these findings indicate that SANDI can be applied on clinical scanners, particularly as clinical systems move toward stronger gradient capabilities such as Siemens Magnetom Cima.X. These explanations can be found under the clinical implication section in the Discussion on page 25.

      (2) Limitations of HD-ISS staging resolution and group separation:

      The use of HD-ISS staging to anchor progression analyses is conceptually appropriate, but, in practice, the sample is quite limited.

      (a) Only 26-27 out of 56 gene-positive participants could be assigned HD-ISS stages, and none were classified into stages 0 or 4. This restricts the interpretation of progression to a narrow clinical window (mostly stages 1-3) and excludes over 50% of the cohort.

      (b) Furthermore, visual inspection of the scatter plots (e.g., Figures 3 and 4) reveals substantial overlap between stages 1 and 2, particularly in CAP100 and Q-Motor measures. This suggests that the separation between early disease stages may not be robust in this dataset, potentially due to limited power or phenotypic variability.

      (c) The above may cause claims based on progression across HD-ISS stages to be overinterpreted or underpowered.

      Despite this, the paper treats the staging as a reliable stratification for group comparisons. To improve clarity and transparency, I would recommend that the authors:

      (a) Acknowledge that over 50% of the HD cohort could not be classified.

      (b) Discuss whether those excluded differed from those included in key metrics.

      (c) Explicitly comment on the substantial overlap between stages 1 and 2, and limit claims about progression unless such separation is statistically supported.

      (d) Avoid overinterpreting staging-related effects without statistical support for group separability.

      Re a-d) We have added to the study limitations on pages 23 ff that only 54% (30 out of 56) of HD participants could be HD-ISS classified due to missing data, and we provide an overview of demographic and clinical information for HD-ISS stages and unclassified individuals in Supplementary Table 1. We acknowledge that the combined groups (HD-ISS 0-1 versus HD-ISS 2-3) for exploratory group analyses did not represent discrete disease stages and that there was some overlap in imaging and behavioural features between them, as illustrated in Figures 3, 4, and 7. We state explicitly that these exploratory findings should be interpreted with caution and require replication in larger, prospective cohorts before SANDI metrics can be considered as potential markers of disease progression.

      (3) Clarify regression strategy and interpretational limits of SANDI-derived regressors: While the hierarchical regression strategy is broadly appropriate, several aspects would benefit from clarification to improve both interpretability and robustness of the findings. For example:

      (a) Why were only a subset of SANDI parameters (fis and De) considered in the HC models (Figure 6), while additional metrics (fec and rs) were tested in HD models (Figures 7-8)? Including the same variables across groups could aid comparability?

      The same SANDI indices were included in the regression models for the HD and HC groups; Figures 7-8 report only significant predictors. This has been clarified in the figure legend and on page 14 of the manuscript.

      (b) Were any checks for multicollinearity (e.g., variance inflation factors) conducted? Given known interdependencies among some SANDI parameters, I wonder whether some of the reported regression coefficients may be unstable or difficult to interpret.

      Cross-correlation matrices between all imaging metrics for the HD, HC, and total samples have been added to the Supplementary materials (Figure 3).

      To improve transparency and interpretability, I suggest actions such as:

      (a) SANDI metrics included in the models differ between HC and HD groups, reducing comparability. Consider using consistent full models across ROIs for comparison purposes, even if some predictors are not significant.

      (b) Report the correlation structure between SANDI metrics within each group to assess multicollinearity (The potential impact of multicollinearity (e.g., between fis and rs) is not discussed)

      (c) Explicitly acknowledge the limitations imposed by parameter degeneracy in the SANDI model and clarify how the authors ensured the biological interpretability of regression outputs in this context - Beta coefficients could reflect model instability or parameter degeneracy rather than true biological effects.

      (a) The same SANDI metrics and age were included in the first regression models for HD and HC data. The first models only differed by the inclusion of TFC as estimate of disease burden for the HD data. HD and HC participants were not included in a single regression model, as our aim was not to perform formal between-group inference on regression coefficients. Instead, models were fitted separately to explore within-group associations and to descriptively compare patterns of relationships across groups. This approach avoids imposing identical model structures across groups that may differ in variance structure, disease burden, and biological coupling between SANDI metrics. We have clarified these points on page 13/14.

      (b) We agree that multicollinearity is an important consideration when interpreting regression coefficients derived from microstructural models. To address this, we examined pairwise Spearman correlations between all imaging (SANDI, DTI, volume) metrics (averaged across ROIs), shown in the revised Supplementary Figure 2. As can be seen in the healthy control data, SANDI indices of apparent soma and neurite fractions showed a strong inverse correlation (rho = -0.92) and did not correlate with soma radius (rho = 0.1). All SANDI indices correlated only weakly with FA and volume and moderately with MD. This correlation pattern suggests that apparent soma density and radius capture distinct information about grey matter microstructure that differs from neurite fraction and is not captured by FA or volume. We note in HD participants a negative correlation between soma radius and fraction, and stronger correlations between SANDI metrics and volume measures. We would argue that these reflect disease-related reorganization of micro- and macro-structural relationships rather than uniform collinearity across groups. This information has been added to the Methods, Results and Discussion sections on pages 13, 19, and 21, 23ff.

      (c) We agree that regression coefficients derived from interdependent microstructural parameters should be interpreted with caution, as they may reflect shared variance or partial parameter degeneracy rather than fully independent biological effects. For this reason, we do not interpret individual beta coefficients in isolation. Instead, our conclusions focus on the consistency and directionality of associations across regions and metrics, and on the overall feasibility and sensitivity of SANDI to detect biologically meaningful variation in HD. The observed correlation structure (Supplementary Figure 2) provides important context for these interpretations and supports a multivariate, pattern-based rather than univariate reading of the results. These points have been added to the Discussion on pages 23 ff. Please also refer to our response to point (5) below.
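The multicollinearity check described in (b) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, and the variance inflation factor is computed for the simple two-predictor case, treating the reported Spearman rho as if it were a Pearson correlation, which is only a rough approximation.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors."""
    return 1.0 / (1.0 - r * r)

# Illustrative call using the soma/neurite correlation quoted in the response.
print(round(vif_two_predictors(-0.92), 2))
```

Under these assumptions, a correlation of -0.92 between two predictors entered together corresponds to a variance inflation factor of about 6.5, which is why coefficients for soma and neurite fractions are better read as a joint pattern than individually.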

      (4) Preprocessing order:

      Gibbs ringing correction was applied after TOPUP and EDDY, which deviates from the commonly recommended order in diffusion MRI preprocessing. Since Gibbs artifacts are introduced by k-space truncation and affect the spatial domain, it is typically advised to perform Gibbs correction prior to geometric corrections like TOPUP and EDDY. This avoids potential blurring or propagation of ringing artifacts during resampling. Could the authors clarify the rationale for this ordering, and whether an early application of Gibbs correction was tested?

      We agree that applying Gibbs ringing correction after TOPUP and EDDY correction deviates from the commonly recommended order in diffusion MRI preprocessing. However, as some of the data included in this paper were preprocessed before this consensus was reached in the literature, we kept the preprocessing order consistent for all datasets for harmonization and comparison purposes. We have since changed the order for subsequent preprocessing of the HD-DRUM data and have found comparable FA maps for data processed with Gibbs ringing correction before and after TOPUP and EDDY correction.

      (5) Expand on SANDI model assumptions:

      SANDI is presented as being used for the very first time in this problem. However, a vague explanation is given: "using all the default settings". Given the novelty of applying SANDI in a clinical HD context, the manuscript would benefit from a discussion of the model's key assumptions and limitations. For instance:

      (a) The potential degeneracy between fis and rs in the absence of protocol features (e.g., long Δ or high b) that can disambiguate them.

      (b) Whether a dot compartment was included, and the implications of excluding it for the interpretation of rs or fis.

      (c) The lack of exchange modeling or fixed stick diffusivity, and how these may bias compartment estimates (particularly in diseased or aging tissue).

      (d) Any steps taken to verify robustness or identifiability (e.g., simulations, synthetic fitting). These issues are not flaws in the method, but they do affect how confident we can be in interpreting fis/rs as markers of neuron loss or glial hypertrophy, especially given the subtle group differences and the potential for biological heterogeneity in HD. Even a brief acknowledgment would strengthen the manuscript and provide useful context to readers less familiar with multicompartment modeling.

      We thank the reviewer for this constructive suggestion and fully agree that, because this is the first application of SANDI in our clinical HD cohort, the manuscript should more explicitly describe the model assumptions, potential identifiability limitations under our protocol, and the implications for biological interpretation.

      We have revised the Methods (pages 11-12) and Discussion (page 24) to (i) specify the exact SANDI implementation used (the SANDI MATLAB toolbox, available at: https://github.com/palombom/SANDI-Matlab-Toolbox-Latest-Release), (ii) describe which components are included in the default formulation and the key modelling assumptions, and (iii) add a dedicated “Limitations and interpretability” paragraph addressing points (a–d) below. We also avoid the previous shorthand “default settings” and provide a clear description of the fitting setup.

      “The SANDI model [Palombo M. et al, NeuroImage 2020] assumes three compartments, namely intra-neurite signal modelled as diffusion inside impermeable randomly oriented sticks, intra-soma signal modelled as restricted diffusion inside spheres, and extra-cellular signal modelled as Gaussian isotropic diffusion. The direction-averaged (or spherical mean) normalized diffusion signal has thus the following expression:

      S(b) = f<sub>is</sub>A<sub>sphere</sub>(b, r<sub>s</sub>, D<sub>is</sub>) + f<sub>in</sub>A<sub>stick</sub>(b, D<sub>in</sub>) + f<sub>ec</sub>A<sub>ball</sub>(b, D<sub>e</sub>)

      where f<sub>in</sub> + f<sub>is</sub> + f<sub>ec</sub> = 1; A<sub>stick</sub> and A<sub>sphere</sub> are the normalized, directionally-averaged (or spherical mean) signals for restricted diffusion within neurites and soma, respectively, and A<sub>ball</sub> is the normalized, directionally-averaged (or spherical mean) signal of the extra-cellular space. The specific expressions are given in [Palombo M. et al. NeuroImage 2020]. The parameters estimated from the direction-averaged (or spherical mean) data are D<sub>in</sub>, a proxy of the intra-neurite effective axial diffusivity; D<sub>e</sub>, a proxy of the extracellular effective mean diffusivity; r<sub>s</sub>, a proxy of apparent soma radius; and the signal fractions f<sub>in</sub>, f<sub>is</sub> and f<sub>ec</sub>, subject to the constraint f<sub>in</sub> + f<sub>is</sub> + f<sub>ec</sub> = 1, which are proxies of the relaxation-weighted neurite, soma and extracellular volume fractions, respectively. The bulk diffusivity inside the sphere D<sub>is</sub> is fixed to 3 μm<sup>2</sup>/ms. The parameters were fitted using a Random Forest regression algorithm (TreeBagger Matlab®) with 200 trees, trained on simulated data, using the code publicly available at https://github.com/palombom/SANDI-Matlab-Toolbox-Latest-Release. The training data consisted of simulated signals for 10<sup>5</sup> parameter combinations, uniformly sampled: f<sub>in</sub> and f<sub>is</sub> ∈ [0, 1], D<sub>in</sub> ∈ [0.5, 3] μm<sup>2</sup>/ms, D<sub>e</sub> ∈ [0.5, 3] μm<sup>2</sup>/ms and r<sub>s</sub> ∈ [1, 12.5] μm. Rician noise with a distribution of standard deviations randomly sampled from the voxels within the brain mask of the noise map obtained using MPPCA denoising was added to account for realistic SNR levels and the rectified noise floor. The loss function of the training was the mean squared error between predicted parameters and ground truth values. Model fitting provided maps of f<sub>in</sub>, f<sub>is</sub>, f<sub>ec</sub>, D<sub>in</sub>, D<sub>e</sub> and r<sub>s</sub>.”
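The three-compartment signal quoted above can be illustrated with a short forward-model sketch. It is not the toolbox code (the toolbox is written in MATLAB and fits parameters with a Random Forest). The function names are hypothetical. The sketch assumes the standard spherical-mean expressions for a stick and a ball, and the Gaussian phase approximation for restricted diffusion in a sphere; the exact expressions should be checked against the cited paper. Units are an assumption chosen for convenience: b in ms/μm<sup>2</sup> (so 6000 s/mm<sup>2</sup> is 6), diffusivities in μm<sup>2</sup>/ms, radius in μm, and timings in ms, with δ = 7 ms and Δ = 24 ms taken from the response to point (a).

```python
import math

def sphere_roots(n_roots=20):
    """Roots of 2x*cos(x) + (x^2 - 2)*sin(x) = 0, found by bisection.

    These are the zeros of the derivative of the spherical Bessel function j1,
    which appear in the Gaussian phase approximation for a sphere.
    """
    def g(x):
        return 2.0 * x * math.cos(x) + (x * x - 2.0) * math.sin(x)
    roots = []
    for m in range(1, n_roots + 1):
        lo, hi = max(0.5, (m - 1) * math.pi), m * math.pi
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if g(lo) * g(mid) <= 0.0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

ROOTS = sphere_roots()

def a_stick(b, d_in):
    """Spherical-mean signal of randomly oriented sticks."""
    x = b * d_in
    if x < 1e-12:
        return 1.0
    return math.sqrt(math.pi / (4.0 * x)) * math.erf(math.sqrt(x))

def a_ball(b, d_e):
    """Isotropic Gaussian diffusion."""
    return math.exp(-b * d_e)

def a_sphere(b, r_s, d_is=3.0, delta=7.0, big_delta=24.0):
    """Restricted diffusion in a sphere (Gaussian phase approximation)."""
    # (gamma * g)^2 recovered from b = (gamma*g)^2 * delta^2 * (Delta - delta/3)
    gg2 = b / (delta ** 2 * (big_delta - delta / 3.0))
    total = 0.0
    for mu in ROOTS:
        alpha = mu / r_s
        a2d = alpha ** 2 * d_is
        num = (2.0 + math.exp(-a2d * (big_delta - delta))
               - 2.0 * math.exp(-a2d * delta)
               - 2.0 * math.exp(-a2d * big_delta)
               + math.exp(-a2d * (big_delta + delta)))
        total += alpha ** -4 / (mu ** 2 - 2.0) * (2.0 * delta - num / a2d)
    return math.exp(-2.0 * gg2 / d_is * total)

def sandi_signal(b, f_in, f_is, r_s, d_in, d_e):
    """Direction-averaged signal with f_ec = 1 - f_in - f_is."""
    f_ec = 1.0 - f_in - f_is
    return (f_is * a_sphere(b, r_s) + f_in * a_stick(b, d_in)
            + f_ec * a_ball(b, d_e))

# Illustrative parameter values only, not fitted results.
for b in (0.0, 1.0, 3.0, 6.0):
    print(b, sandi_signal(b, f_in=0.3, f_is=0.4, r_s=5.0, d_in=2.0, d_e=1.0))
```

Computing (γg)<sup>2</sup> from b and the timings avoids carrying the gyromagnetic ratio and gradient amplitude explicitly, which keeps the units consistent in this sketch.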

      (a) Potential degeneracy between f<sub>is</sub> and r<sub>s</sub>. We agree that partial coupling (or degeneracy) between the soma fraction f<sub>is</sub> and soma radius r<sub>s</sub> is possible when the acquisition does not provide strong sensitivity to restricted sphere size (e.g., in the low b-value regime). Our protocol benefits from high b-values (up to 6000 s/mm<sup>2</sup>) enabled by the Connectom gradient system, which increase sensitivity to signal attenuation from restricted compartments and reduce the f<sub>is</sub>-r<sub>s</sub> coupling/degeneracy. However, we acknowledge that the specific choice of fixed diffusion timing (in our case δ=7 ms, Δ=24 ms) can further modulate the f<sub>is</sub>-r<sub>s</sub> coupling/degeneracy in a protocol-dependent way. To reflect this appropriately, we now explicitly state that r<sub>s</sub> should be interpreted as an “apparent soma radius” under our protocol, and that our inferences focus on relative group differences and spatial patterns rather than absolute histological soma radii.

      We have now added a paragraph in the limitations section acknowledging this point.

      (b) Dot compartment. We did not include an explicit “dot” (immobile) compartment, because there is no evidence that one is required in the in vivo human brain (see, for example, the very low, negligible contribution reported in Tax C. et al. NeuroImage 2020: https://www.sciencedirect.com/science/article/pii/S1053811920300215). Accordingly, our fits did not include a dot term, and we now state this explicitly in the Methods. However, we would like to clarify that our fitting method (described in detail at https://github.com/palombom/SANDI-Matlab-Toolbox-Latest-Release) accurately models the impact of Rician noise and thus accounts for the corresponding rectified noise floor, which in high b-value applications is very often mistakenly attributed to a “dot” compartment. Therefore, no bias in the estimated f<sub>is</sub> and r<sub>s</sub> is expected from omitting a “dot” compartment.

      (c) Exchange modelling and fixed stick diffusivity. We agree that SANDI, as implemented here, does not explicitly model inter-compartment exchange during the diffusion encoding and uses a simplified representation of neurites (sticks); however, the intra-stick diffusivity, D<sub>in</sub>, was fitted rather than fixed. In diseased or aging tissue, deviations from these assumptions (e.g., altered membrane permeability) may bias compartment estimates. This has been investigated in depth in Schiavi S. et al. HBM 2023 (https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.26416), to which we refer the reader. We have added an explicit limitation statement noting that HD-related microstructural changes (e.g., changes to membrane permeability) could affect model parameter fidelity, and thus f<sub>is</sub> and r<sub>s</sub> should be treated as MRI-derived effective indices rather than direct quantitative measures of neuron loss or glial hypertrophy. Importantly, our analysis compares groups under an identical acquisition and fitting pipeline, so group-level contrasts remain informative even if absolute parameter values are biased.

      (d) Robustness / identifiability checks. We agree that reporting robustness strengthens confidence, particularly given subtle effects and biological heterogeneity. The SANDI Matlab Toolbox we used extensively investigates the robustness and identifiability of model parameters using numerical simulations and synthetic signals accounting for the specific experimental protocol and noise distribution. An example of the results supporting the robustness / identifiability is reported in the Author response images. These results show that the accuracy and precision of all SANDI model parameters, except D<sub>in</sub>, are very high (>~80%, Author response image 1)

      Author response image 1.

      Analysis of the accuracy and precision of SANDI model parameter estimation. We simulated 10<sup>4</sup> synthetic diffusion signals using the SANDI model with random combinations of five parameters: f<sub>neurite</sub> (f<sub>in</sub>), f<sub>soma</sub> (f<sub>is</sub>), D<sub>in</sub>, R<sub>soma</sub> (r<sub>s</sub>), and D<sub>e</sub>. Parameters were sampled uniformly from: f<sub>neurite</sub>, f<sub>soma</sub> ∈ [0, 1]; D<sub>in</sub>, D<sub>e</sub> ∈ [0.5, 3.0] µm<sup>2</sup>/ms; R<sub>soma</sub> ∈ [1, 12] µm. Rician noise with experimentally estimated variance was added, and the SANDI model was then fit to the noisy signals. For each parameter, we report the relative percentage error between estimated and ground-truth values as a function of the parameter value (normalized to [0,1]), together with goodness-of-fit (R<sup>2</sup>).

      and sensitivity to changes as small as 5% in each of the model parameters is correctly captured (Author response image 2A), with small to negligible degeneracy (except, once again, for D<sub>in</sub>), even in the presence of exchange (Author response image 2B).

      Author response image 2.

      Sensitivity to 5% parameter modulations. The matrices show how a controlled perturbation in one parameter propagates into the estimated values of all model parameters. Each row corresponds to a 5% increase in the parameter on the y-axis; the resulting percentage change observed in each estimated parameter is reported along the x-axis. An ideal estimator would yield a purely diagonal matrix, with 5% on the diagonal and 0% elsewhere (no cross-talk). In (A), we used the same synthetic SANDI signals as in Author response image 1. In (B), we additionally generated 10<sup>4</sup> synthetic signals incorporating neurite–extra-cellular exchange using the NEXI model [https://doi.org/10.1016/j.neuroimage.2022.119277] and an exchange time representative of human cortex (τ<sub>ex</sub> ≈ 30 ms) [https://doi.org/10.1162/imag_a_00104].

      We have therefore revised the manuscript language to be more precise and appropriately cautious, describing f<sub>is</sub> and r<sub>s</sub> as apparent compartment indices and explicitly discussing potential confounds (e.g., parameter coupling, and unmodelled exchange), while clarifying the value of SANDI for detecting reproducible group-level microstructural differences in HD.
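Two ingredients of the simulation checks in (d), Rician noise and the relative percentage error, can be sketched as follows. This is a minimal illustration with hypothetical function names and an arbitrary noise level, not the toolbox implementation. It also shows the rectified noise floor mentioned in (b): with zero true signal, the mean magnitude does not fall to zero but settles near σ√(π/2).

```python
import math
import random

def add_rician_noise(signal, sigma, rng):
    """Magnitude of a complex value whose real and imaginary parts carry
    independent Gaussian noise with standard deviation sigma."""
    real = signal + rng.gauss(0.0, sigma)
    imag = rng.gauss(0.0, sigma)
    return math.hypot(real, imag)

def relative_percent_error(estimate, truth):
    """Signed relative error of an estimate, in percent of the true value."""
    return 100.0 * (estimate - truth) / truth

rng = random.Random(0)   # fixed seed so the sketch is repeatable
sigma = 0.05             # assumed noise level relative to the b=0 signal

# Noise floor: average magnitude measured when the true signal is zero.
floor = sum(add_rician_noise(0.0, sigma, rng) for _ in range(20000)) / 20000
print(floor, sigma * math.sqrt(math.pi / 2.0))
```

A fit that ignores this floor can mistake it for a non-attenuating signal component at high b-values, which is the confusion with a “dot” compartment that the response describes.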

      (6) Clarify "not-classified" group in figures:

      It is not clear to me what the "not-classified" groups shown in Figures 3-4 represent, what criteria determined their inclusion, and whether their inclusion affects the comparability or interpretability of staging-based analyses.

      We have added to the legends of Figures 3 and 4 that not-classified refers to HD participants who could not be HD-ISS classified due to missing clinical data or their CAG repeat falling within the 36-40 range. As correlation analyses were conducted across the whole HD sample though, these datapoints were included in the scatterplot.

      (7) Figure labeling:

      There appears to be a mismatch between figure numbering and captions around Figures 3-4. Please ensure alignment.

      Mismatch between figure numbering and captions has been corrected.

      Minor suggestions:

      (1) Figures 1-2:

      (a) Label axis values meaningfully, e.g., negative vs. positive instead of 0 vs 1.

      (b) Add units to MD axes (e.g., ×10⁻⁴ mm²/s).

      (c) Figure 6 colors: Consider improving the color distinction between "Age" and "fis" predictors, which are currently hard to differentiate.

      The suggested adjustments have been made to Figures 1, 2, 5 and 6 and Figure 2 legend.

      (c) Discuss why apparent soma size decreases in some ROIs (e.g., pallidum), if unexpected.

      We offer the following speculation about the reduced soma size in the pallidum (pages 20/21): Changes in apparent soma size may reflect alterations in neural and glial cell proportions and/or morphology, including astrocyte and microglia swelling in response to neurodegeneration and soma shrinkage preceding neuronal cell death. Thus, increased apparent soma size in the striatum may indicate HD-related reorganisation of cell types driven by MSN loss and reactive glial cell swelling, whereas smaller soma size in the pallidum may result from infiltration of smaller glia cells prior to secondary neuronal loss following striatal MSN degeneration.

      Reviewer #3:

      (1) An important question is whether the SANDI measures, which require an expensive scanner and elaborate processing, are better biomarkers than the more traditional DTI measures. Can the authors compare the effect size of FA/MD with SANDI measures? In some of the plots and tables, FA/MD seem to have comparable, if not higher, correlations with QMotor or CAP scores. In the same vein, it is unclear whether DTI measures were included in hierarchical stepwise regression. I wonder if the stepwise models may have picked up FA/MD instead of SANDI measures if they are given a chance. Overall, I hope the authors can discuss their findings also in this light of cost vs. benefit of adopting SANDI in future studies, which is an important topic for clinical trials.

      Effect sizes (ES) of group differences in all microstructural indices can be found in Table 4. ES of DTI and SANDI indices in the caudate and putamen were broadly comparable, with a trend for MD showing larger ES (FA: r<sub>rb</sub> = 0.38-0.55, MD: r<sub>rb</sub> = 0.51-0.61, f<sub>is</sub>: r<sub>rb</sub> = 0.32-0.45, r<sub>s</sub>: r<sub>rb</sub> = 0.45-0.53).

      This information is now reported in the Results section on pages 15/16 and is discussed in light of cost versus benefit considerations on pages 21 and 25.
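For readers unfamiliar with the effect size reported here, the sketch below assumes that r<sub>rb</sub> denotes the rank-biserial correlation associated with a Mann-Whitney-type two-group comparison; the response does not spell this out. The function name and the toy values are hypothetical and are not study data.

```python
def rank_biserial(group_a, group_b):
    """Rank-biserial correlation between two independent groups.

    Counts all cross-group pairs: the result is the proportion of pairs in
    which group_a is larger minus the proportion in which group_b is larger.
    Ties contribute to neither count.
    """
    wins = 0
    losses = 0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1
            elif a < b:
                losses += 1
    return (wins - losses) / (len(group_a) * len(group_b))

# Toy values for illustration only.
patients = [1.0, 2.0, 3.0]
controls = [2.0, 3.0, 4.0]
print(rank_biserial(patients, controls))
```

The value ranges from -1 to 1, with 0 meaning that neither group tends to exceed the other, so magnitudes of 0.3 to 0.6 such as those quoted above indicate partial rather than complete separation of the groups.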

      (2) Similar to the above point, it is very important to consider how strong the biomarking signal is from SANDI measures compared to the good old striatal volume. Some plots seem to indicate that volumes still have the highest correlation with QMotor and the highest effect size in group comparisons. It would be helpful for the community to know where the new SANDI measures stand compared to the most typically used volumes in terms of effect size.

      Effect sizes (ES) of group differences in volumes can be found in Table 2. ES in caudate and putamen volumes ranged between r<sub>rb</sub> = 0.49-0.55 and were comparable to the ES of apparent soma size (r<sub>rb</sub> = 0.45-0.53) but slightly larger than the ES of soma density (r<sub>rb</sub> = 0.32-0.45).

      This information is now reported in the Results section on pages 15/16 and is discussed on pages 21 and 25.

      (3) The diffusion measures are inevitably correlated to some degree. Please provide a correlation matrix in the supplementary material, including all DWI measures, to enable readers to better understand how similar SANDI measures are to each other or vs. other DTI measures. Perhaps adding volumes to this correlation matrix may also be a good future reference.

      We have added cross-correlation matrices between all imaging measures (SANDI, DTI, Volumes) for the total sample as well as for HC and HD participants separately to the Supplementary material (Figure 3), providing an overview of the shared variance within SANDI parameters and between SANDI and DTI and volume metrics for each group.

      (4) ISS stages:

      (a) The online ISS calculator requires cut-offs derived from the longitudinal Freesurfer pipeline, while the authors do not have longitudinal data. Thus, the ISS classification might be inaccurate to some degree if the authors used the FS cross-sectional pipeline. Please review this issue and see if updated cut-offs should be used to classify participants.

      We acknowledge that our HD-ISS classifications may have been biased due to the use of cross-sectional rather than longitudinal FreeSurfer v6 volumes (page 23).

      (b) Were there really no participants with ISS 0 among the 56 HD individuals? Please clarify in the manuscript.

      We classified four individuals as ISS 0 based on their caudate and/or putamen z-scored volumes falling below 2SD of the healthy control mean. These analyses are described on pages 14-15 and were based on the cross-sectional data of this study.
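The z-scoring step referred to here can be sketched as follows. This is a minimal illustration with hypothetical function names and toy numbers. It only shows how a value is standardized against the healthy control distribution and compared with a 2 SD cut-off; how that flag maps onto an HD-ISS stage follows the authors' criteria and is not encoded in the sketch.

```python
import statistics

def z_scores_against_controls(patient_values, control_values):
    """Standardize patient values using the control mean and sample SD."""
    mean = statistics.mean(control_values)
    sd = statistics.stdev(control_values)
    return [(value - mean) / sd for value in patient_values]

def below_threshold(z_values, threshold=-2.0):
    """Flag values lying more than 2 SD below the control mean."""
    return [z < threshold for z in z_values]

# Toy volumes in arbitrary units, for illustration only.
controls = [9.0, 10.0, 11.0]
patients = [7.5, 9.5]
z = z_scores_against_controls(patients, controls)
print(z, below_threshold(z))
```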

      (5) A note on terminology that might be confusing to some readers. According to the creators of ISS, the ISS stages are created for research only; they are not used or applied in the clinic. On the other hand, the terms "premanifest" and "manifest" have a clinical meaning, typically based on the diagnostic confidence level. The assignment of ISS0-1 to premanifest and ISS2-3 to manifest may create some non-trivial confusion, if not opposition, in some segments of the HD community. The authors can keep their current terminology, but will need to at least clarify to the reader that this assignment is speculative, does not fully match the clinically-based categories, and should not be confused with similarly named groups in the previous literature.

      To avoid confusion about terminology, we have removed the labels “premanifest” versus “manifest” throughout the manuscript. We refer to HD-ISS 0-1 and HD-ISS 2-3 when referring to the exploratory comparisons between HD-ISS stages.

      (6) The population in the study seems to be obtained from different other studies or research projects, and there are missing scores for several participants due to the retrospective nature of sample gathering for the analyses. Please state clearly that this study was done with retrospective data to properly justify why there are missing data. Also, and this is important, please clarify for the reader whether there was any temporal bias in the acquisition of data of a certain group (HD) vs. another (HC). It is important to rule out that there were no scanner changes or upgrades that may confound the reported group differences.

      We can confirm there were no Connectom scanner changes or upgrades that may have confounded the reported group differences. This was added to the image acquisition section on page 10. We have added to the participant section on page 9 that data were retrospectively pooled from separate studies and explain this was the reason why HD-ISS classification was only available for a subset of participants.

      (7) Several of the significant results with SANDI scores seem to be driven by a subgroup of HD individuals that are more clearly different than the healthy control distribution. Not sure if this may help, but one idea the authors can consider is to check if HD individuals that deviate more than 2 SDs from the healthy control distribution of SANDI scores have also worse QMotor, worse atrophy, or higher CAP scores than those HD individuals that are practically within the 2SD boundary distribution of HDs. This is another way of showing that the new measures have potential for application in individualized medicine (the MRI Z score of a patient as a proxy of the clinical deterioration). It is not a request to authors but just a suggestion for their consideration.

      The data points in the scatterplots of Figures 3, 4, and 7 have now been color-coded according to HD-ISS stage, showing a stage-related worsening of microstructural and volumetric imaging markers and Q-Motor performance.

      (8) The variance explained in hierarchical regression is obtained by fitting models within the sample, and can be subject to overfitting. In the absence of a more robust cross-validated R2, the authors may want to at least briefly inform the reader that the current approach can be subject to overfitting and does not represent a true out-of-sample R2.

      We have added this point to the study limitations in the Discussion section on page 23.
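The difference between an in-sample R<sup>2</sup> and an out-of-sample estimate can be sketched with a single-predictor regression and leave-one-out cross-validation. This is a simplified stand-in for the hierarchical models in the manuscript, with hypothetical function names and toy data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def r_squared(y, y_hat):
    """1 - residual sum of squares / total sum of squares."""
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def in_sample_r2(x, y):
    """R2 when the model is fitted and evaluated on the same data."""
    slope, intercept = fit_line(x, y)
    return r_squared(y, [slope * a + intercept for a in x])

def loo_r2(x, y):
    """R2 from predictions made for each point by a model fitted without it."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return r_squared(y, preds)

# Toy data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.6, 5.3, 5.8]
print(in_sample_r2(x, y), loo_r2(x, y))
```

For least-squares fits the leave-one-out value cannot exceed the in-sample value, and the gap tends to widen as predictors are added relative to the sample size, which is the overfitting concern raised by the reviewer.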

      (9) There are two Figure 3 labels, and all figures thereafter do not match the manuscript.

      The Figure numbering has been corrected.

      (10) In (the currently labelled) Figure 8, there seem to be fewer than 56 data points in the scatterplots. Is there a reason why not all 56 HD individuals have the CAP100 score available? CAP needs only CAG and age, which all HD gene carriers should have, to be included in the study.

      Inclusion criteria for individuals with HD for the HD-DRUM project were a positive genetic test for the presence of the mutant huntingtin allele (CAG length ≥ 36 repeats) and/or a clinical diagnosis of HD. Thus, for a small number of participants CAG was not available for the calculation of CAP100 score.
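The dependence of the CAP100 score on CAG length can be made concrete with a small sketch. The formula and constants used here (age × (CAG − 30) / 6.49) are an assumption based on the commonly cited standardized CAP score and are not stated in the response; they should be checked against the definition used in the manuscript. The function name is hypothetical.

```python
def cap100(age_years, cag_repeats):
    """CAG-Age-Product score, scaled so that 100 is expected near motor onset.

    Assumed constants: offset 30 and scaling 6.49. Returns None when the CAG
    repeat length is unknown, since the score cannot be computed without it.
    """
    if cag_repeats is None:
        return None
    return age_years * (cag_repeats - 30) / 6.49

print(cap100(40, 43))    # participant with a known CAG length
print(cap100(50, None))  # clinically diagnosed participant without CAG data
```

This mirrors the situation described in the response: participants enrolled on a clinical diagnosis alone have no CAG value, so no score can be plotted for them.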

    1. eLife Assessment

      Using fMRI-based pRF mapping, this important study presents a novel method for estimating visual field (VF) loss and potential restoration by analyzing contrast-sensitivity patterns in early visual cortex. The evidence supporting the main claims is convincing. This work will be of broad interest to researchers in vision and clinical vision, neuroscience, and brain imaging.

    2. Reviewer #1 (Public review):

      Integrating large-field stimulation with a retinotopic atlas, this study introduces an fMRI-based method for measuring contrast sensitivity across the visual field. Retinotopy was assessed using pRF mapping and a calibrated Benson atlas. The authors validate their method by replicating known patterns of contrast sensitivity across eccentricities and visual field quadrants in healthy subjects, and demonstrate its potential clinical utility through case studies of both simulated and real visual field loss.

      Comments on revisions:

      I appreciate the addition of the quadrant-scotoma condition and the authors' clarification that the goal is to demonstrate individual-level detection sensitivity. The 95% CI argument is reasonable, and I am satisfied with framing the simulated-scotoma work as proof-of-concept.

    3. Reviewer #2 (Public review):

      Summary

      This study uses functional MRI to evaluate visual contrast sensitivity across the visual field at the level of the visual cortex, testing the method as a proof of principle in a small group of normally sighted individuals, modelling both normal vision and simulated vision loss, as well as a patient with independently verified vision loss. The results suggest a promising technique that measures vision objectively across the visual field and overcomes the requirement for careful fixation, which is often challenging in those with low vision or sight loss.

      Strengths

      • Objective measure of central vision: The proposed method may provide a more comprehensive and objective assessment of residual visual function in individuals with sight loss. This may be particularly useful for those with central visual field loss without the requirement of stable fixation or subjective motor responses.

      • More sensitive measure: The use of slope to calculate contrast sensitivity across a range of contrasts within the brain is clever and likely more sensitive than single threshold measurements or standard clinical measures of visual acuity using letter charts. Standard supra-threshold (high contrast) tests are not ideal for capturing residual vision or partial vision loss.

      • Good agreement with standard atlas: The Benson atlas provides a good estimate of visual field maps within V1 based on anatomical landmarks, and the authors take steps to refine this informed by cortical magnification and V1 surface area (brain size) for each individual participant. This could allow the technique to be generalised without the need to collect lengthy individual mapping data from every participant.

      • Within-subject reproducibility: The measurements appear to be sensitive and reproducible, particularly in those with normal vision, and are consistent with known features of visual sensitivity differences in different parts of the visual field.

      • Potential tool to measure visual field sensitivity in controls: Even if the proposed methods are not ideal for widespread clinical translation, they do offer an exciting tool to test hypotheses about visual field differences in healthy controls. For example, there seems to be an increase in sensitivity on either side of the simulated ring scotoma (Fig 6 - perhaps due to the release of lateral inhibition?). Reliability measures suggest that individual differences are consistent in healthy controls (although not tested statistically, perhaps due to the small sample size?). Whether they reflect behaviourally meaningful differences in visual field sensitivity could be tested in individuals by comparing them to behavioural measures across the visual field.

      • Potential tool to test novel treatments: The proposed techniques could be used to test within-subject changes in visual function in environments that are equipped to measure and analyse fMRI data, including clinical trials aimed at determining the success of novel treatments. Preliminary testing in healthy controls with eye movements also suggests that the method is suitable for testing low vision patients with unstable fixation (e.g., nystagmus), and the authors have modelled the effects of varying amounts and types of eye movements on functional outcome measures.

      Weaknesses

      • Questionable sensitivity to differences in patients. The variability in heat maps across healthy control participants is somewhat surprising, and it is uncertain whether it represents actual visual sensitivity differences or an artifact of the measurement technique, e.g., due to signal-to-noise differences introduced by local variations in brain anatomy. Thus, it is uncertain whether the substantial variance across controls will allow for a sufficiently stable baseline to detect meaningful differences in individual patients. Also, as the authors rightly point out, the Benson atlas does not model differences along meridians, so that upper/lower field differences might not be detectable. However, the authors acknowledge that this is a pilot study, and further testing of a wider range of scotoma types, both in patients and simulated in controls, will only improve the methods. Furthermore, the ability to capture visual field representations in human visual cortex is also likely to improve with computational advances, making the use of atlases more feasible, obviating the need for individualised population receptive field mapping.

      • Potential for clinical translation. Although it is a sensitive measure, functional MRI is costly, is not available in all clinical settings, requires significant post-processing analyses, and may be contraindicated in some individuals due to safety (e.g., metallic implants) or other concerns (e.g., claustrophobia). These could present significant barriers to widespread clinical translation, if this were the ultimate goal of the study.

      • Limited range of spatial frequencies. The spatial frequencies tested were still quite low (0.3 and 3cpd) compared to measures such as visual acuity. Extending the measurements to higher spatial frequencies could allow better characterization of central vision, although not necessarily of peripheral vision. However, this may depend on the typical visual abilities of the patient population of interest.

      Appraisal and Impact:

      The authors used appropriate and robust methods to assess and model known features of visual sensitivity differences across the visual field in sighted controls. In addition, the assessment technique successfully captured sensitivity changes due to simulated and actual partial field loss but was also fairly resilient to eye movements and fixation instability, typical of patients with sight loss. Although currently providing a proof of principle, the method is likely to improve with further testing and increasing normative sample sizes, and as computational methods continue to advance visual field map predictions. Although it may not be adopted widely as a standard clinical assessment technique due to the expense and other obstacles, it would provide a valuable tool in assessing clinical populations, for example in the context of clinical trials to assess suitability for treatment interventions or monitor treatment outcomes.

    4. Reviewer #3 (Public review):

      Summary:

      Chow-Wing-Bom et al. introduce an innovative wide-field visual stimulation setup for 3T experiments that enables stimulation up to a diameter of 40° visual angle while allowing continuous gaze tracking. Using this setup, the authors systematically investigate contrast sensitivity across the visual field by presenting subjects with sinusoidal gratings varying in contrast and spatial frequency. Their findings confirm the expected organization of contrast sensitivity, demonstrating a preference for high spatial frequencies in the central field and lower frequencies in the periphery. They also extend these measurements to eccentricities up to 20°, which exceeds previous fMRI-based reports. Moreover, the study explores the potential of using contrast sensitivity calculations as a method for detecting visual field defects, demonstrated in a healthy subject with simulated ring-shaped and upper-right-quadrant scotomas, and in a patient with LHON. The revised version additionally characterises the robustness of the approach to varying degrees of fixation instability.

      Strengths:

      - The manuscript is well written and provides comprehensive methodological details, ensuring high transparency and reproducibility.

      - The visual stimulation setup represents a significant technical advance by enabling wide-field stimulation with continuous eye tracking, which is crucial for both research and potential clinical applications.

      - The study confirms established findings regarding the organization of contrast sensitivity while extending them to a larger eccentricity range.

      - The effort to establish a measure for visual field losses aligns with current efforts to develop objective alternatives to conventional perimetry.

      - The revised manuscript includes an empirical assessment of how varying levels of eye movement affect cortical contrast sensitivity estimates, providing useful guidance on the tolerance of the approach to fixation instability.

      Weaknesses:

      - The original version left certain methodological aspects unclear, particularly the correction of eccentricity values from the Benson atlas and the V1 masks used in each analysis branch. The authors have added a dedicated figure illustrating the eccentricity correction procedure and now explicitly state that a manually delineated V1 mask was used for the pRF-based analyses while the Benson V1 label was used for the atlas-based analyses, together with a discussion of how this difference may influence the comparison.

      - Minor inconsistencies in reporting, such as the introduction of a second session in the Results section, have been corrected.

      The conclusion that high-contrast patterns, as used in pRF mapping, are not optimal for testing subtle but potentially clinically relevant changes in visual field coverage is valid. Contrast sensitivity can therefore be a well-suited parameter for estimating visual field losses. The presented work is an interesting starting point, and the proposed method of using contrast sensitivity as a measure of partial vision loss should be further explored.

      Comments on revisions:

      The authors have thoroughly addressed all points raised in my original review, and I have no further concerns.

    5. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      The current claims should be better supported by more evidence.

      R1-1: In the first experiment, have the statistics undergone multiple comparison corrections (e.g., Line 441-442)? Given the small sample size, incorporating additional statistical tests (such as the Bayes Factor) could strengthen the analysis.

      We confirm that corrections for multiple comparisons are now applied where appropriate, particularly in the group-level ANOVA analyses.

      “Post-hoc tests using Holm-Bonferroni correction showed that V1 neuronal populations receiving inputs from the central visual field (0.5-4.5°) showed greater contrast sensitivity to high spatial frequency as compared to low spatial frequency stimuli (steeper slope for the 3cpd versus 0.3cpd condition: 0.5-2.5º: t(6) = 4.35, p<sub>bonf</sub> = 0.0149; 2.5-4.5º: t(6) = 3.471, p<sub>bonf</sub> = 0.0266). Conversely, peripheral eccentricities in V1 (above 9.5°) showed higher contrast sensitivity to low as compared to high spatial frequency stimuli (steeper slope for 0.3cpd versus 3cpd condition: 9.5-15º: t(6) = −4.591, p<sub>bonf</sub> = 0.0149; 15-20º: t(6) = −6.615, p<sub>bonf</sub> = 0.0029). Between 4.5° and 9.5°, V1 contrast sensitivity was similar for both spatial frequencies (t(6) = −0.226, p<sub>bonf</sub> = 0.8286). Crucially, these effects remained when using retinotopic estimates based on structural scans derived from the Benson retinotopic atlas instead of the pRF-mapping measures (0.5-2.5º: t(6) = 5.768, p<sub>bonf</sub> = 0.0059; 2.5-4.5º: t(6) = 2.531, p<sub>bonf</sub> = 0.0892; 4.5-9.5º: t(6) = −0.293, p<sub>bonf</sub> = 0.7792; 9.5-15º: t(6) = −3.274, p<sub>bonf</sub> = 0.0509; 15-20º: t(6) = −3.528, p<sub>bonf</sub> = 0.0496; see Figure A2 and Table A3 in Appendix section).”

      “Post-hoc pairwise comparisons using Holm-Bonferroni corrections revealed that, as predicted, the cortical contrast response function had a higher slope – indicating better V1 sensitivity – along the horizontal versus vertical quadrants (Horizontal-Vertical Anisotropy – HVA: t(6) = 5.908, p<sub>bonf</sub> = 0.0031) and along the lower versus upper quadrant (Vertical Meridian Anisotropy – VMA: t(6) = 4.106, p<sub>bonf</sub> = 0.0126). Conversely, no difference in cortical contrast sensitivity was found between V1 neuronal populations encoding the left and right quadrants of the visual field (Left-Right Horizontal Meridian Anisotropy – LRHMA: t(6) = 0.7197, p<sub>bonf</sub> = 0.4988).”

      “We found that the horizontal-vertical anisotropy effect was recovered (HVA: t(6) = 3.584, p<sub>bonf</sub> = 0.0347), but that the vertical meridian anisotropy effect was not (VMA: t(6) = 0.744, p<sub>bonf</sub> = 0.9697) with this approach.”
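
      For readers unfamiliar with the correction named in the passages above, the Holm-Bonferroni step-down procedure can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `holm_bonferroni` and the example p-values are hypothetical, and the adjusted values reported in the manuscript are not reproduced here.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the tests, sorted from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical uncorrected p-values for four comparisons (illustration only).
raw_p = [0.004, 0.030, 0.020, 0.800]
print(holm_bonferroni(raw_p))
```

      Because the running maximum enforces monotonicity, two comparisons can end up sharing the same adjusted p-value.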

      R1-2a: The authors claim that "structure-based atlases can replace the need for pRF mapping in cases where it might otherwise be difficult or impossible to collect pRF data." This claim needs further scrutiny. Currently, only one simulated condition of visual field loss was examined in one subject.

      AR-R1-2a: We agree that further work is needed to fully establish the utility of structure-based atlases. As a first step, we have followed the reviewer’s suggestion and collected an additional dataset from one of the seven participants, in whom we simulated another condition of visual field loss – specifically, loss of the upper right quadrant. This participant is the same individual already presented in the manuscript (C5), but with a different simulated vision loss condition.

      This new condition has been introduced in the Methods, Results and Discussion sections, with a new Figure 10 alongside Figure 9, which shows the 3º-8º scotoma. The relevant changes are as follows:

      “We also demonstrate the clinical relevance of this approach by recovering simulated scotomas (i.e., a ring of visual field loss around fixation and the loss of an entire visual field quadrant), as well as visual field loss in a patient with a neurodegenerative disorder causing large areas of visual field loss.”

      “Additionally, one participant (C5) repeated the task under two simulated vision loss conditions (ring or quadrant loss), and two others (C5, C6) completed it with different levels of eye movement.”

      “Simulated vision loss

      One healthy control participant (C5) also performed a version of the task designed to simulate two forms of visual input loss (i.e., artificial scotoma). These simulations were implemented by: (a) masking a region of the visual field with a grey, annular ring, covering 3º-8º eccentricity, and (b) masking the upper right visual quadrant using a grey quarter-sector overlay. The stimuli and contrast levels used in this task were identical to those described in the original task.”

      “A test-case of simulated loss of visual inputs

      In the previous sections, we showed that the slope of a square root function provides a reliable measure of contrast sensitivity in the brain of healthy controls. But can this brain-level model also quantify loss of visual inputs? To test this, we first simulated an artificial scotoma in one normal sighted participant, by (a) masking a region of the visual field with a grey, annular ring, covering 3°-8° eccentricity (Figure 9A), and (b) masking the upper-right visual quadrant using a grey quarter-sector overlay (Figure 10A). We expect smaller slope values in V1 neuronal populations that would under normal circumstances encode that part of the visual space.

      As expected, we observed reduced responses in V1 locations corresponding to the artificial scotoma (Figures 9 and 10), with increased responses along the edges of the mask for the ring scotoma condition (Figure 9B). This artificial loss of visual input was also clearly present in the cortical contrast sensitivity estimate, with significantly reduced slope steepness in V1 between 3-8° for the ring scotoma condition (Figure 9C&D) and in the upper-right quadrant for the quarter-sector scotoma condition (Figure 10B&C). Additionally, we could recover this scotoma using the calibrated Benson template, although less accurately (Figures 9E and 10D). These results show that this measure of V1 contrast sensitivity is sensitive enough to detect loss of visual inputs in the brain at an individual level, when a complete local loss of sight is simulated, and that this approach does not crucially rely on pRF mapping data from the individual. This supports the utility of our approach in recovering patterns of vision loss and recovery at a cortical level.”
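
      The slope measure described above can be illustrated with a short least-squares sketch. It assumes the contrast response is modelled as response = slope × √contrast + intercept; whether the authors' model includes an intercept, and how it was fitted, is not stated here, so the function `fit_sqrt_slope` and the response values below are hypothetical.

```python
import math


def fit_sqrt_slope(contrasts, responses):
    """Least-squares fit of response = slope * sqrt(contrast) + intercept.

    Returns (slope, intercept). The slope is the quantity treated here as
    the sensitivity estimate; the intercept term is an assumption.
    """
    x = [math.sqrt(c) for c in contrasts]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(responses) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Contrast levels in percent; the response amplitudes are made up.
contrasts = [7.5, 42.2, 60.0, 100.0]
stimulated = [0.1 + 0.2 * math.sqrt(c) for c in contrasts]  # responsive vertex
masked = [0.1, 0.1, 0.1, 0.1]                               # no visual input

print(fit_sqrt_slope(contrasts, stimulated))
print(fit_sqrt_slope(contrasts, masked))
```

      Under this assumed model, a vertex whose response does not grow with contrast yields a slope near zero, which is the signature expected inside a simulated scotoma.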

      “Mapping Simulated and Pathology-Driven Vision Loss

      Our method successfully identified both simulated retinal loss in a healthy volunteer and real visual field loss in a patient with Leber Hereditary Optic Neuropathy (LHON). The signal drop observed in response to masking portions of the visual field in the healthy control was both large and significant at the individual level, as demonstrated by non-overlapping 95% confidence intervals (Figures 9B-C and 10B). This provides proof-of-concept evidence that our approach can detect signal changes in individual patients, which is a critical requirement for clinical translation.

      Unlike previous fMRI studies that used high-contrast stimuli (Farahbakhsh et al., 2022; Pawloff et al., 2023; Ritter et al., 2019), which may not accurately represent partial vision loss due to potential saturation effects and the stimulation of less sensitive retinal cells, our use of multiple contrast levels offers a more nuanced assessment of cortical contrast sensitivity.

      Combined with the large-field set-up allowing stimulation up to 20° eccentricity, this approach may be particularly well-suited for evaluating treatment efficacy in cases of widespread and variable vision loss.

      Future work will focus on further validating reconstruction accuracy under controlled conditions, including simulated scotomas of varying severity and location, expanding testing to larger patient cohorts, and establishing a normative dataset to contextualize patient data.”

      R1-2b: Also, in Figure 7, contrast sensitivity in the periphery differs between pRF mapping and the Benson atlas. How do the authors explain this discrepancy?

      AR-R1-2b: The discrepancy in the periphery between pRF mapping and the Benson atlas is caused by various factors. These include (a) individual differences in the retinotopy/structure relationship that are not captured in the template, (b) the fact that the Benson atlas at larger eccentricities was obtained with hemifield stimulation, and (c) a larger impact of any inaccuracies at larger eccentricities because of cortical magnification. As a result, peripheral vertices are more likely to be mis-assigned by the template than central ones. Note that this adds distortion in cortical visual field maps which will be consistent across timepoints (rather than noise). Critically, a reduction in accuracy does not preclude utility if meaningful differences in spatial patterns in cortical sensitivity can still be recovered, as is the case in our data. We cover this in the discussion.

      “Particularly at large eccentricities however, we initially observed inaccuracies between the template and individual retinotopy eccentricity estimates which led to substantial distortions in cortical visual field maps due to cortical magnification (see Figure A4 in Appendix section). To address this, we adjusted the Benson eccentricity estimates to align with the cortical magnification scaling function (Horton & Hoyt, 1991).”

      “Beyond ROI considerations, we still observed differences in cortical sensitivity between pRF mapping and the adjusted Benson atlas - particularly in the periphery. Several factors likely contribute to this. First, individual differences in the relationship between cortical structure and retinotopy are not fully captured by the template. Second, the Benson atlas has never been fit with empirical data more eccentric than approximately 20°, which naturally limits its precision in the far periphery. Third, because of cortical magnification, any small inaccuracy at larger eccentricities has a disproportionately large effect, making peripheral vertices more susceptible to mis-assignment than central ones. These influences introduce systematic distortions in cortical visual field maps rather than random noise and thus remain consistent across time points - an important point when assessing longitudinal changes (e.g., ageing or gene-therapy interventions). Importantly, the spatial gradients in cortical contrast sensitivity were preserved across both the pRF and Benson atlas approaches, indicating that minor ROI differences do not affect our conclusions. Together, these findings show that the Benson atlas remains a useful alternative when pRF mapping is not feasible.”

      R1-3: Overall, the writing could be significantly improved.

      AR-R1-3: We have made edits throughout the manuscript and hope this has improved the writing.

      Reviewer #1 (Recommendations for the authors):

      R1-Recommendation 1a: The writing can be significantly improved for clarity.

      The introduction section is not well-organized, and the motivation for developing the current method (Paragraphs 2-3) is vague and lacks adequate documentation.

      Several references are missing (e.g., Lines 90-92) or incorrectly placed (e.g., Lines 108-109).

      AR-R1-Recommendation 1a: We have revised the Introduction to clarify the motivation for developing the current method and to correct missing or misplaced references.

      “Still, testing visual function across the visual field remains limited in clinical and therapeutic contexts, especially in patients with drastic central vision loss. In this study, we aimed to address this gap by introducing a novel fMRI-based approach to measure visual field sensitivity across a wide expanse of the visual field (40º diameter).”

      “Beyond visual acuity, functional impairment across the wider visual field can be measured using a range of visual field tests, from the finger counting visual confrontation field test to more complicated and/or computerized tests (e.g., standard automatic perimetry, kinetic perimetry, microperimetry; Rai et al., 2024). Computerized tests typically involve measuring sensitivity to the luminance contrast of a target relative to a background at different visual field locations while the participant’s gaze is fixed on a central point. In some cases (e.g., microperimetry), sensitivity measurements are paired with fundus imaging, offering greater precision in linking visual field functions to specific retinal locations (Rai et al., 2024). As a result, visual field assessments can reveal functionally relevant deficits – including localized sensitivity loss and scotomas – that are not captured by foveal acuity alone, and are therefore potentially valuable for tracking disease progression and therapeutic efficacy.

      Despite their clinical relevance, visual field testing comes with challenges and limitations, and as a result, the inclusion of visual field measures in sight-rescuing therapy trials is limited. Firstly, it requires prolonged fixation and sustained visual attention. This can be very challenging for patients with severe vision loss, who often struggle to fixate, and strain to detect even high intensity stimuli. This can lead to long and unpleasant testing sessions with unreliable results. Secondly, as perception of light stimuli is inherently subjective (Rai et al., 2024) and effortful, patients may vary in their criteria for visual recognition, and in their ability to report visual signals that are weakened or distorted by disease. Together, these constraints reduce the feasibility, robustness, and interpretability of conventional visual field testing in clinical trials, underscoring the need for alternative or complementary approaches that can assess functional vision while placing fewer demands on subjective reporting.”

      “Functional MRI (fMRI) has recently been proposed as a promising alternative to measure visual field loss, as it requires no overt task, and instead measures visual sensitivity directly from brain responses (Farahbakhsh et al., 2022; Prabhakaran et al., 2021; Ritter et al., 2019). Population receptive field (pRF) mapping fMRI can measure which parts of the cortex respond to which parts of the visual scene (Dumoulin & Wandell, 2008).”

      “Finally, most studies use a single maximum contrast stimulus to assess visual function (Broderick et al., 2022; Farahbakhsh et al., 2022; Liu et al., 2006; O’Connell et al., 2016; Ritter et al., 2019).”

      R1-Recommendation 1b: The strengths of the current method and its applicable scenarios are unclear. For example, in Lines 39-40: "We developed an fMRI-based approach to measure contrast sensitivity across the visual field without the need for precise fixation." To what extent can fixation be imprecise? Could this protocol be applied to patients with strabismus, who have biased fixation?

      AR-R1-Recommendation 1b: We agree with the reviewer that the tolerance to fixation challenges is key here and so we collected additional data to respond to your points regarding the effects of eye movement on the cortical contrast sensitivity maps.

      In terms of biased fixation, the approach should be very robust to this, as this would just reduce the cortical visual field covered on one side and extend it on the other.

      We collected new data to test the tolerance to fixation instability across a wide range of eye movement, including severe nystagmus-level movement. Despite large eye movements, the cortical contrast-sensitivity pattern remained largely consistent, though extreme movements reduced slope estimates and flattened the cortical sensitivity pattern for 3cpd, indicating reduced measurement sensitivity for extreme eye movement to high spatial frequency gratings.

      These additions have been incorporated into the Abstract, Methods, Results, and Discussion sections as follows:

      Abstract

      “To assess the method’s tolerance to fixation variability, we further investigated how different levels of eye movement affect cortical sensitivity patterns in two participants. We found that cortical sensitivity patterns were largely preserved across eye movement, particularly at low spatial frequencies. This suggests that our approach can accommodate several degrees of fixation instability, making it suitable for populations with unstable or biased fixation for whom visual field maps are harder to acquire behaviorally (e.g., patients with dense central scotoma or strabismus).”

      Methods

      “Additionally, one participant (C5) repeated the task under two simulated vision loss conditions (ring or quadrant loss), and two others (C5, C6) completed it with different levels of eye movement.”

      Results

      “Effect of eye movement

      Participants C5 and C6 also performed a version of the task designed to test the effect of eye movements. In this version, saccades were elicited by randomly and rapidly shifting the fixation dot away from central fixation (C5: 2º and 5º from fixation and random motion; C6: up to 2º from fixation). Participant C5 was tested using 0.3 and 3cpd gratings at four contrast levels (7.5, 42.2, 60, 100%), while participant C6 was tested only under the low spatial frequency condition (0.3cpd).

      Fixation stability was assessed for each fMRI run using the bivariate contour ellipse area (BCEA), which estimates the area (in degrees<sup>2</sup> or arcmin<sup>2</sup>) of an ellipse that contains approximately 95% of fixation points. BCEA was calculated using the formula BCEA = 2kπσ<sub>h</sub>σ<sub>v</sub>√(1 − p<sup>2</sup>), as described by Morales et al. (2016). In this expression, σ<sub>h</sub> and σ<sub>v</sub> represent the standard deviations of eye position in the horizontal and vertical directions, respectively, while p corresponds to the Pearson correlation coefficient between horizontal and vertical eye positions. The constant k determines the size of the ellipse based on the desired probability area, defined by the relationship P = 1 – e<sup>-k</sup>, with P set to 0.95 in this study. A smaller BCEA indicates greater fixation stability.”
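
      As a rough illustration of the fixation-stability metric described above, the sketch below computes a BCEA from paired horizontal and vertical gaze samples. It assumes the commonly used form BCEA = 2kπσ<sub>h</sub>σ<sub>v</sub>√(1 − p<sup>2</sup>) with k = −ln(1 − P); the function name `bcea`, the use of sample (n − 1) standard deviations, and the gaze samples are assumptions made for illustration, not details taken from the authors' pipeline.

```python
import math


def bcea(h, v, prob=0.95):
    """Bivariate contour ellipse area for gaze samples h, v (same units).

    Assumes BCEA = 2 * k * pi * sd_h * sd_v * sqrt(1 - rho**2),
    where k = -ln(1 - prob) follows from prob = 1 - exp(-k).
    """
    n = len(h)
    mean_h = sum(h) / n
    mean_v = sum(v) / n
    # Sample variances and covariance (n - 1 denominator, an assumption).
    var_h = sum((x - mean_h) ** 2 for x in h) / (n - 1)
    var_v = sum((y - mean_v) ** 2 for y in v) / (n - 1)
    cov = sum((x - mean_h) * (y - mean_v) for x, y in zip(h, v)) / (n - 1)
    sd_h = math.sqrt(var_h)
    sd_v = math.sqrt(var_v)
    rho = cov / (sd_h * sd_v)  # Pearson correlation between the two axes
    k = -math.log(1.0 - prob)
    # Guard against tiny negative values caused by floating-point rounding.
    return 2.0 * k * math.pi * sd_h * sd_v * math.sqrt(max(0.0, 1.0 - rho ** 2))


# Hypothetical gaze positions in degrees (illustration only).
gaze_h = [-1.0, 1.0, -1.0, 1.0]
gaze_v = [-1.0, -1.0, 1.0, 1.0]
print(bcea(gaze_h, gaze_v))
```

      The result is in squared units of the input, so gaze positions in degrees give an area in deg<sup>2</sup>, and doubling the spread of the samples quadruples the area.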

      “Effect of eye movements on V1 cortical sensitivity

      So far, we have demonstrated that our measure of cortical sensitivity can reliably recover known gradients in sensitivity across eccentricities and visual quadrants. We also showed that this measure was consistent across visits and sessions, suggesting its potential utility for monitoring changes over time. However, all prior tasks were conducted under conditions of central fixation, with participants instructed to maintain gaze on a central dot. A key motivation for this approach was its theoretical robustness to fixation instability. We therefore also aimed to investigate how varying degrees of eye movement might influence cortical sensitivity across the visual field.

      To address this, two participants (C5 and C6) completed a modified version of the contrast sensitivity task in which they made eye movements either by following a dot moving randomly at a radius of 2º or 5º around fixation, or by self-initiated very large eye movements. Eye movements across these conditions (Figure 7, bottom row; Figure 8, bottom row) were quantified using BCEA (C5 – Central fixation: mean±SD = 0.57±0.11 deg<sup>2</sup>, 2º eye motion: 2.69±0.48 deg<sup>2</sup>, 5º eye motion: 20.3±1.32 deg<sup>2</sup>, random eye motion: 133.7±23.36 deg<sup>2</sup>; C6 – Central fixation: 0.96±0.56 deg<sup>2</sup>, 2º eye motion: 1.28±0.15 deg<sup>2</sup>). For reference, in severe (idiopathic) nystagmus, the eye movement variability along the vertical and horizontal planes is on average 1.08 deg and 1.60 deg, respectively (Tailor et al., 2021). Assuming a moderate correlation between axes (p = 0.3), the average fixation stability would equate to a BCEA of ~21.46 deg<sup>2</sup> (i.e., ~5º eye motion condition in our data).

      Despite these very large levels of eye movements, we observed that the overall cortical contrast sensitivity spatial pattern across eccentricity remained remarkably consistent (Figure 7, top and middle rows; Figure 8, top row). However, at the most extreme movements, contrast sensitivity estimates (slope values) were lower; and while the overall cortical visual field map structure was still clearly present for low spatial frequencies, it appeared more flattened for 3cpd, suggesting reduced sensitivity of our measure for large eye movement and high spatial frequency stimuli.”

      Discussion

      “Crucially, one advantage of cortical visual field mapping is that the maps are inherently centered on the foveal confluence, providing a stable reference point for comparing responses across eccentricities. When combined with large-field, spatially homogeneous stimuli, this anchoring means that our approach should remain robust to moderate fixation variability and still quantify sensitivity changes across the visual field – provided that fixation instability does not exceed the stimulus extent (40º diameter).

      When measuring the impact of eye movements, we found that spatial sensitivity patterns were largely preserved, even for extreme eye movements (emulating severe nystagmus). However, under the most extreme conditions, sensitivity estimates (i.e., slope values) were reduced, especially for high spatial frequency (SF) stimuli. This likely reflects image blurring from large rapid eye movements, which degrades high-SF inputs and shifts activation toward neurons tuned to lower SFs. This aligns with evidence that nystagmus and large saccades impair perception of fine detail and grating stimuli due to retinal image slip (Abadi & Bjerre, 2002; Dickinson & Abadi, 1985; Hertle et al., 2017; Randall et al., 2020). While classic findings report suppression of low-SF signals during saccades (Burr et al., 1994; Ross et al., 2001), our results suggest that high SF sensitivity may be more vulnerable to large eye movements when participants are presented with 2Hz phase-flickering gratings. Further validation in clinical groups with naturally-occurring fixation instability would further strengthen these conclusions.”

      R1-Recommendation 1c: There are also some confusing descriptions, such as Lines 130-132.

      AR-R1-Recommendation 1c: We have also clarified ambiguous descriptions of the Benson atlas templates.

      “We therefore also evaluated the approach using the structure-based atlas of retinotopic values developed by Benson et al. (Benson et al., 2014; Benson & Winawer, 2018). This atlas predicts retinotopic organization by aligning individual cortical anatomy (e.g., surface curvature) to a group-average template that incorporates an algebraic model of retinotopy (Benson et al., 2014). Once the subject’s brain is aligned to this structural atlas, retinotopic maps defined by the model – i.e., polar angle and eccentricity maps – are projected onto the individual’s cortex. This allows estimation of visual field maps without requiring functional imaging, and provides a non-invasive, anatomy-driven approximation of visual field representations.”

      R1-Recommendation 1d: Line 361: "Assessing the brain's ability to discriminate shapes"-is the author referring to the functional relevance of contrast tuning assessment here? Since the task or stimuli are not related to shapes, this description is unclear.

      AR-R1-Recommendation 1d: We have revised the reference to “discriminating shapes” to more accurately reflect the functional relevance of contrast sensitivity mapping.

      “To measure visual field function, we developed a new measure of cortical contrast sensitivity, assessing the brain’s ability to discriminate gratings of varying spatial frequencies based on luminance variations.”

      R1-Recommendation 2a: Simulated visual loss experiment: only one condition of visual field loss was examined in a single subject. I encourage the authors to include additional subjects to meet statistical test criteria at group level. Simulated scotomas in more visual quadrants, including both central and peripheral areas, should be examined, as asymmetries may exist.

      AR-R1-Recommendation 2a: We agree that it is important to verify that the approach can also capture other types of scotomas. We have therefore now incorporated another simulated condition of visual field loss, namely loss of the upper right quadrant.

      Regarding adding more participants: The drop in signal is clearly large and significant at the individual level (error bars corresponding to 95% confidence interval do not overlap; Figures 9B-C & 10B). The ability to detect signal change at the individual level is what we need for clinical application, and here we are showing proof-of-concept of its feasibility with our approach. However, we do appreciate that it might be valuable to test cortical visual field loss reconstruction accuracy with simulated scotomas of varying levels of vision loss in variable locations. We now highlight this as a future direction.

      Please refer to our response to R1-2a, where we also detail the corresponding changes made in the manuscript.

      R1-Recommendation 2b: Additionally, why do the results from pRF mapping and the corrected Benson atlas differ, particularly in the far periphery?

      AR-R1-Recommendation 2b: Please refer to our response to R1-2b, where we also detail the corresponding changes made in the manuscript.

      R1-Recommendation 3: To validate the recovery of visual field loss in the case study, it would be necessary to include fundus imaging to characterize the structural loss and correlate it with the behavioral and fMRI results.

      AR-R1-Recommendation 3: We included Compass perimetry data for the LHON patient, which is fundus-tracked perimetry and uses fundus imaging to keep the visual stimulation fixed to retinal locations.

      In the context of LHON, the fundus image is not expected to provide more information than perimetry. This is because the visual deficit in LHON arises from optic nerve dysfunction, and retinal abnormalities are typically minimal. Aside from the characteristic pallor of the optic disc, the fundus usually appears normal.

      For illustration, Author response image 1 shows the Compass-acquired fundus image from the LHON patient included in this study. For comparison, we also show a normal fundus image from a 25-year-old male volunteer, reproduced from Häggström, Mikael (2014). "Medical gallery of Mikael Häggström 2014". WikiJournal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 2002-4436. Public Domain.

      Author response image 1.

      We do, however, recognize the importance of linking functional changes to structural alterations (e.g., retinal thickness measured with OCT), and we now highlight this as a key future direction in the discussion. This will be a central focus of a planned follow-up study involving a larger patient cohort.

      “Next steps in this work will therefore involve testing larger patient cohorts with diverse forms of vision loss, validating the approach for tracking pathology over time, and investigating how cortex-based visual field measures relate to and complement other visual field and retinal integrity indices including Compass measures and OCT-derived retinal layer thickness.”

      “Additionally, linking brain-based variations in function across the visual field to behavioral performance (e.g., perimetry, microperimetry) and retinal structure (fundus imaging, retinal thickness from Optical Coherence Tomography), could help bridge the gap between neural measures and functional outcomes. Such integration would provide deeper insights into developmental, learning, and vision loss mechanisms.”

      R1-Recommendation 4a: Why is a 0.5 mm smoothing applied to the contrast task data?

      AR-R1-Recommendation 4a: We have now clarified this in the Methods section. The 0.5 mm FWHM smoothing kernel was applied to the contrast sensitivity task data to meet the minimum requirements of the GLM module in SPM.

      “To accurately capture neural activity across various eccentricities and polar angle locations, minimal smoothing (0.5mm FWHM Gaussian blur) was applied to the contrast sensitivity task data using AFNI’s 3dmerge program. This was done to meet the minimum requirements of the GLM module in SPM.”

      R1-Recommendation 4b: Is this the first time the cortical magnification calibration has been applied to the Benson atlas? I recommend including a figure to describe this method.

      AR-R1-Recommendation 4b: This is indeed the first time this correction has been applied to the Benson atlas. We have now added a figure (Figure 3) to illustrate the eccentricity adjustment procedure applied to the Benson atlas.

      R1-Recommendation 5: In Figure 5, the test-retest reliability can be reported by including r-values.

      AR-R1-Recommendation 5: We have now included Spearman correlation 𝜌-coefficients for test-retest and between-condition comparisons in Figure 6 (previously Figure 5).

      R1-Recommendation 6: Inconsistency in the reporting format of statistical values: e.g., the degrees of freedom are presented with, or without parentheses.

      AR-R1-Recommendation 6: Thank you for pointing this out. We have reviewed and standardized the reporting format of all statistical values throughout the manuscript to ensure consistency. Degrees of freedom are now all presented in parentheses, as detailed below:

      “Using ANOVA, we found the expected interaction between spatial frequency and eccentricity (F(1.96,11.79) = 28.66, p < 0.001; Figure 4) as well as a main effect of eccentricity (F(2.33,13.99) = 12.67, p < 0.001).”

      “We found a main effect of visual field quadrant location on V1 sensitivity (F(2.46,14.76) = 20.71, p < 0.001).”

      “Moreover, there was no interaction between spatial frequency and visual field quadrant position (F(2.16,12.99) = 1.34, p = 0.298), suggesting V1 visual field anisotropies are relatively constant across spatial frequencies.”

      Reviewer #2 (Public reviews):

      R2-1a: Questionable sensitivity to differences in patients. The variability in heat maps across healthy control participants is somewhat surprising. Do differences between individuals represent actual visual sensitivity differences, or are they an artifact of the measurement technique, e.g., due to signal-to-noise differences introduced by local variations in brain anatomy? Will the substantial variance across controls allow for a sufficiently stable baseline to detect meaningful differences in individual patients?

      AR-R2-1a: We agree the variability across healthy controls is surprising. It is unclear whether this reflects true individual differences in visual sensitivity or arises from factors like local signal-to-noise introduced by local variations in brain anatomy. It will be really interesting to investigate this further by examining structural variations across the visual field and comparing them with behavioral measures.

      As for establishing a stable baseline for patient comparisons, this is inherently an empirical question and depends on the degree of vision loss. LHON patients typically show dense central scotomas (up to 15º) in the chronic phase, making them well suited for detecting sensitivity differences – e.g., between central versus peripheral locations. Detecting subtler changes – in the acute phase or other conditions – may be more challenging. We agree with the reviewer that a normative range will be essential for contextualizing patient data; we now mention this in the Discussion and aim to develop such a range in the future based on the present data.

      “Future work will focus on further validating reconstruction accuracy under controlled conditions, including simulated scotomas of varying severity and location, expanding testing to larger patient cohorts, and establishing a normative dataset to contextualize patient data.”

      R2-1b: Also, as the authors rightly point out, Benson atlas does not model differences along meridians, so upper/lower field differences might not be detectable.

      AR-R2-1b: We acknowledge the limitations of the Benson atlas, particularly its inability to model meridional asymmetries (e.g., upper vs. lower visual field). Still, our goal is to provide a method for tracking visual cortex changes over time. By consistently projecting longitudinal functional data onto the same structural image fitted with the Benson atlas, we maintain a stable anatomical reference, which supports reliable comparisons across timepoints – even with limited spatial accuracy. Future improvements could include shearing corrections, Bayesian updating, or alternative models such as DeepRetinotopy developed by Ribeiro et al.

      “Further enhancing the alignment between retinotopic template atlases and individual retinotopic tuning could improve this approach further, for example, by integrating them with functional measures using Bayesian methods (Benson & Winawer, 2018). In parallel, geometric deep learning frameworks such as DeepRetinotopy (Ribeiro et al., 2021) could also offer anatomy-driven predictions from structural MRI, and combining these strategies may yield more accurate and generalizable retinotopic reconstructions.”

      R2-2: Effects of unstable fixation/eye movements not explicitly tested: The methods state, 'In all tasks, participants were asked to report when the color of a central fixation dot changed', suggesting participants maintained fairly good fixation. Most of the results seem to pertain to measurements where central fixation is required. How does unstable fixation affect measurements?

      AR-R2-2: This is an important point. We have now extensively and systematically investigated the impact of eye movements on the cortical contrast sensitivity maps and updated the Abstract, Methods, Results, and Discussion sections (see R1-1b).

      R2-3: Potential for clinical translation. Although it is a sensitive measure, functional MRI is costly, is not available in all clinical settings, requires significant post-processing analyses, and may be contraindicated in some individuals due to safety (e.g., metallic implants) or other concerns (e.g., claustrophobia). These could present significant barriers to widespread clinical translation if this were the ultimate goal of the study.

      AR-R2-3: We agree that fMRI, while sensitive, has practical limitations for broad clinical adoption due to cost, accessibility, and contraindications. However, it remains a valuable tool in targeted contexts, where sensitive detection of visual field loss has large utility – for example for evaluating treatment effects in clinical trials. This application has been demonstrated in recent studies (Farahbakhsh et al., 2022; Maimon-Mor et al., 2025; Haal et al., 2016; Ritter et al., 2019).

      R2-4: Limited range of spatial frequencies. The spatial frequencies tested were still quite low (0.3 and 3cpd) compared to measures such as visual acuity. Extending the measurements to higher spatial frequencies could allow better characterization of central vision, although not necessarily of peripheral vision.

      AR-R2-4: We agree that extending to higher spatial frequencies could improve central vision characterization and note this can be readily incorporated into future studies using the current framework. However, acuity in LHON patients tends to be very low, and in a prior pilot we found that 5 cpd stimuli did not allow us to measure any cortical contrast sensitivity. To characterize the visual field in LHON with fMRI, we therefore aimed to balance central and peripheral coverage: 0.3 cpd ensured broad detectability, while 3 cpd offered a middle ground to assess central vision without exceeding the acuity of this population. Additional approaches, such as neural contrast sensitivity functions (e.g., Roelofzen et al., 2025), may also offer complementary insights, such as acuity and contrast sensitivity across the full spatial frequency range (area under the curve).

      Reviewer #2 (Recommendations for the authors):

      R2-Recommendation 1: It appears that the reliability measures, comparing differences in Spearman correlations between and within sessions, were not tested statistically, but evaluated qualitatively. What was the justification for this? The results only state Spearman values, but the discussion claims that the differences between the two comparisons were significant.

      AR-R2-Recommendation 1: The differences in Spearman correlations between and within sessions were tested statistically, and the omission of p-values was an oversight. We have now revised the Results section to report the results of the paired one-tailed t-test as follows:

      “We collected test-retest reliability measures from 4 out of 7 participants (Figures 6A-B) and benchmarked them against the correlations between the 0.3cpd condition and 3cpd spatial frequency condition, collected in the same session (Figure 6C). If measures are reliable, correlations should be higher for repeated measures with the same spatial frequency stimulus, collected on different days. We tested this prediction using a one-tailed paired t-test.”

      “This difference was statistically significant (t(3) = 2.62, p = 0.0395).”
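
      As a quick consistency check on the reported statistic, the sketch below computes the upper-tail probability of Student's t distribution using only the Python standard library. The function names and the example correlation values are hypothetical and are not taken from the manuscript; the numerical integration is an illustrative stand-in for whatever statistics package the authors used. For t = 2.62 with 3 degrees of freedom, it gives a one-tailed p of approximately 0.0395.

```python
import math
from statistics import mean, stdev

def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def one_tailed_p(t, df, steps=2000):
    """Upper-tail probability P(T > t), using Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # integral of the density from 0 to |t|
    upper = 0.5 - area            # P(T > |t|), by symmetry of the density
    return upper if t >= 0 else 1.0 - upper

def paired_t(x, y):
    """t statistic and degrees of freedom for paired samples (H1: mean(x - y) > 0)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Reported statistic from the response: t(3) = 2.62, one-tailed.
p_reported = one_tailed_p(2.62, 3)

# Hypothetical within-condition vs between-condition correlations for 4 participants.
t_demo, df_demo = paired_t([0.8, 0.7, 0.9, 0.6], [0.5, 0.6, 0.4, 0.5])
```

      With only four participants, the test has three degrees of freedom, so a t value of this size sits close to the conventional 0.05 threshold in a one-tailed test.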

      R2-Recommendation 2a: The variability of heat maps (visual field sensitivities) between healthy controls should also be discussed. What are potential explanations for this variability?

      AR-R2-Recommendation 2a: We have expanded the Discussion section to address the variability observed in cortical sensitivity maps across healthy controls.

      “We also observed intriguing variability in cortical visual field maps across healthy controls, and this variability was consistent across measures. This may reflect genuine individual differences in visual sensitivity that are relevant for behavioral performance. Alternatively, it could arise from factors such as local signal-to-noise differences driven by anatomical variability. However, the fact that maps derived from different spatial stimulus conditions showed markedly different patterns argues against a purely anatomical explanation and suggests that at least part of the variability is functional. Despite this inter-subject variability, variations in cortical contrast sensitivity across eccentricities and visual field quadrants were significant at the individual level indicating high sensitivity.”

      R2-Recommendation 2b: There should also be more discussion about any potential effects of eye movements/unstable fixation in order to address the suitability of the methods for these clinical populations.

      AR-R2-Recommendation 2b: Please refer to our response to R2-2, where we also detail the corresponding changes made in the manuscript.

      Reviewer #3 (Public review):

      R3-1: The authors should more strongly emphasize their findings on the organization of contrast sensitivity, particularly in light of the stimulation extent provided by the wide-field setup.

      AR-R3-1: Thank you for this important point – we have now emphasized more clearly in the manuscript that our method extends the measurement of contrast sensitivity to 20º eccentricity, which represents a significant advancement over previous studies.

      “These results demonstrate that our approach can detect subtle changes in visual sensitivity across eccentricities at the individual participant level. The ability to reveal these gradients was made possible by the large peripheral coverage provided by our large-field stimulation set-up (see Figure A1 in Appendix section), which enabled a more complete characterization of V1 sensitivity across the visual field. Importantly, the same effects were preserved when using retinotopic estimates derived from structure-based atlases, demonstrating that atlas-based methods can be used as an alternative to pRF mapping in cases where it might otherwise be difficult or impossible to directly collect pRF measures. Together, these findings highlight both the validity of our approach and its potential to broaden the scope of visual neuroscience.”

      “Crucially, the ability to visualize these sensitivity gradients was made possible by the large peripheral coverage provided by our large-field stimulation set-up. Such coverage is particularly important for clinical applications, as it enables the detection of visual field losses beyond the macula (i.e., beyond 10º eccentricity) and the evaluation of residual peripheral vision in patients with macular-restricted damage. In doing so, this work provides a useful tool for advancing both basic visual neuroscience and translational research in clinical populations.”

      R3-2: Certain methodological aspects require further clarification, particularly regarding the correction of eccentricity values from the Benson atlas. It's not clear which V1 masks are used for the specific analysis which could have a substantial impact on the reported differences between the two approaches of pRF mapping and atlas-based pRF parameters.

      AR-R3-2: The correction of eccentricity values was performed using the V1 label provided by the Benson atlas. We have now explicitly stated this in the Methods section:

      “We collected data from 7 healthy controls (mean±SD: 29.6±4.7yo; 1M). All controls had normal or corrected-to-normal vision, with no other ocular pathologies, and were recruited from the local staff and student pool at University College London. Each control completed both the population receptive field (pRF) mapping and the fMRI contrast sensitivity task. To assess measurement repeatability, four participants (C2, C4, C5, C6) performed the contrast sensitivity task twice. Additionally, one participant (C5) repeated the task under two simulated vision loss conditions (ring or quadrant loss), and two others (C5, C6) completed it with different levels of eye movement.”

      “Four participants (C2, C4, C5, C6) were invited for a second session in which they repeated the task to assess the reliability of the measures.”

      R3-4: The conclusion that high-contrast patterns as in pRF mapping are not optimal to test for subtle but potentially clinically relevant changes in the visual field coverage is very valid. The suggested use of contrast sensitivity can therefore be a potentially well-suited parameter for estimating visual field losses. The presented work is an interesting starting point and the proposed method of using contrast sensitivity as a measure for partial vision loss should further be explored.

      AR-R3-4: Thank you for the positive evaluation of our work.

      Reviewer #3 (Recommendations for the authors):

      R3-Recommendation 1: The shown organization of contrast sensitivities is consistent with previous studies; however, it extends the measurements to up to 20º eccentricity, which is, to my knowledge, much more than previously reported. The authors should therefore emphasize this more strongly.

      AR-R3-Recommendation 1: Please refer to our response to R3-1, where we also detail the corresponding changes made in the manuscript.

      R3-Recommendation 2: In the Methods section, it is not entirely clear why the eccentricity values originating from the Benson atlas need to be corrected using Horton & Hoyt cortical magnification. Do the authors consider these cortical magnification measurements as ground truth? Is the correction only applied to higher eccentricity values that are not mapped by the Benson atlas?

      AR-R3-Recommendation 2: The Benson et al. (2014) atlas predicts both polar angle and eccentricity from cortical anatomy (curvature, thickness) using a template pRF dataset and a mathematical retinotopic model. However, it does not incorporate a smooth parametric cortical magnification function such as Horton & Hoyt. Because the atlas is fit to an average map across subjects, and because the FreeSurfer alignment used to apply the template cannot incorporate functional information, the atlas cannot capture individual variability in eccentricity or cortical magnification. In practice, we therefore treat the Benson atlas as providing the correct topological layout of eccentricity, but not necessarily the correct eccentricity values for a given individual. Moreover, the data used to generate the Benson atlas have mainly been restricted to the central visual field (roughly 8º-12º), and the Benson atlas itself has never been fit with data more eccentric than 20º. Consequently, peripheral eccentricity values are more model-driven and less constrained by ground-truth data.

      To improve the correspondence between the atlas and expected cortical representations, we applied the Horton & Hoyt cortical magnification function to all eccentricities in the V1 Benson mask (from the foveal confluence to the periphery, up to 90º). We assume that the Horton & Hoyt model, adapted from physiology data, provides an accurate model of group-level cortical magnification (Benson et al., 2021), even though it does not capture individual differences. This means it offers the best approximation of ground truth in the absence of individual pRF data, which are often not feasible to collect in patients with unstable fixation. We have now added a figure that showcases the method and shows how this correction affects the distribution of eccentricity values in the Benson atlas.
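
      For readers unfamiliar with the Horton & Hoyt function, the sketch below shows its commonly cited form for human V1, M(E) = 17.3 / (E + 0.75) mm per degree, together with its integral (cortical distance from the foveal representation) and the inverse mapping from distance back to eccentricity. The constants are the widely quoted values rather than numbers taken from this manuscript, the function names are hypothetical, and the code is not the correction procedure used in the study.

```python
import math

A_MM = 17.3    # commonly cited Horton & Hoyt scaling constant (mm); assumed here
E2_DEG = 0.75  # commonly cited eccentricity offset (deg); assumed here

def magnification(ecc_deg):
    """Linear cortical magnification (mm of cortex per degree) at a given eccentricity."""
    return A_MM / (ecc_deg + E2_DEG)

def cortical_distance(ecc_deg):
    """Distance along V1 from the foveal representation (mm): integral of M from 0 to ecc."""
    return A_MM * math.log(1 + ecc_deg / E2_DEG)

def eccentricity_from_distance(dist_mm):
    """Inverse of cortical_distance: eccentricity (deg) at a given cortical distance."""
    return E2_DEG * (math.exp(dist_mm / A_MM) - 1)

# Round trip for a few eccentricities, including the 20 deg stimulated limit and 90 deg.
round_trip = [eccentricity_from_distance(cortical_distance(e)) for e in (0.5, 5.0, 20.0, 90.0)]
```

      Because magnification falls steeply with eccentricity, equal steps in cortical distance correspond to progressively larger steps in visual angle, which is why peripheral eccentricity values are the ones most affected by the choice of magnification model.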

      R3-Recommendation 3: For the analysis using the atlas-based retinotopy, it is not entirely clear whether the authors also use the provided V1 masks. In other words, differences between the original pRF-based and atlas-based analyses could originate from different borders of V1 rather than from the atlas-based pRF parameters. The authors could try using the same mask for both analyses, either the manually delineated one or the atlas-based one.

      AR-R3-Recommendation 3: This is an important point to clarify. We used a manually delineated V1 mask for each participant's own pRF map data and the Benson mask for the adjusted Benson atlas-based analysis, both restricted to the screen size. The difference in included vertices could indeed have introduced some additional error beyond the atlas/pRF mapping itself. We have opted not to correct this in this version of the manuscript because (1) the error introduced is likely small (on inspection, the V1 ROI delineations aligned well with the Benson ROIs, although using identical masks might slightly improve the mapping, particularly at the very center and outer periphery), and (2) our ROI selection for each approach is in line with typical procedures used in practice. Critically, the spatial gradients in cortical contrast sensitivity are preserved across the pRF and Benson atlas approaches with the different ROIs, so we believe that such improvements would not alter our conclusion that the Benson atlas offers a useful alternative when pRF mapping is not possible. However, we now highlight this important difference between the two approaches in the paper.

      “With this structure-based atlas, we successfully replicated key variations in visual field function (across eccentricity and polar quadrants), although sensitivity to more subtle differences (e.g., upper versus lower quadrant anisotropy) was reduced. This reduction may partly stem from differences in ROI definitions: a manually delineated V1 mask was used for the pRF-based data, while the Benson atlas mask was used for the adjusted Benson atlas analysis. Such differences could introduce minor error beyond the atlas/pRF mapping itself due to differences in the vertices included by each mask.”

      “Importantly, the spatial gradients in cortical contrast sensitivity were preserved across both the pRF and Benson atlas approaches, indicating that minor ROI differences do not affect our conclusions. Together, these findings show that the Benson atlas remains a useful alternative when pRF mapping is not feasible.”

      R3-Recommendation 4: The patient was measured monocularly. Given the widefield stimulation setup and the fact that the blind spot is located at about 15º eccentricity, do the authors expect to measure this blind spot with the given setup?

      Does this have an influence in binocular measurements?

      AR-R3-Recommendation 4: This is an interesting point. In theory, our wide-field setup could allow for the detection of the blind spot, which is located at around 12-15º eccentricity. However, in our LHON patient, the visual field defect extends to or beyond the blind spot, making it difficult to isolate its boundary, as shown in Figure 11 (previously Figure 7). Additionally, under binocular viewing, the brain integrates inputs from both eyes to create a unified percept, which may obscure blind spots unless specific paradigms are used (e.g., binocular rivalry or dichoptic tasks). Whilst this is outside the scope of this work, our setup could be adapted to map out the blind spot or explore phenomena like binocular rivalry more directly in future research.

      R3-Recommendation 5: How stable is the presented wide-field stimulation setup? In other words, does the eye tracker still capture the eye reliably after small head movements?

      AR-R3-Recommendation 5: While small head movements can occur, these were minimized by the use of padding cushions and monitored throughout the session, and the eye tracker maintained reliable tracking throughout the sessions.

      R3-Recommendation 6: Are the shown sine-wave gratings always oriented the same? We would expect orientation tuning curves in the early visual cortex; how could this influence the results?

      AR-R3-Recommendation 6: For six of the seven control participants (C1-C6), the sinewave gratings were presented with a fixed horizontal orientation. In an updated version of the task – used for participant C7, cases of simulated eye movements, cases of artificial scotoma, and the patient – the orientation of the gratings was varied every 5 seconds among four angles (−45º, 0º, 45º, 90º) during each 15-second stimulus block.

      We acknowledge that orientation tuning in the early visual cortex could influence responses, since V1 neurons are selective for specific stimulus orientations and respond most strongly to their preferred orientation. However, we replicated the same overall pattern of results in groups tested with a single orientation and with multiple orientations. Importantly, some participants completed both versions of the task, and the contrast sensitivity patterns remained consistent across conditions. This suggests that the results we report are robust across different orientation-tuned populations for the purposes of this study. A more fine-grained investigation of orientation effects would nevertheless be an interesting direction for future work.

      “For six control participants (C1–C6), gratings were initially presented with a fixed horizontal orientation. In an updated version of the task – used for C7, cases of simulated eye movement, cases of artificial scotoma, and the LHON patient – the orientation varied every 5 s among four angles (−45º, 0º, 45º, 90º). Contrast sensitivity patterns were consistent across single and multiple-orientation conditions, including in participants who completed both versions, indicating robustness across orientation-tuned populations.”

      R3-Recommendation 7: Are pRF centers also fitted outside the stimulated 20º radius? If yes, were they masked for the analysis?

      AR-R3-Recommendation 7: During pRF model fitting, pRF centers were allowed to extend beyond the stimulated visual field, up to approximately 1.5 times the maximum stimulus eccentricity (~30°), to improve model stability near stimulus boundaries. Eccentricity was sampled on a logarithmically spaced grid defined as 2<sup>x</sup>, with x ranging from −5 to 0.6 in steps of 0.2, and then scaled by the maximum stimulus eccentricity (20°) to express pRF centers in degrees of visual angle. This spacing approach provided finer sampling near the fovea and progressively coarser sampling at larger eccentricities, consistent with cortical magnification principles. For all subsequent analyses of cortical contrast sensitivity, pRF centers located outside the stimulated 20° eccentricity were explicitly excluded. Likewise, although the Benson atlas provides eccentricity estimates extending well beyond the stimulated range (up to ~90°), only values within 20° were included to ensure consistency across pRF-based and atlas-based analyses.

      “During pRF model fitting, pRF centers were allowed to extend beyond the stimulated visual field to improve model stability near stimulus boundaries – up to approximately 1.5 times the maximum stimulus eccentricity (~30°). Eccentricity was sampled on a logarithmically spaced grid defined as 2<sup>x</sup>, with x ranging from −5 to 0.6 in steps of 0.2, and then scaled by the maximum stimulus eccentricity (20°) to express pRF centers in degrees of visual angle. This sampling scheme provided finer resolution near the fovea and progressively coarser sampling at larger eccentricities, consistent with cortical magnification principles.”

      “For all subsequent analyses of cortical contrast sensitivity, pRF centers outside the stimulated 20° eccentricity were excluded. Similarly, although the Benson atlas provides eccentricity estimates extending far beyond the stimulated range (up to ~90°), only values within 20° were retained to maintain consistency across pRF-based and atlas-based analyses.”
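
      The sampling scheme described above can be sketched in a few lines. This is an illustration assuming the grid is simply 20° × 2 raised to the power x, for x from −5 to 0.6 in steps of 0.2; the function names and the small numerical tolerance are hypothetical, and the actual fitting is performed by the authors' pRF software. Under this reading, the grid has 29 values running from 0.625° to roughly 30.3°, in line with the stated limit of about 1.5 times the maximum stimulus eccentricity.

```python
MAX_STIM_ECC = 20.0  # maximum stimulus eccentricity in degrees (from the text)

def eccentricity_grid(x_min=-5.0, x_max=0.6, step=0.2, scale=MAX_STIM_ECC):
    """Eccentricities scale * 2**x for x = x_min, x_min + step, ..., x_max."""
    n = int(round((x_max - x_min) / step)) + 1
    return [scale * 2 ** (x_min + i * step) for i in range(n)]

def within_stimulated_field(eccs, limit=MAX_STIM_ECC, tol=1e-9):
    """Keep only eccentricities inside the stimulated radius (tol guards float rounding)."""
    return [e for e in eccs if e <= limit + tol]

grid = eccentricity_grid()            # candidate pRF-centre eccentricities for fitting
kept = within_stimulated_field(grid)  # values retained for the sensitivity analyses
```

      Successive grid values differ by a constant factor of 2 raised to the power 0.2 (about 1.15), so absolute spacing is fine near the fovea and coarse in the periphery.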

      R3-Recommendation 8: L212: Could the authors please clarify what "scaled across eccentricity to account for cortical magnification" means for the given stimulus?

      AR-R3-Recommendation 8: The pRF stimulus was scaled across eccentricity using a logarithmic transformation of retinal radius to approximate cortical magnification. Radial checker boundaries were defined in log eccentricity space (log(r)), resulting in an exponential increase in checker size with eccentricity (scaling factor = 3.2; ~1.37× increase per radial step). As a result, the spatial frequency content of the stimulus decreases with eccentricity (i.e., checker size increases), compensating for known changes in V1 spatial frequency preference across the visual field. This eccentricity dependent scaling inherently relies on precise fixation to stimulate the intended retinal locations, which can be difficult for patients with central vision loss and therefore motivates the use of Benson templates.

      “This scaling was implemented by applying a logarithmic transformation of retinal radius, such that radial checker boundaries were defined in log eccentricity space (log(r), where r denotes eccentricity relative to the fixation target). This produced an exponential increase in checker size with eccentricity (scaling factor = 3.2; ~1.37 times increase per radial step), resulting in lower spatial frequency content at larger eccentricities – consistent with known variations in V1 spatial frequency tuning. Because this eccentricity-dependent scaling assumes precise fixation, it can be challenging for individuals with central vision loss, further motivating the use of Benson atlas templates in such populations.”
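
      The two reported numbers are mutually consistent under one simple reading: if checker boundaries are equally spaced in log(r) with a spacing of 1/3.2, then each boundary radius is exp(1/3.2), or about 1.37, times the previous one. The sketch below illustrates that reading only. The role assigned to the scaling factor, the 0.5° inner radius, and the function name are assumptions and do not come from the stimulus code.

```python
import math

SCALING_FACTOR = 3.2  # value reported in the text; its role here is an assumed interpretation

def ring_boundaries(inner_radius_deg, outer_radius_deg, k=SCALING_FACTOR):
    """Radii equally spaced in log(r), with log-spacing 1/k, from inner to outer radius."""
    ratio = math.exp(1.0 / k)  # multiplicative growth per radial step
    radii = [inner_radius_deg]
    while radii[-1] * ratio <= outer_radius_deg:
        radii.append(radii[-1] * ratio)
    return radii

radii = ring_boundaries(0.5, 20.0)  # the 0.5 deg inner radius is a hypothetical choice
widths = [b - a for a, b in zip(radii, radii[1:])]  # radial checker sizes in degrees
```

      Since each ring is wider than the previous one by the same factor, checker size grows exponentially with ring index, which produces the lower spatial frequency content at larger eccentricities that the passage describes.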

      R3-Recommendation 9: L213: Three runs were measured per session, were they averaged before analysis or analyzed independently? If analyzed independently, how were the individual results handled?

      AR-R3-Recommendation 9: As described in the Methods, data from all three runs were first aligned to an alignment scan that had been co-registered to the MPRAGE image – typically the scan with the fewest outlier voxels, or alternatively, a single-band reference scan in cases of misregistration. The runs were then analyzed as separate regressors in a single design matrix in SPM to account for run-specific variation - following standard recommendations for this software (Author response image 2 shows the SPM design matrix for the GLM). We did not average the runs beforehand due to differences in the order of stimulus presentation across runs. Instead, the GLM modeled each run’s specific presentation sequence to estimate condition-specific beta values, capturing the average contribution of each spatial frequency and contrast level to the BOLD response.

      Author response image 2.

      R3-Recommendation 10: L289: Did the authors check for very small pRF sizes, as SamSrf is prone to fitting many small sizes?

      AR-R3-Recommendation 10: We did not apply an explicit filter to remove very small pRF sizes; we excluded only pRFs with σ > 6.

      R3-Recommendation 11: L384: p is missing before the value.

      AR-R3-Recommendation 11: Thank you for catching this oversight. We have now added the missing p-value in the revised manuscript.

      “Post-hoc tests using Holm-Bonferroni correction show that V1 neuronal populations receiving inputs from the central visual field (0.5-4.5°) showed greater contrast sensitivity to high spatial frequency as compared to low spatial frequency stimuli (steeper slope for the 3cpd versus 0.3cpd condition: 0.5-2.5°: t(6) = 4.35, p<sub>bonf</sub> = 0.0149; 2.5-4.5°: t(6) = 3.471, p<sub>bonf</sub> = 0.0266).”

      R3-Recommendation 12: I have a very subjective comment regarding the figures. I do not really like the use of the hot colormap in this setting, as I feel it is hard to interpret high and low values.

      AR-R3-Recommendation 12: We appreciate the suggestion, but we have had many heated discussions amongst the authors about this and have moved back and forth several times before settling. Hopefully the reviewer will be happy for us to stick with the authors’ eventually agreed-on subjective preference, although we acknowledge that it is by no means a perfect color scheme.

      R3-Recommendation 13: L474: Suddenly, a second session appears in the Results section; please report this in Methods.

      AR-R3-Recommendation 13: Please refer to our response to R3-3, where we also detail the corresponding changes made in the manuscript.

      R3-Recommendation 14: Figure 5C: are the reported results from the first session of the same subjects?

      AR-R3-Recommendation 14: That is correct. The results shown in Figure 6C (previously 5C) reflect correlations between slope estimates obtained from the 0.3 and 3cpd conditions within the same session for each subject. We have updated the panel title to “C. 0.3cpd vs 3cpd (within session)” to clarify this point.

      R3-Recommendation 15: For the classic pRF mapping (Figure 6D), the artificial scotoma shows lower contrast sensitivity within the scotoma and increased values outside its borders. In contrast, using the retinotopic template (Figure 6E), the area of increased sensitivity is shifted inside the scotoma. Can the authors please comment on this discrepancy?

      Is this shift due to systematic differences between the eccentricity values estimated during the pRF run and those derived from the template?

      If such a shift exists, is it induced by the eccentricity correction step performed?

      AR-R3-Recommendation 15: The shift inside the scotoma observed in the atlas-based analysis (Figure 9E; previously Figure 6E) compared to the pRF-based analysis (Figure 9D; previously Figure 6D) likely reflects residual inaccuracies in eccentricity estimates from the adjusted Benson atlas. While the Horton & Hoyt correction improves the alignment of eccentricity values, it does not ensure perfect matching with the pRF data. Without the Horton & Hoyt correction, the misalignment and shift of activity in the scotoma region are even more pronounced (see below).

      We have added a sentence to the Methods section to justify the applied correction. Furthermore, to illustrate the impact of misalignment and its correction on cortical sensitivity maps, we have included an additional figure in the Appendix section showcasing the effect of applying the correction to improve mapping of the artificial scotoma.

      “We initially observed inaccuracies between the template and individual retinotopy eccentricity estimates which led to substantial distortions in cortical visual field maps due to cortical magnification – especially in peripheral locations (see Figure A4 in Appendix section).”

      R3-Recommendation 16: L532: The age and mutation type of the patient are already reported in the Methods. In general, many Methods and Discussion statements are embedded within the Results section.

      AR-R3-Recommendation 16: We are aware that reminding readers of the methods in the Results and foreshadowing the Discussion is a stylistic choice. We chose this approach to make the results easier to interpret for less specialist readers.

      R3-Recommendation 17: L636: Did the authors consider other options for estimating pRF parameters based on anatomical features, like Ribeiro et al. (2021; https://github.com/felenitaribeiro/deepRetinotopy_TheToolbox)?

      AR-R3-Recommendation 17: We agree that alternative approaches to estimating pRF parameters based on anatomical features, such as the DeepRetinotopy method proposed by Ribeiro et al. (2021), are promising and worth exploring. In this study, we used the Benson atlas as a starting point, along with an adjustment of eccentricity estimates based on cortical magnification. Future work could compare the performance of different retinotopic template fitting approaches, including deep learning-based methods, to further improve anatomical alignment and functional predictions.

      “Further enhancing the alignment between retinotopic template atlases and individual retinotopic tuning could improve this approach further, for example, by integrating them with functional measures using Bayesian methods (Benson & Winawer, 2018). In parallel, geometric deep learning frameworks such as DeepRetinotopy (Ribeiro et al., 2021) could also offer anatomy-driven predictions from structural MRI, and combining these strategies may yield more accurate and generalizable retinotopic reconstructions.”

      R3-Recommendation 18: Figure A4: This figure brings up a very important point, namely, whether small eye movements reduce the accuracy of pRF and contrast sensitivity estimates. However, these experiments and results are not reported in the manuscript. I would prefer the authors to add all necessary Methods and Results, or at least not leave this Figure unexplained.

      AR-R3-Recommendation 18: We thank the reviewer for highlighting the importance of this figure. To address this point, we collected additional data and have revised the manuscript to include a dedicated section on the effects of eye movements, with corresponding updates in the Abstract, Methods, Results, and Discussion.

    1. eLife Assessment

      This important study utilizes behavioral data and computational modeling to show that spatial properties of visual attention affect human planning. The methodology and statistical analyses are convincing, though the way attention is conceptualized and modeled could be refined. The findings of this study will interest cognitive scientists studying attention, perception, and decision-making.

    1. eLife Assessment

      This study offers a valuable analysis of how moment-to-moment fluctuations in arousal are associated with structured, non-uniform patterns of brain-wide functional connectivity during wakefulness. Using data-driven analyses of resting-state and naturalistic fMRI with eye tracking, the authors present convincing evidence that arousal is a dynamic, continuous process that shapes brain activity in a structured way beyond a simple global effect. This paper sheds light on the link between brain activity and ongoing fluctuations in arousal and will be of interest to researchers studying large-scale brain functional organization and links between the brain and body.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to characterize how moment-to-moment fluctuations in arousal during wakefulness shape large-scale functional brain connectivity. Using pupil diameter as an index of arousal and high-field functional imaging, they seek to determine whether arousal-related modulation of connectivity is uniform across the brain or organized into structured patterns, and whether such patterns show hemispheric asymmetry. The work further aims to assess whether these organizational features generalize across resting-state and naturalistic viewing conditions.

      Strengths:

      The study addresses an important and timely question regarding how spontaneous variations in arousal influence whole-brain communication during wakefulness. The dataset is rich, combining high-field imaging with concurrent physiological measurements, and the analyses are ambitious in scope. A key strength is the attempt to move beyond region-based effects and to describe arousal-related modulation at the level of large-scale connectivity organization. The comparison across rest and movie viewing provides useful context and suggests a degree of consistency across behavioral states.

      Weaknesses:

      All analyses are based on 7T ultra-high-field imaging. The manuscript does not address whether the reported arousal-related patterns, including the community structure and hemispheric asymmetries, are expected to be reproducible at standard 3T field strengths. It therefore remains unclear whether the findings depend critically on the use of high-field data or whether they would generalize to more widely available datasets, limiting the broader applicability of the results.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript addresses a clear and widely relevant question: how ongoing fluctuations in alertness during wakefulness relate to large scale patterns of coordinated brain activity. The authors combine high field magnetic resonance imaging with simultaneous pupil measurements, and they compute an edgewise measure of arousal-related coupling for every pair of regions. Their main contribution is to show that arousal-related coupling is low dimensional and organized into seven reproducible "connectivity communities", each with characteristic network pair compositions. A secondary contribution is the observation that these communities exhibit systematic but community-specific hemispheric asymmetries, including a striking left/right dissociation within the ventral attention network, where the left side participates broadly across communities while the right side forms a more cohesive, segregated arousal responsive module. A final contribution is cross-context generalization: the same organizational structure and lateralization signatures are largely preserved during naturalistic movie watching.

      Strengths:

      (1) The paper moves beyond state contrasts and quantifies arousal related modulation continuously within wakefulness, directly addressing a gap highlighted in the Introduction.

      (2) The hemispheric asymmetry result is not framed as a crude global dominance effect; the authors explicitly test and argue that the key signal lies in structured spatial heterogeneity rather than mean shifts.

      (3) The cross-paradigm replication in movie watching is a strong design choice and supports the claim that the organizational motifs are not limited to unconstrained rest.

      (4) Arousal effects on BOLD signals and on pupil size can have different delays. The authors have now tested lagged relationships (for example shifting the pupil series forward and backward) to show that the main community structure and lateralization results are not sensitive to an arbitrary temporal alignment.

      (5) Time resolved connectivity results are now shown to be robust to changes in parameters.

    4. Reviewer #3 (Public review):

      Summary:

      The paper investigates neural fluctuations underlying arousal using a combination of resting state/naturalistic movie watching fMRI and eye tracking data. The authors have used several data driven approaches, including time varying sliding window analyses and clustering methods, to characterize large scale brain organization and hemispheric asymmetries associated with arousal fluctuations. This is an interesting study framing arousal as a dynamic, continuously varying process rather than a discrete state. Overall, the manuscript is well written and the authors have provided sufficient details about the methodological choices, their impact on the results, along with the limitations of the study.

      Strengths:

      This is an interesting study framing arousal as a dynamic, continuously varying process rather than a discrete state. Overall, the manuscript is well written and provides sufficient methodological and analytical details to evaluate the results.

      Weakness:

      While the study provides new insights regarding neural processes underlying arousal, future studies may be needed to further examine the implications of the identified clusters and patterns.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) First, a central claim is that arousal modulates functional connectivity in a hemispherically asymmetric and community-specific manner. Although structured asymmetries are demonstrated at the group level, it remains unclear whether these effects reflect a stable neurobiological principle or arise from high-dimensional, connection-wise analyses that are sensitive to sampling variability. Given the interpretive weight placed on hemispheric lateralization, stronger evidence of robustness and individual-level consistency would be necessary to support this conclusion.

      We appreciate your critical comments on the robustness of our lateralization findings. We fully agree with you that it is essential to demonstrate that the observed hemispheric asymmetries reflect a stable neurobiological principle rather than an artifact of sampling variability or high-dimensional noise. To address this concern, we performed two rigorous validation analyses using 500-iteration resampling schemes, consisting of a split-half reliability test and a participant-level consistency assessment.

      First, to ensure our findings do not depend on specific sample compositions, we conducted a split-half reliability test where the dataset was randomly partitioned into two independent subgroups over 500 iterations. As shown in Figure S1A, the community labels maintained high spatial consistency across iterations (as evidenced by the confusion matrix and Dice coefficient distributions), and our original findings—including network-pair community architecture (Fig. S2A), regional affiliation patterns (Fig. S3A-B), and arousal–tvFC coupling lateralization (Fig. S4A-B)—were consistently situated at the center of the iteration distributions.

      Second, to account for potential within-participant dependencies in the HCP 7T dataset, we performed a participant-level resampling analysis (N = 139). By randomly selecting a different session for each participant across 500 iterations, we confirmed that the community architecture and hemispheric biases remain robust even under this strict control (Figure S1A, S2B, S3C-D and S4C-D). Collectively, these additional analyses provide strong evidence that the hemispheric lateralization we reported is not a byproduct of sampling bias, but instead represents a stable organizational principle of the arousal-modulated connectome.
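
      The two resampling schemes described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the toy run list, the function names, and the choice to split at the level of runs are all hypothetical, and the downstream clustering step is omitted.

```python
import random
from collections import defaultdict

def one_run_per_participant(runs, rng):
    """Pick one run at random for each participant.

    `runs` is a list of (participant_id, run_id) pairs (hypothetical layout).
    """
    by_participant = defaultdict(list)
    for participant, run in runs:
        by_participant[participant].append(run)
    return {p: rng.choice(r) for p, r in sorted(by_participant.items())}

def split_half(items, rng):
    """Randomly partition `items` into two disjoint halves."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

rng = random.Random(0)  # fixed seed so the resampling is reproducible
toy_runs = [("P1", "r1"), ("P1", "r2"), ("P2", "r1"),
            ("P3", "r1"), ("P3", "r2"), ("P3", "r3"), ("P4", "r1")]

# 500 iterations of each scheme, mirroring the iteration count in the text.
participant_samples = [one_run_per_participant(toy_runs, rng) for _ in range(500)]
half_splits = [split_half(toy_runs, rng) for _ in range(500)]
```

      In a real analysis each resampled set would be re-clustered and compared with the original partition, for example with Dice coefficients.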

      (2) Second, all analyses are based on ultra-high-field imaging. The manuscript does not address whether the reported arousal-related patterns, including the community structure and hemispheric asymmetries, are expected to be reproducible at standard field strengths. It therefore remains unclear whether the findings depend critically on the use of high-field data or whether they would generalize to more widely available datasets, limiting the broader applicability of the results.

      We appreciate your constructive comments on the generalizability of our findings across different field strengths.

      As you noted, our primary motivation for employing 7T ultra-high-field imaging was to leverage its superior signal-to-noise ratio (SNR) and significantly enhanced BOLD sensitivity. These technical advantages were instrumental in capturing the subtle, moment-to-moment coupling between spontaneous pupillary fluctuations and tvFC—signals that might be close to the detection threshold in standard field strength environments.

      However, we fully recognize your point that 3T remains the standard in most clinical and research settings. In the revised manuscript, we have added a dedicated discussion to address this (page 21, lines 447-456):

      “Fifth, the findings reported here were derived exclusively from ultra-high-field (7T) imaging data. The superior BOLD sensitivity of 7T fMRI was instrumental in resolving the fine-scale community architecture of arousal–tvFC coupling, which involves subtle signals that may be challenging to detect at lower field strengths. Given that 3T remains the most common field strength for neuroimaging research and clinical applications, future investigations are needed to determine the extent to which these organizational principles generalize to standard field strength data. Validating these motifs in large-scale 3T datasets will be essential to establish their broader applicability across different imaging environments.”

      (3) Third, arousal-connectivity coupling is assessed using zero-lag correlations between pupil diameter and time-resolved connectivity estimates. Physiological and hemodynamic considerations suggest that pupil-linked arousal and blood-based imaging signals may exhibit systematic temporal delays. The absence of analyses examining sensitivity to such delays raises the possibility that the reported coupling patterns depend on a specific temporal alignment assumption.

      Given the inherent delay of the hemodynamic response function (HRF) and the complex temporal relationship between pupillary dynamics and neural activity, we conducted an additional lagged cross-correlation analysis to test the sensitivity of our findings. Following established frameworks for linking BOLD signals with pupillometry (Yellin et al., 2015; Gonzalez-Castillo et al., 2022; Lloyd et al., 2023), we systematically shifted the pupil time series relative to the fMRI data by -3 TR to +3 TR (-3s to +3s) and evaluated the consistency of the community architecture across these different lags using Dice coefficients.

      As shown in Figure S5, these results demonstrate that the community organization remains stable across the tested range of physiological delays. This stability indicates that the arousal-modulated communities we reported are not specific to the zero-lag assumption but instead persist throughout the physiologically plausible lag window. Consequently, our findings reflect a robust neurobiological phenomenon rather than an artifact of a specific temporal alignment.
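
      A minimal sketch of the two ingredients of this lag analysis: aligning a shifted pupil series with the connectivity series, and scoring agreement between two community partitions with Dice coefficients. The function names, the sign convention for the lag, and the assumption that community labels have already been matched between partitions are illustrative choices, not details taken from the manuscript.

```python
def align_with_lag(pupil, fc, lag):
    """Pair pupil[t + lag] with fc[t] and return the overlapping samples.

    Convention (an assumption): a positive lag pairs each connectivity
    sample with a later pupil sample; lag is in samples (TRs).
    """
    n = min(len(pupil), len(fc))
    if lag >= 0:
        return pupil[lag:n], fc[:n - lag]
    return pupil[:n + lag], fc[-lag:n]

def dice_per_community(labels_a, labels_b):
    """Dice coefficient per community label between two partitions.

    Assumes labels are already matched across partitions (label c in
    `labels_a` denotes the same community as label c in `labels_b`).
    """
    scores = {}
    for c in sorted(set(labels_a) | set(labels_b)):
        a = {i for i, lab in enumerate(labels_a) if lab == c}
        b = {i for i, lab in enumerate(labels_b) if lab == c}
        scores[c] = 2 * len(a & b) / (len(a) + len(b))
    return scores

# Toy example: five edges, two partitions that disagree on one edge.
toy_dice = dice_per_community([0, 0, 1, 1, 2], [0, 0, 1, 2, 2])
lagged_pupil, lagged_fc = align_with_lag([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], lag=2)
```

      Repeating the clustering for each lag from -3 to +3 samples and computing these Dice scores against the zero-lag partition would give the kind of stability summary described above.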

      (4) Fourth, the estimation of time-resolved connectivity relies on a single choice of sliding-window length. The manuscript does not examine whether the reported patterns are stable across different window sizes. Given ongoing concerns about parameter dependence in time-resolved connectivity analyses, sensitivity analyses would be important to establish that the findings are not artifacts of a particular analytical choice.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S5, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. These findings provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data rather than being driven by specific analytical choices in the sliding-window setup.
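
      For readers unfamiliar with the estimator, a sliding-window connectivity time course for one pair of regions can be sketched as below. This is an illustration only: it assumes a 1 s sampling interval (the response equates 3 TR with 3 s), so the 30 s window and 5 s step become 30 and 5 samples, and the two toy signals are synthetic.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sliding_window_fc(ts_a, ts_b, window, step):
    """Correlation between two regional time series in each window."""
    starts = range(0, len(ts_a) - window + 1, step)
    return [pearson(ts_a[s:s + window], ts_b[s:s + window]) for s in starts]

# Synthetic signals whose phase relationship drifts over time.
ts_a = [math.sin(0.30 * t) for t in range(120)]
ts_b = [math.sin(0.31 * t) for t in range(120)]

# 30-sample window, 5-sample step (30 s and 5 s if the sampling interval is 1 s).
tvfc = sliding_window_fc(ts_a, ts_b, window=30, step=5)
```

      A sensitivity analysis of the kind described would rerun this with each window and step combination and then compare the resulting community partitions.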

      (5) Finally, the identification of seven connectivity communities is a central result, yet the justification for this choice relies primarily on a single clustering quality measure. In practice, evaluation of clustering solutions typically draws on multiple complementary criteria, including measures of compactness and separation, approaches for selecting the number of clusters, and assessments of stability under resampling. Without such complementary evaluations, it is difficult to determine whether the reported community structure reflects a stable organizational feature or sensitivity to specific methodological decisions.

      We agree that relying on a single measure can be limiting, and in the revised manuscript, we have implemented a comprehensive multi-criteria evaluation to justify our selection of K=7. To ensure the robustness of the community partition, we expanded our analysis to include several complementary indices, such as the Davies-Bouldin Index, Calinski-Harabasz Score, and Silhouette Coefficient, alongside the original Within-Cluster Sum of Squares (WCSS), as detailed in Figure S7A.

      To further minimize subjective bias in "elbow" detection, we utilized the L-method (Salvador & Chan, 2004), which identifies the optimal K by minimizing the combined root-mean-square error (RMSE) of two linear regression segments. As illustrated in Figure S7B, the RMSE was minimized at K=7, providing a robust mathematical basis for our partition. Furthermore, we systematically visualized the community maps across a range of granularities from K=5 to 9 (Figure S7C). This stability analysis demonstrates that the fundamental topological features and the resulting hemispheric asymmetries are not transient artifacts of a specific K but are consistently preserved as the clustering granularity increases. These additional evaluations demonstrate that the seven-community structure reflects a stable organizational feature of arousal-modulated connectivity.
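
      The idea behind the L-method can be sketched in a few lines: fit one straight line to the left part of the evaluation curve and another to the right part, and choose the split that minimizes the size-weighted sum of the two RMSEs. The sketch below is a simplified reading of that procedure, not the authors' implementation; the weighting by fraction of points, the function names, and the toy curve are assumptions.

```python
import math

def line_rmse(xs, ys):
    """RMSE of the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n)

def l_method_knee(ks, scores):
    """Return (knee_k, combined_rmse) for the best two-segment linear fit.

    The knee is taken as the last point of the left segment; each segment
    must contain at least two points.
    """
    n = len(ks)
    best_k, best_err = None, float("inf")
    for split in range(2, n - 1):  # left = [:split], right = [split:]
        left = line_rmse(ks[:split], scores[:split])
        right = line_rmse(ks[split:], scores[split:])
        err = (split / n) * left + ((n - split) / n) * right
        if err < best_err:
            best_k, best_err = ks[split - 1], err
    return best_k, best_err

# Toy evaluation curve built to bend after K = 7: steep decline, then shallow.
ks = list(range(2, 16))
scores = [100 - 12 * (k - 2) if k <= 7 else 36 - (k - 8) for k in ks]
knee_k, knee_err = l_method_knee(ks, scores)
```

      Because the criterion is computed rather than judged by eye, it removes the subjectivity of visual elbow picking, although the result still depends on the range of K included in the curve.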

      Reviewer #2 (Public review):

      (1) Arousal effects on BOLD signals and on pupil size can have different delays, so it would be valuable to test lagged relationships (for example, shifting the pupil series forward and backward) to show that the main community structure and lateralization results are not sensitive to an arbitrary temporal alignment.

      We agree with you that accounting for the varying delays between BOLD signals and pupillary dynamics is essential for ensuring the robustness of our results. We conducted a comprehensive lagged cross-correlation analysis to address it. Following established frameworks for linking BOLD signals with pupillometry (Yellin et al., 2015; Gonzalez-Castillo et al., 2022; Lloyd et al., 2023), we systematically shifted the pupil time series relative to the fMRI data by -3 TR to +3 TR (-3s to +3s) and evaluated the consistency of the community architecture across these lags using Dice coefficients.

      As shown in Figure S5C, these results demonstrate that the core community organization remains stable across the tested range of physiological delays. This stability confirms that our findings are not sensitive to an arbitrary temporal alignment but instead reflect a robust neurobiological phenomenon that persists throughout the physiologically plausible lag window.

      (2) Pupil diameter covaries with blinks, eye closure, and other factors that can covary with head motion and physiological noise. The Methods include substantial quality control and denoising, including motion regression and scrubbing, plus exclusions for eye closure.

      We appreciate your attention to these potential confounding factors. Although we implemented rigorous preprocessing, including regressing confounds out of the fMRI images, we agree that physiological noise and motion may have influenced the pupil signals.

      To address this, we conducted an additional control analysis where we included head motion (framewise displacement, FD) and the global signal (defined as the mean signal across all gray matter voxels) as covariates when calculating the arousal–tvFC coupling. We then re-evaluated the similarity between the resulting community architecture and our original findings. As shown in Figure S4, the community structure remained stable after controlling for these variables.

      Regarding eye closure, we intentionally did not regress this out, as extensive literature demonstrates that eye closure is itself a reliable physiological proxy for arousal levels (Sommer & Golz, 2010; Chang et al., 2016; Gonzalez-Castillo et al., 2022); regressing it out would likely remove the very arousal-related coupling effects we aim to investigate.

      (3) The dataset is described in terms of runs retained (for example, 485 resting runs), and runs are treated as observations in clustering after z-scoring across runs. If multiple runs come from the same individuals, the manuscript would benefit from explicitly showing that results replicate at the participant level (for example, community structure stability within participant across runs, and participant-level summary statistics used for inference), rather than relying primarily on pooled run-level patterns.

      We fully agree with you that it is essential to demonstrate that the observed hemispheric asymmetries reflect a stable neurobiological principle rather than an artifact of sampling variability or high-dimensional noise. To address this concern, we performed two rigorous validation analyses using 500-iteration resampling schemes, consisting of a split-half reliability test and a participant-level consistency assessment.

      First, to ensure our findings do not depend on specific sample compositions, we conducted a split-half reliability test where the dataset was randomly partitioned into two independent subgroups over 500 iterations. As shown in Figure S1A, the community labels maintained high spatial consistency across iterations (as evidenced by the confusion matrix and Dice coefficient distributions), and our original findings—including network-pair community architecture (Fig. S2A), regional affiliation patterns (Fig. S3A-B), and arousal–tvFC coupling lateralization (Fig. S4A-B)—were consistently situated at the center of the iteration distributions.

      Second, to account for potential within-participant dependencies in the HCP 7T dataset, we performed a participant-level resampling analysis (N = 139). By randomly selecting a different session for each participant across 500 iterations, we confirmed that the community architecture and hemispheric biases remain robust even under this strict control (Figure S1A, S2B, S3C-D and S4C-D). Collectively, these additional analyses provide strong evidence that the hemispheric lateralization we reported is not a byproduct of sampling bias, but instead represents a stable organizational principle of the arousal-modulated connectome.

      (4) Time-resolved connectivity is estimated using a 30-second sliding window and 5 second step. It is reasonable to wonder whether the same conclusions hold with alternative estimators that do not rely on fixed windows. The Discussion acknowledges this limitation, but adding a small robustness analysis would make the paper more definitive.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S3, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. Furthermore, the core hemispheric asymmetry patterns were robustly preserved regardless of the specific windowing configuration used. These results provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data and are stable across a broad range of temporal scales.

      Reviewer #3 (Public review):

      (1) A major limitation of the study is the limited discussion of subcortical regions, which play a central role in arousal regulation according to extensive prior literature. Although the current analyses focus primarily on cortical organization, the authors should include a brief discussion of how their findings relate to subcortical arousal systems.

      We completely agree that subcortical structures are pivotal drivers of arousal regulation. While our study primarily utilized a symmetric cortical atlas to ensure a mathematically rigorous assessment of hemispheric lateralization, we recognize that the exclusion of subcortical regions limits the functional interpretation of the observed patterns.

      In the revised manuscript, we have added a dedicated passage to the Discussion (page 20, lines 412-428) to address this point:

      “First, to ensure a mathematically rigorous assessment of hemispheric asymmetry, our analysis was restricted to a symmetric cortical parcellation. Consequently, while we demonstrate that arousal-modulated connectivity follows a structured macroscopic architecture, we did not explicitly analyze the subcortical nuclei hypothesized to drive these patterns. We hypothesize that the presence of these low-dimensional cortical communities reflects coordinated motifs rather than a homogeneous gain modulation, potentially mirroring the differentiated projection patterns of subcortical neuromodulatory systems. For instance, the locus coeruleus–noradrenergic pathway (Chandler et al., 2014; Schwarz & Luo, 2015) and thalamus (Hwang et al., 2017; Shine, 2019; Müller et al., 2020; Shine et al., 2023) possess extensive yet non-uniform projections that may anchor the community-specific and hemispherically asymmetric patterns observed here.”

      (2) While sliding window methods can capture temporal changes in functional organization, they have limitations in characterizing moment-to-moment neural fluctuations. In particular, results can be highly sensitive to window length and step size. The manuscript would benefit from (a) a clearer discussion of these methodological limitations, (b) justification for the chosen window length and step size, and (c) a sensitivity analysis demonstrating whether the main findings are robust across different parameter choices.

      To ensure that our findings are not artifacts of a specific analytical choice, we performed an exhaustive sensitivity analysis by repeating our entire pipeline across a wide range of window lengths (30s, 35s, 60s, and 90s) and step sizes (1s, 5s, and 10s). We then employed Dice coefficients to quantify the topological similarity between these alternative configurations and our original parameters (30s window, 5s step).

      As shown in Figure S5, our results demonstrate high topological consistency, with Dice coefficients for community structures remaining consistently above 0.8 across all tested parameter combinations. Furthermore, the core hemispheric asymmetry patterns were robustly preserved regardless of the specific windowing configuration used. These results provide strong evidence that the arousal-modulated organizational principles we reported are inherent to the data and are stable across a broad range of temporal scales.
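
      As a rough illustration of the comparison described above (not the authors' code), a Dice coefficient between two partitions of the same edges could be computed as sketched below. The function names, the toy label vectors, and the best-overlap matching rule are all assumptions made for this example; the response does not state how community labels were matched across parameter settings.

```python
def dice(a, b):
    """Dice coefficient between two collections: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def community_dice(ref_labels, alt_labels):
    """Dice score of each reference community against its best-overlapping
    community in the alternative partition (hypothetical matching rule)."""
    if len(ref_labels) != len(alt_labels):
        raise ValueError("both partitions must label the same edges")
    ref, alt = {}, {}
    for edge, (r, a) in enumerate(zip(ref_labels, alt_labels)):
        ref.setdefault(r, set()).add(edge)
        alt.setdefault(a, set()).add(edge)
    # Label values are arbitrary across runs, so compare memberships, not labels.
    return {r: max(dice(members, other) for other in alt.values())
            for r, members in ref.items()}


# Toy example: 8 edges clustered under the original and an alternative window setting.
reference = [0, 0, 0, 1, 1, 1, 2, 2]
alternative = [5, 5, 5, 3, 3, 4, 4, 4]
scores = community_dice(reference, alternative)
```

      Because cluster labels carry no meaning across runs, the sketch compares membership sets rather than label values. A one-to-one assignment would be a stricter alternative to the best-overlap rule used here.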

      (2) The authors use k-means clustering to identify groups of brain regions and refer to these groupings as "communities." However, in general, community detection typically refers to graph-based algorithms that identify modules based on connectivity structure (e.g., modularity maximization). The clusters derived from k-means in feature space are not necessarily equivalent to graph-theoretic communities. The authors should explicitly clarify this distinction and adjust terminology accordingly to avoid conceptual ambiguity.

      We agree that the term "community detection" is often specifically associated with graph-based algorithms, such as modularity maximization, which define modules based on topological connectivity. In contrast, our implementation of k-means identifies groupings based on the similarity of arousal–FC coupling patterns within a high-dimensional feature space.

      To avoid any conceptual ambiguity or potential confusion, we have explicitly clarified this distinction in the Methods (pages 24-25, lines 533-542) section of the revised manuscript:

      “We employed the k-means clustering algorithm (Euclidean distance) to explore a range of cluster solutions from K = 2 to 15. To ensure the stability of the results and avoid local optima, each K was repeated 250 times with random initializations. The optimal number of clusters was determined by evaluating clustering quality and reproducibility (e.g., maximizing silhouette stability). It is important to clarify that "communities" in this context refer to clusters of edges that exhibit similar arousal-modulation motifs within a high-dimensional feature space, rather than topological modules typically derived from graph-theoretic algorithms like modularity maximization. This procedure consistently identified seven distinct communities, each representing a robust, arousal-sensitive connectivity motif that characterizes the large-scale organization of brain-pupil coupling.”
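
      The repeated-initialization idea in the passage above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions: the helper names, the toy two-dimensional points, and the "keep the lowest within-cluster sum of squares" rule are hypothetical, and the silhouette-based selection of K mentioned in the manuscript is not shown. A real analysis would more likely rely on an established library implementation.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, n_iter=100):
    """One run of Lloyd's algorithm from a random initialization."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def kmeans_restarts(points, k, n_restarts=250, seed=0):
    """Repeat k-means and keep the solution with the lowest inertia."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])


# Toy feature vectors standing in for per-edge arousal-coupling profiles.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, inertia = kmeans_restarts(points, k=2)
```

      Repeating the fit guards against a single poor initialization. Choosing K itself requires a separate criterion, which this sketch leaves out.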

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) To strengthen confidence in the reported hemispheric effects, the authors should provide additional robustness analyses, such as subject-level consistency of lateralization measures, split-half or resampling reliability, and sensitivity to alternative preprocessing or analysis choices. Reporting the distribution of lateralization effects across individuals would help clarify whether the observed asymmetries reflect stable features or group-level averages driven by a subset of connections or participants.

      We agree that establishing the individual-level stability of lateralization is essential. We have now provided extensive validation, including split-half reliability tests and participant-level consistency analyses (500 iterations). These results confirm that the reported asymmetries are robust and consistent across the sample. Please refer to Reviewer #1 Weakness 2 for the full analysis and associated figures (Figures S1–S4).

      (2) The authors should examine whether arousal-connectivity coupling patterns are robust to plausible temporal delays between pupil diameter and BOLD signals. Lagged or time-shifted analyses would help establish that the findings do not depend on a specific zero-lag assumption.

      We agree that validating the coupling between pupil dynamics and the time-varying FC is essential. To address this, we conducted a lag sensitivity analysis by shifting the pupil-derived arousal signal within a physiologically plausible range (-3 to +3 TR). The community architecture remains highly consistent across these temporal offsets, showing high spatial correlation and Dice coefficients with our original findings. This stability confirms that the identified organizational motifs are robust and not dependent on a specific zero-lag assumption. For the full details of this validation and the associated figures, please refer to Reviewer #1 Weakness 3 and Figure S5 in the Supplementary Material.
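
      For readers unfamiliar with lagged analyses, the following sketch shows one way a pupil signal could be shifted by -3 to +3 samples (TRs) relative to a connectivity time series before correlating the two. It is illustrative only: the function names, the synthetic sinusoidal signals, and the sign convention (a positive lag means connectivity trails the pupil) are assumptions, not details taken from the manuscript.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(pupil, fc, lag):
    """Correlate pupil[t] with fc[t + lag], trimming the non-overlapping ends."""
    if lag > 0:
        return pearson(pupil[:-lag], fc[lag:])
    if lag < 0:
        return pearson(pupil[-lag:], fc[:lag])
    return pearson(pupil, fc)


# Synthetic data: the connectivity series is the pupil series delayed by 2 samples.
pupil = [math.sin(0.3 * t) for t in range(200)]
fc = [math.sin(0.3 * (t - 2)) for t in range(200)]

profile = {lag: lagged_correlation(pupil, fc, lag) for lag in range(-3, 4)}
best_lag = max(profile, key=profile.get)
```

      In this toy case the correlation peaks at the lag that undoes the built-in delay, while neighbouring lags remain high. A robustness check of the kind described would repeat the full pipeline at each lag rather than inspect a single edge.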

      (3) Given reliance on a single sliding-window length, the authors should assess how key results vary across different window sizes. Demonstrating stability of the community structure and lateralization patterns across parameter choices would strengthen the methodological foundation of the study.

      We have conducted an exhaustive sensitivity analysis across various window lengths (30s, 35s, 60s, 90s) and step sizes (1s, 5s, 10s). The high Dice coefficients (>0.8) confirm that our findings are not dependent on specific windowing choices. Please refer to Reviewer #1 Weakness 3 and Figure S5 for the full results.

      (4) The justification for the chosen number of connectivity communities would benefit from additional clustering evaluations. Complementary criteria such as measures of compactness and separation, model selection approaches for determining the number of clusters, and stability or reproducibility under resampling would help establish whether the reported community structure is robust rather than method-dependent.

      To strengthen the mathematical basis for our partition, we have implemented a multi-metric evaluation and the L-method for objective K selection. These metrics consistently support the seven-community structure. Please refer to our response to Reviewer #1 Weakness 5 and Figure S7 for the comprehensive evaluation.

      (5) The manuscript would benefit from a clearer discussion of why ultra-high-field imaging was required for the present analyses and whether similar results are expected at standard field strengths. If feasible, validation using lower-field data or reference to existing datasets would substantially enhance generalizability.

      We have expanded our discussion to clarify that 7T was instrumental for capturing the subtle, high-frequency arousal-tvFC coupling due to its superior SNR. We also explicitly discuss the potential and limitations of generalizing these findings to 3T datasets. Please refer to our response to Reviewer #1 Weakness 2 for the full discussion (page 21, lines 447-456).

      (6) The authors should more explicitly report exclusion related to pupil measurements and discuss how missing or noisy pupillometry may affect the applicability of the approach in other datasets or experimental settings.

      We agree that transparency in data screening is essential for the reproducibility of our method. In the revised manuscript, we have clarified our quality control pipeline in the quality control section in Methods (page 23, lines 502-510):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female). Runs were excluded if (a) more than 20% of frames exceeded motion thresholds, (b) eye tracking did not cover the full fMRI time series, or (c) more than 90% of samples were classified as eye closure. After applying these criteria, 485 of the initial 723 scans were retained for analysis. The same quality-control pipeline was applied to the movie-watching dataset, yielding 513 usable scans out of the original 725. Detailed information on data retention and run distribution per participant is summarized in Figure S9.”
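
      The three exclusion rules quoted above translate directly into a small filter. The sketch below is only an illustration: the function and argument names are hypothetical, and the thresholds and scan counts are copied from the quoted passage.

```python
def keep_run(frac_high_motion, eye_tracking_complete, frac_eye_closed):
    """Apply the three run-level exclusion criteria from the quoted Methods text."""
    if frac_high_motion > 0.20:    # (a) more than 20% of frames exceed motion thresholds
        return False
    if not eye_tracking_complete:  # (b) eye tracking does not cover the full time series
        return False
    if frac_eye_closed > 0.90:     # (c) more than 90% of samples classified as eye closure
        return False
    return True


def retention_percent(kept, total):
    """Percentage of scans retained after quality control."""
    return round(100 * kept / total, 1)


rest_retention = retention_percent(485, 723)   # resting-state scans
movie_retention = retention_percent(513, 725)  # movie-watching scans
```

      With the reported counts, roughly two thirds of resting-state scans and about 71% of movie-watching scans pass the filter.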

      Furthermore, we have added a discussion regarding how noisy or missing pupillary signals might affect the generalizability of our approach (pages 20-21, lines 437-447):

      “Fourth, the generalizability of our approach to external cohorts warrants caution regarding pupillary data integrity. In contexts where high-fidelity eye-tracking is technically demanding—such as in clinical settings involving patients with restricted compliance or in naturalistic fMRI studies—the prevalence of blink artifacts and signal dropouts may bias the estimation of arousal-modulated states. Excessive reliance on data interpolation in such cases could artificially smooth temporal fluctuations, leading to an overestimation of community stability. Future applications should therefore prioritize high-frequency sampling and potentially incorporate multi-modal physiological features (e.g., respiratory or cardiac signals) to cross-validate arousal dynamics when pupillary data is suboptimal (Meissner et al., 2023; Bolt et al., 2025; Weijs et al., 2025).”

      (7) The authors should ensure that all data and analysis code necessary to reproduce the results are made publicly available in accordance with eLife policies, including clear documentation of preprocessing steps, parameter choices, and clustering procedures.

      All analysis code and the necessary processed data required to reproduce our findings have been made publicly available through https://github.com/kongxy6478/Arousal-modulates-functional-connectivity. This repository includes documented pipelines for pupillometry cleaning and fMRI denoising, alongside the core Python scripts used for sliding-window connectivity calculation, k-means clustering, and hemispheric lateralization analysis.

      Reviewer #2 (Recommendations for the authors):

      (1) Add a lag sensitivity analysis between pupil-derived arousal and time-resolved connectivity, and report whether the seven community structure and key lateralization findings are stable across a plausible lag range.

      We agree that validating the coupling between pupil dynamics and the time-varying FC is essential. To address this, we conducted a lag sensitivity analysis by shifting the pupil-derived arousal signal within a physiologically plausible range (-3 to +3 TR). The community architecture remains highly consistent across these temporal offsets, showing high spatial correlation and Dice coefficients with our original findings. This stability confirms that the identified organizational motifs are robust and not dependent on a specific zero-lag assumption. For the full details of this validation and the associated figures, please refer to Reviewer #1 Weakness 3 and Figure S5 in the Supplementary Material.

      (2) Quantify and report the extent to which residual head motion, blink rate, eye closure segments, and global signal changes explain arousal connectivity coupling, for example, via partial correlation or regression controls, and show that key effects persist.

      We agree that it is essential to demonstrate that the observed arousal-connectivity coupling is not driven by non-specific physiological or motion-related artifacts. As requested, we have quantified the influence of head motion (FD) and global signal on our primary results. By implementing partial correlation analyses, we confirmed that the identified arousal-modulated community structures persist even after strictly controlling for these variables. These results indicate that the arousal-tvFC coupling we report reflects a specific neuro-arousal process rather than a byproduct of motion or systemic physiological fluctuations. For the detailed quantitative results and control analysis figures, please refer to our response to Reviewer #2 Weakness3 and Figure S6 in the Supplementary Material.
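
      A partial correlation of the kind mentioned above removes the linear contribution of nuisance variables before relating two signals. The sketch below uses the standard recursive formula; the variable names and the synthetic, mutually orthogonal sinusoids standing in for framewise displacement (FD) and global signal are assumptions chosen so that the result is easy to verify, and nothing here reproduces the authors' control analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Partial correlation of x and y given a list of covariates (recursive formula)."""
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def wave(cycles, n=200):
    """Sinusoid with a whole number of cycles, so different waves are orthogonal."""
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]


fd, global_signal = wave(3), wave(13)  # toy nuisance signals
pupil = [a + b + c for a, b, c in zip(fd, global_signal, wave(7))]
edge_fc = [a + b + c for a, b, c in zip(fd, global_signal, wave(11))]

raw_r = pearson(pupil, edge_fc)
controlled_r = partial_corr(pupil, edge_fc, [fd, global_signal])
```

      In this constructed example the raw correlation is driven entirely by the shared nuisance signals and vanishes once they are controlled for. An effect that persists after such a control, as the response reports, cannot be attributed to those covariates alone.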

      (3) Add participant-level validation: demonstrate that community profiles and lateralization signatures are consistent within participants across runs, and consider participant-level statistical summaries rather than treating all runs as independent observations.

      We agree that demonstrating participant-level consistency is vital. In response, we performed two rigorous 500-iteration resampling schemes: a split-half reliability test and a participant-level consistency assessment (N = 139). These analyses, which involved randomly partitioning the sample and selecting single sessions per participant, confirm that our community architecture and hemispheric biases are remarkably stable and not driven by sampling variability or high-dimensional noise. For a comprehensive description of these validations and the associated statistical distributions, please refer to our detailed response to Reviewer #2 Weakness3 and Figures S1–S4.

      (4) Provide an alternative dynamic connectivity estimator robustness check, or at a minimum, vary the window length and step size to show stability of the primary conclusions.

      We have conducted an exhaustive sensitivity analysis across various window lengths (30s, 35s, 60s, 90s) and step sizes (1s, 5s, 10s). The high Dice coefficients (>0.8) confirm that our findings are not dependent on specific windowing choices. Please refer to Reviewer #1 Weakness 3 and Figure S5 for the full results.

      (5) Consider validating the seven community solutions with at least one additional unsupervised approach, and report agreement with the main k-means solution.

      We agree that validating the clustering scheme is essential. To this end, we implemented a multi-criteria evaluation (including Davies-Bouldin and Silhouette indices) and utilized the L-method (Salvador & Chan, 2004) to mathematically confirm K=7 as the optimal granularity (Figure S7A–B). Furthermore, we verified that the core topological features and hemispheric asymmetries remain robustly consistent across a range of granularities from K=5 to 9 (Figure S7C). These analyses demonstrate that our findings are not dependent on a specific K or subjective bias. For the full quantitative evaluation and stability maps, please refer to our response to Reviewer #2 Weakness5 and Figure S7.

      (6) State explicitly, early in Results, what the main inferential unit is (run or participant) for each key analysis, and clarify how repeated runs per participant are handled.

      We agree that defining the inferential unit is critical for methodological clarity. In the revised manuscript, we have explicitly stated at the beginning of the Results section (page 5, lines 113-116):

      “While our primary inferential analyses were conducted at the run level to leverage the high-density sampling of the HCP 7T dataset, we further validated the robustness of these findings using participant-level statistical summaries and resampling to account for within-participant dependencies (see Figures S1–S2 in Supplementary Material).”

      Specifically, all key findings—including community architecture and hemispheric asymmetries—were validated using participant-level statistics and resampling schemes (N = 139) to ensure that the results are not biased by within-participant dependencies.

      (7) When introducing the integration and segregation indices, add a brief intuitive explanation of what a positive or negative value means in plain language before the equations.

      We thank the reviewer for this suggestion to improve the accessibility of our methods. We have added brief, intuitive explanations for both indices in the Methods section (pages 26-27, lines 569-582):

      “The integration index provides a measure of the overall hemispheric dominance of arousal-modulated connections. A positive value indicates that arousal-related edges are preferentially concentrated in the left hemisphere (including its internal and outgoing connections) compared to the right.” and “The segregation index assesses whether arousal preferentially modulates local, intra-hemispheric communication versus long-range, inter-hemispheric communication. A positive value reflects a "segregated" left-hemisphere bias, where arousal strengthens within-hemisphere connections more than it strengthens across-hemisphere communication for that same hemisphere.”

      (8) In the Discussion, separate claims into "what we show" versus "what we hypothesize," especially when connecting findings to neuromodulatory pathways.

      In the revised manuscript, we have carefully separated our direct empirical findings from our mechanistic hypotheses. We have used more cautious and speculative language (e.g., "suggesting a potential role of," "may be mediated by," and "we hypothesize that") (page 17, lines 352-358):

      “Specifically, we show the presence of low-dimensional, reproducible communities, which suggests that arousal modulates the connectome through coordinated motifs rather than homogeneous gain modulation. We hypothesize that this structured macroscopic architecture reflects the differentiated projection patterns of subcortical neuromodulatory systems, such as the locus coeruleus–noradrenergic pathway (Aston-Jones & Cohen, 2005; Jordan, 2024) and thalamus (Magnin et al., 2010; Lewis et al., 2015; Liu et al., 2018).”

      (9) Provide a clear participant-level summary (number of participants contributing to the retained runs, demographics if available, and distribution of runs per participant), alongside the reported run counts retained after quality control.

      We agree that clear reporting of participant-level data is essential. In the revised Methods section, we have added a detailed summary of participant demographics (age and sex) and clarified the sample composition (page 23, lines 502-503):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female).”

      Furthermore, to provide a transparent view of the data retained after quality control, we have included Figure S9 to illustrate the distribution of valid runs per participant. This visualization confirms the amount of data contributing to our group-level inferences and accounts for exclusions due to motion or pupillary signal quality.

      (10) Report the robustness of results to reasonable changes in pupil preprocessing choices (for example, smoothing parameters or interpolation rules), since pupil diameter is the key arousal index.

      We agree that the robustness of pupil-derived arousal estimates is fundamental to our findings. To address this, we conducted an extensive validation analysis by comparing our original pupil preprocessing pipeline against 18 alternative combinations of parameters. These variations included different smoothing window sizes (100 ms, 200 ms, and 500 ms), interpolation methods (linear vs. cubic spline), and blink buffer durations (25 ms, 50 ms, and 100 ms). As shown in Figure S8, the pupil diameter time courses derived from these diverse pipelines remained highly correlated with our original estimates (all above 0.65). This demonstrates that our arousal-modulated connectivity results are remarkably robust to reasonable changes in pupil preprocessing choices.
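
      The parameter values listed above form a 3 × 2 × 3 grid, which can be enumerated as in the sketch below. The dictionary keys and method labels are hypothetical names chosen for illustration; only the parameter values come from the response. Note that the full grid has 18 cells, matching the stated count, although the response does not say whether the original pipeline is itself one of those cells.

```python
from itertools import product

smoothing_ms = (100, 200, 500)              # smoothing window sizes
interpolation = ("linear", "cubic_spline")  # interpolation methods
blink_buffer_ms = (25, 50, 100)             # blink buffer durations

# One configuration per combination of the three preprocessing choices.
pipelines = [
    {"smoothing_ms": s, "interpolation": i, "blink_buffer_ms": b}
    for s, i, b in product(smoothing_ms, interpolation, blink_buffer_ms)
]
n_pipelines = len(pipelines)
```

      Each configuration would then be passed to the pupil preprocessing routine, and the resulting time course correlated with the original estimate.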

      Reviewer #3 (Recommendations for the authors):

      I have two additional minor comments:

      (1) Given the overall goal of this study to identify large-scale brain communities or clusters underlying arousal, the results may be sensitive to the choice of cortical parcellation. The authors should consider:

      (a) including analyses using additional parcellation schemes, or

      (b) discussing how the current findings might depend on the chosen parcellation and the implications for robustness and generalizability.

      We have addressed this by adding a dedicated point in the Discussion (page 21, lines 456-465):

      “Sixth, our findings were derived using a single high-resolution cortical parcellation. While the specific choice of atlas can influence fine-grained regional connectivity, it is important to note that our primary conclusions—such as hemispheric asymmetries and community-level preferences—were identified and interpreted at the macroscopic network and system level. By aggregating signals across broad functional systems, this approach likely mitigates the dependency on precise regional boundary definitions. Nevertheless, future studies employing alternative parcellation schemes would be valuable to further confirm that these organizational principles are not specific to the current atlas but represent a generalizable feature of the arousal-modulated connectome.”

      (2) Some key details, such as the number of participants included in the study, as well as basic demographic information, are not reported.

      We apologize for this omission. In the revised Methods section, we have now included a detailed summary of the participant demographics, including the final sample size (N = 139), age, and sex distribution (page 23, lines 502-503):

      “The final analyzed sample for the resting-state consisted of N = 139 healthy participants (mean age = 29.1±3.5 years, 77 female).”

      Furthermore, to ensure full transparency regarding data retention, we have added a new figure (Figure S9) illustrating the distribution of valid fMRI runs per participant following our quality-control procedures. We believe these additions provide a clear and complete overview of the study sample.

      References

      Aston-Jones, G., & Cohen, J. D. (2005). An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28, 403–450. https://doi.org/10.1146/annurev.neuro.28.061604.135709

      Bolt, T., Wang, S., Nomi, J. S., Setton, R., Gold, B. P., deB.Frederick, B., Yeo, B. T. T., Chen, J. J., Picchioni, D., Duyn, J. H., Spreng, R. N., Keilholz, S. D., Uddin, L. Q., & Chang, C. (2025). Autonomic physiological coupling of the global fMRI signal. Nature Neuroscience, 28(6), 1327–1335. https://doi.org/10.1038/s41593-025-01945-y

      Chandler, D. J., Gao, W.-J., & Waterhouse, B. D. (2014). Heterogeneous organization of the locus coeruleus projections to prefrontal and motor cortices. Proceedings of the National Academy of Sciences, 111(18), 6816–6821. https://doi.org/10.1073/pnas.1320827111

      Chang, C., Leopold, D. A., Schölvinck, M. L., Mandelkow, H., Picchioni, D., Liu, X., Ye, F. Q., Turchi, J. N., & Duyn, J. H. (2016). Tracking brain arousal fluctuations with fMRI. Proceedings of the National Academy of Sciences, 113(16), 4518–4523. https://doi.org/10/f8ktgg

      Gonzalez-Castillo, J., Fernandez, I. S., Handwerker, D. A., & Bandettini, P. A. (2022). Ultra-slow fMRI fluctuations in the fourth ventricle as a marker of drowsiness. NeuroImage, 259, 119424. https://doi.org/10.1016/j.neuroimage.2022.119424

      Hwang, K., Bertolero, M. A., Liu, W. B., & D’Esposito, M. (2017). The Human Thalamus Is an Integrative Hub for Functional Brain Networks. The Journal of Neuroscience, 37(23), 5594–5607. https://doi.org/10.1523/JNEUROSCI.0067-17.2017

      Jordan, R. (2024). The locus coeruleus as a global model failure system. Trends in Neurosciences, 47(2), 92–105. https://doi.org/10.1016/j.tins.2023.11.006

      Lewis, L. D., Voigts, J., Flores, F. J., Schmitt, L. I., Wilson, M. A., Halassa, M. M., & Brown, E. N. (2015). Thalamic reticular nucleus induces fast and local modulation of arousal state. eLife, 4, e08760. https://doi.org/10.7554/eLife.08760

      Liu, X., De Zwart, J. A., Schölvinck, M. L., Chang, C., Ye, F. Q., Leopold, D. A., & Duyn, J. H. (2018). Subcortical evidence for a contribution of arousal to fMRI studies of brain activity. Nature Communications, 9(1), 395. https://doi.org/10.1038/s41467-017-02815-3

      Lloyd, B., De Voogd, L. D., Mäki-Marttunen, V., & Nieuwenhuis, S. (2023). Pupil size reflects activation of subcortical ascending arousal system nuclei during rest. eLife, 12, e84822. https://doi.org/10.7554/eLife.84822

      Magnin, M., Rey, M., Bastuji, H., Guillemant, P., Mauguière, F., & Garcia-Larrea, L. (2010). Thalamic deactivation at sleep onset precedes that of the cerebral cortex in humans. Proceedings of the National Academy of Sciences, 107(8), 3829–3833. https://doi.org/10.1073/pnas.0909710107

      Meissner, S. N., Bächinger, M., Kikkert, S., Imhof, J., Missura, S., Carro Dominguez, M., & Wenderoth, N. (2023). Self-regulating arousal via pupil-based biofeedback. Nature Human Behaviour, 8(1), 43–62. https://doi.org/10.1038/s41562-023-01729-z

      Müller, E. J., Munn, B., Hearne, L. J., Smith, J. B., Fulcher, B., Arnatkevičiūtė, A., Lurie, D. J., Cocchi, L., & Shine, J. M. (2020). Core and matrix thalamic sub-populations relate to spatio-temporal cortical connectivity gradients. NeuroImage, 222, 117224. https://doi.org/10.1016/j.neuroimage.2020.117224

      Salvador, S., & Chan, P. (2004). Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International Conference on Tools with Artificial Intelligence, 576–584. https://doi.org/10.1109/ICTAI.2004.50

      Schwarz, L. A., & Luo, L. (2015). Organization of the Locus Coeruleus-Norepinephrine System. Current Biology, 25(21), R1051–R1056. https://doi.org/10.1016/j.cub.2015.09.039

      Shine, J. M. (2019). Neuromodulatory Influences on Integration and Segregation in the Brain. Trends in Cognitive Sciences, 23(7), 572–583. https://doi.org/10.1016/j.tics.2019.04.002

      Shine, J. M., Lewis, L. D., Garrett, D. D., & Hwang, K. (2023). The impact of the human thalamus on brain-wide information processing. Nature Reviews Neuroscience, 24(7), 416–430. https://doi.org/10.1038/s41583-023-00701-0

      Sommer, D., & Golz, M. (2010). Evaluation of PERCLOS based current fatigue monitoring technologies. 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 4456–4459. https://doi.org/10.1109/IEMBS.2010.5625960

      Weijs, M. L., Missura, S., Potok-Szybińska, W., Bächinger, M., Badii, B., Carro-Domínguez, M., Wenderoth, N., & Meissner, S. N. (2025). Modulating cortical excitability and cortical arousal by pupil self-regulation. Nature Communications, 16(1), 4552. https://doi.org/10.1038/s41467-025-59837-5

      Yellin, D., Berkovich-Ohana, A., & Malach, R. (2015). Coupling between pupil fluctuations and resting-state fMRI uncovers a slow build-up of antagonistic responses in the human cortex. NeuroImage, 106, 414–427. https://doi.org/10.1016/j.neuroimage.2014.11.034

    1. eLife Assessment

      The importance of uterine natural killer (NK) cells in reproductive success has been demonstrated in mice and humans; however, it is still unclear how uterine NK cells are developed. In this important manuscript, the authors provide convincing evidence that TGF-b signaling in NK cells supports normal pregnancy in mice by the conversion of conventional NK cells into uterine tissue-resident NK cells. Previous concerns have been addressed in this revised version.

    2. Reviewer #1 (Public review):

      This is an excellent paper from Dr. Yokoyama and colleagues. The experiments are technically demanding, given the very low cell numbers and the challenges of working with implantation sites at gestational days 6.5, 10.5, and 14.5. Overall, the impact of TGF-β receptor II deficiency in the NK lineage on uterine trNK cell numbers and litter size is convincing, and the authors' conclusions are well supported by the data. Less convincing, however, is the claim that the decrease in trNK cells is compensated by an increase in cNK cells; rather, the absence of TGF-β receptor II appears to result in an overall reduction of NK/ILC1 cells.

      Comments on revised version:

      I thank the authors for addressing all my comments from my initial review.

    3. Reviewer #2 (Public review):

      In their manuscript "TGF-β drives the conversion of conventional NK cells into uterine tissue-resident NK cells to support murine pregnancy", Yokoyama and colleagues investigate the role of Tgfbr2 expression by NK cells in the formation of tissue-resident uterine NK cells and subsequent importance in murine pregnancy. By transferring congenic splenic conventional NK cells into pregnant mice, they show conversion of circulating NK cells into uterine ivCD45 negative tissue-resident NK cells. When interfering with the formation of uterine trNK cells, spiral artery remodelling was impaired, fetal resorption rates were increased, and litter sizes were reduced.

      Generally, this is a research topic of high interest, yet the manuscript is lacking detailed mechanistical insights and some questions remain open. At the current state, the data represent an interesting characterisation of the Tgfbr2-fl/fl Ncr1-Cre mice in pregnancy, but considering 1) the recent publication by the group (Ref#17) on the role of Eomes+ cNK cells during pregnancy, 2) the previously described role of Tgfbr2 and autocrine TGFb expression for uterine NK cell differentiation in virgin mice (also cited by the authors), and 3) the well-known relevance of uterine NK cells during pregnancy, additional experiments addressing the specific role of Tgfb during pregnancy would help to improve novelty and significance of the manuscript.

      Comments on revised version:

      In their revised version of the manuscript and their point-by-point response, the authors have very carefully addressed and discussed all of our concerns and suggestions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) Figure 1A and B: Although a trend is evident, it does not appear that the absolute number of cNK cells at day 14 is significantly changed from day 6.5?

      We thank the reviewer for this careful observation. We had not originally performed a statistical comparison between the number of cNK cells present at gds 6.5 and 14.5. We have now conducted the appropriate statistical analysis for this dataset and found that the absolute number of cNK cells at day 14.5 is in fact significantly different from day 6.5 (p = 0.0005; unpaired t test, Mann-Whitney correction). The figure and corresponding legend have been updated to reflect this analysis. Please see Figure 1B:

      “Statistics were calculated using unpaired t tests with the Mann-Whitney correction. Error bars indicate SEM; *** p < 0.001.”

      (2) Figure 2E: The authors state, "This reduction of uterine trNK cells was accompanied by a concomitant increase in the absolute number and frequency of CD49b+Eomes+ cNK cells within the pregnant uterus of TGF-βRIINcr1Δ dams (Figure 2 D, E)." The number of cNK cells appears relatively low (visually ~1,000-1,300), and although the difference is statistically significant, its physiological relevance is unclear. More importantly, this modest increase does not correlate with the marked decrease in trNK and ILC1 populations, as cNK cells do not appear to accumulate. In my opinion, the conclusion "Collectively, these findings indicate that a TGF-β-driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy" should be slightly toned down.

      We thank both reviewers for this suggestion. Regarding the absence of cNK cell accumulation in the absence of TGF-β signaling, we suggest that this may be related to the normal passage of cNK cells circulating in the placenta, i.e., these cells may not have acquired signals to remain in the uterus and are simply continuing to pass through and not accumulating. Nonetheless, we have rephrased our wording to address this concern as follows:

      “This reduction of uterine trNK cells was accompanied by a small increase in the absolute number and frequency of CD49b<sup>+</sup> Eomes<sup>+</sup> cNK cells within the pregnant uterus of TGF-βRII<sup>Ncr1∆</sup> dams (Figure 2 D, E). Collectively, these findings suggest that a TGF-β–driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy.”

      “The absence of cNK cell accumulation in the gravid uterus in the setting of impaired TGF-β signaling suggests a defect in tissue retention rather than recruitment. In the absence of TGF-β–mediated cues, circulating cNK cells that enter the uterine vasculature may fail to acquire the molecular programs required for residency and instead continue to transit through the tissue. This is consistent with a model in which TGF-β signaling promotes not only phenotypic conversion but also the acquisition of retention signals necessary for persistence within the uterine microenvironment, reinforcing that acquisition of tissue-residency in the gravid uterus is an actively instructed process [29,32].”

      (3) Figures 2-4: It is unclear whether the littermate controls are floxed mice or floxhet-Ncr1iCre mice? This distinction is important, as Ncr1iCre expression itself could potentially lead to a phenotype.

      To address these concerns, we characterized the uterine innate lymphoid cell compartment in the pregnant uterus of Ncr1<sup>iCre</sup> dams at gestational day 6.5. We did not observe a difference in the absolute number and frequency of trNK cells, cNK cells, and ILC1s in the gravid uterus of Ncr1<sup>iCre</sup> dams compared to wildtype CD45.1 C57BL/6 mice. Additionally, the number of implantation sites and the resorption rates in Ncr1<sup>iCre</sup> dams were comparable to those in wildtype CD45.1 C57BL/6 mice. Together, these data indicate that Ncr1<sup>iCre</sup> expression itself does not influence the phenotype we report in TGF-βRII<sup>Ncr1∆</sup> dams. These additional findings have been included in Supplementary Figure 1 and in the text as follows:

      “To ensure we exclude a confounding effect of Ncr1<sup>iCre</sup> expression, we profiled the uterine innate lymphoid compartment in pregnant Ncr1<sup>iCre</sup> dams at gestational day 6.5. No differences were observed in the absolute number of trNK cells, cNK cells, or ILC1s relative to wildtype controls (Figure S1 A-D), and implantation site number and resorption rates were likewise unchanged (Figure S1 E-F). These data indicate that Ncr1<sup>iCre</sup> expression alone does not perturb uterine ILC composition or early pregnancy outcomes.”

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1C &D: The adoptive transfer experiment is convincing. As a minor point, why is the gate setting for Eomes different between panels 1C and 1D?

      To clarify the phenotype of the adoptively transferred cNK cells, we included two additional gates depicting the expression of CD49a and CD49b in unlabeled (non-vascular) trNK cells and cNK cells in the pregnant uterus. Please see the revised Figure 1C and revised figure legend:

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in pregnant uterus of wildtype dams upregulate CD49a and down regulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

      (2) Figure 3: Has the pup ratio male/female changed?

      We did not observe a statistically significant difference in the female-to-male pup ratio between groups.

      Reviewer #2 (Public review):

      (1) The authors suggest cNK extravasation and local differentiation into iv- trNK. Can it be estimated how much this process contributes to the trNK pool vs. a potential local proliferation of already existing trNK? How do absolute numbers of CD49a+ Eomes+ trNK change during pregnancies? (In Figure 1A, the cell numbers of CD49a+ Eomes+ trNK seem to go down dramatically between gd 6.5 and 14.5). The plot in 1B could also include absolute numbers of ILC1s and trNKs. Would recruited cNK cells compensate for a potential loss of CD49a+ Eomes+ trNK?

      Our prior work, as well as that of others, has tracked the changes in uterine trNK cells, cNK cells, and ILC1s over the course of murine pregnancy. Consistent with these studies, the absolute number of uterine CD49a<sup>+</sup> Eomes<sup>+</sup> trNK cells peaks during early pregnancy (roughly between gd 5.5 and 7.5) and subsequently declines until term. The decrease in uterine trNK cells between gd 6.5 and gd 14.5 observed in Figure 1A is therefore consistent with the known physiological contraction of the decidual NK compartment as pregnancy progresses. Thus, it is unlikely that cNK cells recruited into the uterine tissue compensate for the observed loss of CD49a<sup>+</sup> Eomes<sup>+</sup> trNK cells. To address the reviewer’s request, we have now included the absolute number of uterine trNK cells and ILC1s in Figure 1; please see the updated Figure 1C and D and the corresponding figure legend (provided below). With respect to the relative contribution of cNK cell extravasation versus local proliferation of trNK cells, our data do not allow us to quantitatively distinguish between these mechanisms. Moreover, previous studies have demonstrated that uterine trNK cells express Ki67, suggesting that they exhibit proliferative activity during this period. Thus, we hypothesize that both local proliferation of existing trNK cells and recruitment of circulating cNK cells contribute to the population of uterine trNK cells during early pregnancy.

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in pregnant uterus of wildtype dams upregulate CD49a and down regulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells. (D) Proportion of uterine ILC subsets derived from adoptively transferred splenic cNK cells in the pregnant uterus of wildtype dams. Statistics were calculated using unpaired t tests with the Mann-Whitney correction. Error bars indicate SEM; ***p < 0.001.”

      Barahona, J.D., Yang, L. and Yokoyama, W.M., 2025. Eomesodermin defines uterine NK cells crucial for pregnancy success in mice. The Journal of Immunology, 214(10), pp.2549-2556.

      Filipovic, I., Chiossone, L., Vacca, P., Hamilton, R.S., Ingegnere, T., Doisne, J.M., Hawkes, D.A., Mingari, M.C., Sharkey, A.M., Moretta, L. and Colucci, F., 2018. Molecular definition of group 1 innate lymphoid cells in the mouse uterus. Nature Communications, 9(1), p.4492.

      (2) Figure 1C: 2.5 Mio cNK cells have been transferred, but only very few cells can be detected within the uterus (concatenated FACS plot shown). What may represent the limit to generate uterine trNK out of cNK? Is the niche supporting cNK-trNK differentiation limited? Is it only a specific subset of (splenic) cNK capable of differentiating into trNK? Is gd 0.5 the optimal timepoint for the transfer? Is there continuous recruitment of cNK into the uterus and differentiation into trNK, or is it enhanced at specific timepoints of pregnancy? Could there be local proliferation of cNK-derived trNK? This could be studied by proliferation dye dilution of WT cNK cells in this transfer-setup.

      We recognize that transferring cNK cells at gestational day 0.5, prior to placental formation, may partially account for the low uterine reconstitution observed. At this time point, the local signals necessary for efficient recruitment and retention of cNK cells in the uterus may not yet be fully established, potentially resulting in preferential homing to peripheral tissues such as the spleen and liver. Consistent with this possibility, we do observe a robust population of adoptively transferred cNK cells in the spleen and liver of our pregnant dams. We decided to transfer cNK cells at gestational day 0.5 to ensure that the cells were present throughout most of early pregnancy, particularly during implantation and the initial stages of decidualization. We also did not transfer cells before mating, to minimize the number of mice that did not become pregnant. Additionally, performing the transfer at this early time point minimized repeated manipulation of pregnant dams, as procedural stress itself has been shown to affect physiological processes of gestation and could thereby confound the pregnancy outcomes we were assessing. Furthermore, Filipovic et al. 2018 previously showed that both trNK cells and cNK cells in the pregnant uterus expressed Ki67 at gestational day 9.5, suggesting that there could be local proliferation of cNK-derived trNK cells in the gravid uterus that could limit the migration of circulating cNK cells into this microenvironment. We have discussed this in more depth in our discussion section as follows:

      “Interestingly, the inability to fully reconstitute the uterine trNK cell compartment following adoptive transfer suggests that only a subset of circulating cNK cells may be capable of differentiating into trNK cells during pregnancy, or alternatively that trNK cells already present in the virgin uterus may undergo in situ proliferation in the gravid uterus. Previous studies from our lab as well as others show that trNK cells within the pregnant murine uterus express marked levels of Ki67, supporting a model in which local proliferation of uterine trNK cells is a major contributor to the uterine trNK cell pool during pregnancy [7,32]. Prior studies have also described hematopoietic precursors within endometrial and decidual tissues that generate uterine trNK cells, suggesting that the compartment may be also sustained by local precursor differentiation [33-35]. Together, these findings suggest that uterine trNK cell ontogeny may be more complex than a single-source model and raise the possibility that distinct developmental pathways may operate at different stages of reproductive life. Therefore, defining the relative contribution and developmental timing of hematogenous versus locally maintained sources in vivo could provide relevant insights into the developmental trajectories and transcriptional programs that underlie decidual NK cell heterogeneity.”

      Zhai, Q.Y., Wang, J.J., Tian, Y., Liu, X. and Song, Z., 2020. Review of psychological stress on oocyte and early embryonic development in female mice. Reproductive Biology and Endocrinology, 18(1), p.101.

      Wiebold, J.L., Stanfield, P.H., Becker, W.C. and Hillers, J.K., 1986. The effect of restraint stress in early pregnancy in mice. Reproduction, 78(1), pp.185-192.

      Sánchez-Rubio, M., Abarzúa-Catalán, L., Del Valle, A., Méndez-Ruette, M., Salazar, N., Sigala, J., Sandoval, S., Godoy, M.I., Luarte, A., Monteiro, L.J. and Romero, R., 2024. Maternal stress during pregnancy alters circulating small extracellular vesicles and enhances their targeting to the placenta and fetus. Biological Research, 57(1), p.70.

      Filipovic, I., Chiossone, L., Vacca, P., Hamilton, R.S., Ingegnere, T., Doisne, J.M., Hawkes, D.A., Mingari, M.C., Sharkey, A.M., Moretta, L. and Colucci, F., 2018. Molecular definition of group 1 innate lymphoid cells in the mouse uterus. Nature Communications, 9(1), p.4492.

      (3) The authors should consider inducible Tgfbr2 deletion (e.g. with Tamoxifen-inducible Cre) to enable development of the uterine NK compartment in virgin mice and only ablate trNK differentiation during pregnancy. This could help to estimate the turnover of cNK into trNK, or to understand if constant cNK recruitment is required to form the uterine trNK compartment during pregnancy.

      Thank you for this suggestion. We did initially consider incorporating a mouse model with a tamoxifen-inducible deletion of TGF-βRII to examine the differentiation of peripheral cNK cells into uterine trNK cells more precisely. However, the administration of tamoxifen during murine pregnancy has well-established deleterious effects on implantation, fetal viability, and placentation, which would confound our interpretation of any adverse pregnancy outcome observed in our studies. Because our goal was to assess NK cell-specific contributions to murine gestation without introducing additional pregnancy-related perturbations, we elected to use an Ncr1<sup>iCre</sup>-based mouse model in our studies.

      Ved, N., Curran, A., Ashcroft, F.M. and Sparrow, D.B., 2019. Tamoxifen administration in pregnant mice can be deleterious to both mother and embryo. Laboratory Animals, 53(6), pp.630-633.

      Sun, M.R., Steward, A.C., Sweet, E.A., Martin, A.A. and Lipinski, R.J., 2021. Developmental malformations resulting from high-dose maternal tamoxifen exposure in the mouse. PLoS One, 16(8), p.e0256299.

      Ilchuk, L.A., Stavskaya, N.I., Varlamova, E.A., Khamidullina, A.I., Tatarskiy, V.V., Mogila, V.A., Kolbutova, K.B., Bogdan, S.A., Sheremetov, A.M., Baulin, A.N. and Filatova, I.A., 2022. Limitations of tamoxifen application for in vivo genome editing using Cre/ERT2 system. International Journal of Molecular Sciences, 23(22), p.14077.

      (4) Did the authors consider transfer of Tgfbr2-floxed Ncr1-Cre cNK in the same setup as in Fig. 1C? This experiment could confirm the requirement of Tgfbr-dependent signaling for cNK to trNK conversion during pregnancy versus effects of Tgfb signals on trNK numbers in the uterus at steady state (before pregnancy).

      We thank the reviewer for this mechanistically insightful suggestion. We did consider performing reciprocal transfer experiments using TGF-βRII<sup>fl/fl</sup> Ncr1<sup>iCre</sup> cNK cells in the same adoptive transfer system as in Figure 1C. However, our current adoptive transfer experiments already address this question. Transfer of congenically labeled wild-type splenic cNK cells into TGF-βRII<sup>Ncr1Δ</sup> dams at gestational day 0.5 resulted in partial reconstitution of the uterine trNK compartment and, importantly, this was sufficient to rescue the adverse pregnancy outcomes observed at midgestation. These findings indicate that TGF-β–competent cNK cells can differentiate and function appropriately within the pregnant uterine environment, supporting a requirement for TGF-β–dependent signaling in cNK-to-trNK conversion during pregnancy. Because restoration of TGF-β–sufficient cNK cells rescues these pregnancy outcomes, we believe this experiment functionally demonstrates the importance of TGF-β signaling in this process and therefore did not pursue reciprocal transfer of TGF-βRII–deficient cNK cells.

      “Partial reconstitution of uterine trNK cells restores midgestational pregnancy outcomes in TGF-βRII<sup>Ncr1∆</sup> dams

      To determine whether restoring uterine trNK cells could rescue the midgestational pregnancy defects observed in TGF-βRII<sup>Ncr1∆</sup> dams, we adoptively transferred wildtype, congenically labeled splenic cNK cells into pregnant TGF-βRII<sup>Ncr1∆</sup> dams at gd 0.5. By gd 10.5, donor cNK cells were detected in the pregnant uterus, where a subset upregulated CD49a and downregulated CD49b, consistent with acquisition of a uterine trNK cell phenotype (Figure 5 A). However, adoptively transferred splenic cNK cells only partially reconstituted the uterine trNK cell population in the gravid uterus of TGF-βRII<sup>Ncr1∆</sup> dams, as evidenced by reduced absolute number and frequency of donor-derived trNK cells in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams (Figure 5 A-C). Notably, this partial reconstitution was sufficient to rescue the gestational defects caused by impaired TGF-β–mediated uterine trNK cell differentiation. Reconstituted TGF-βRII<sup>Ncr1∆</sup> dams exhibited implantation site numbers and fetal resorption rates at gd 10.5 comparable to those observed in littermate controls (Figure 5 D, E). Together, these findings suggest that even partial restoration of the uterine trNK cell in pregnant TGF-βRII<sup>Ncr1∆</sup> dams is sufficient to restore pregnancy outcomes at midgestation, supporting a central role for uterine trNK cells as the principal NK cell subset required for successful murine pregnancy.”

      (5) Figures 2D/E: The authors should state that ILC1s are reduced in the virgin uterus of female Tgfbr2-floxed or Tgfb1-floxed Ncr1-Cre mice and cite the relevant work (the Ref #29 discussed in this context did not show that?). It would be helpful to include an analysis of all three uterine ILC subsets in steady state. This could help to answer the question if the cNK cell changes are pregnancy-specific or a general phenomenon in Tgfbr2-floxed Ncr1-Cre mice.

      We thank the reviewer for this important comment and for noting the miscitation. We regret the error and have corrected the reference in the revised manuscript to cite the appropriate study demonstrating reduced ILC1s in the virgin uterus of Tgfb1<sup>fl/fl</sup> Ncr1<sup>iCre</sup> mice {Sparano, C. et al. 2024. Autocrine TGF-β1 drives tissue-specific differentiation and function of resident NK cells. Journal of Experimental Medicine, 222(3), p.e20240930}. Please see Line 148. Importantly, the steady-state ILC compartment in virgin Tgfb1<sup>fl/fl</sup> Ncr1<sup>iCre</sup> mice has already been carefully characterized in the previously published work, including analysis of all three uterine ILC subsets. Because the steady-state uterine ILC landscape in this mouse model has already been established by Sparano, C. et al. 2024, our study focuses specifically on the pregnancy-associated changes in the uterine ILC landscape occurring in the absence of TGF-β signaling in Ncr1-expressing cells and their subsequent effects on gestational outcomes. In the absence of TGF-β signaling there appears to be a higher frequency of cNK cells in both the virgin uterus and pregnant uterus, suggesting that this is more of a general phenomenon.

      “However, in the pregnant uterus, CD49a<sup>+</sup> Eomes<sup>-</sup> ILC1s were markedly reduced in implantation sites of TGF-βRII<sup>Ncr1∆</sup> dams, paralleling the reduction of ILC1s previously reported in the virgin uterus of TGF-βRII<sup>Ncr1∆</sup> female mice [26].”

      (6) Figure 2E: Please phrase more carefully about the "concomitant increase" of cNKs, since this increase is much less pronounced compared to the very strong reduction (absence) of trNKs in Tgfbr2-floxed Ncr1-Cre mice. Do the authors suggest that cNKs are halted at this stage and cannot differentiate into trNK, based on these data?

      We thank both reviewers for this suggestion, and we have rephrased our wording to address this concern as follows:

      “This reduction of uterine trNK cells was accompanied by a small increase in the absolute number and frequency of CD49b<sup>+</sup> Eomes<sup>+</sup> cNK cells within the pregnant uterus of TGF-βRII<sup>Ncr1∆</sup> dams (Figure 2 D, E). Collectively, these findings suggest that a TGF-β–driven differentiation pathway directs the conversion of peripheral cNK cells into uterine trNK cells during murine pregnancy.”

      Please also see our response to Reviewer #1, Comment #2.

      (7) Can the reduced litter size and the abnormal spiral artery formation be rescued by transfer of WT cNK into Tgfbr2-floxed Ncr1-Cre mice?

      We thank the reviewers for this interesting question. In subsequent experiments, we transferred congenically labeled splenic cNK cells from wildtype female mice into TGF-βRII<sup>Ncr1∆</sup> dams at gestational day 0.5. We observed only partial reconstitution of the uterine trNK cell population; however, the number of viable implantation sites and the resorption rates in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams were comparable to those in HBSS-treated littermate controls at gestational day 10.5. Given that partial reconstitution of the uterine trNK cell compartment in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams was sufficient to rescue the defects in implantation site number and fetal resorption rates observed at midgestation, we hypothesize that this level of restoration may permit partial but functionally sufficient spiral artery remodeling, reestablishing maternal-fetal blood flow adequate to support fetal viability, although spiral artery remodeling was not directly assessed in this transfer study.

      “Partial reconstitution of uterine trNK cells restores midgestational pregnancy outcomes in TGF-βRII<sup>Ncr1∆</sup> dams

      To determine whether restoring uterine trNK cells could rescue the midgestational pregnancy defects observed in TGF-βRII<sup>Ncr1∆</sup> dams, we adoptively transferred wildtype, congenically labeled splenic cNK cells into pregnant TGF-βRII<sup>Ncr1∆</sup> dams at gd 0.5. By gd 10.5, donor cNK cells were detected in the pregnant uterus, where a subset upregulated CD49a and downregulated CD49b, consistent with acquisition of a uterine trNK cell phenotype (Figure 5 A). However, adoptively transferred splenic cNK cells only partially reconstituted the uterine trNK cell population in the gravid uterus of TGF-βRII<sup>Ncr1∆</sup> dams, as evidenced by reduced absolute number and frequency of donor-derived trNK cells in reconstituted TGF-βRII<sup>Ncr1∆</sup> dams (Figure 5 A-C). Notably, this partial reconstitution was sufficient to rescue the gestational defects caused by impaired TGF-β–mediated uterine trNK cell differentiation. Reconstituted TGF-βRII<sup>Ncr1∆</sup> dams exhibited implantation site numbers and fetal resorption rates at gd 10.5 comparable to those observed in littermate controls (Figure 5 D, E). Together, these findings suggest that even partial restoration of the uterine trNK cell in pregnant TGF-βRII<sup>Ncr1∆</sup> dams is sufficient to restore pregnancy outcomes at midgestation, supporting a central role for uterine trNK cells as the principal NK cell subset required for successful murine pregnancy.”

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1C: The shown gate seems to "cut" into the CD49b staining; staining for all transferred cells should be shown; have cNK cells been stained in parallel with the same panel to provide a positive and compensation control?

      To clarify the phenotype of the adoptively transferred cNK cells, we included two additional gates depicting the expression of CD49a and CD49b in unlabeled (non-vascular) trNK cells and cNK cells in the pregnant uterus. Please see the revised Figure 1C.

      “(C) Concatenated flow plots of implantation sites showing that adoptively transferred cNK cells in pregnant uterus of wildtype dams upregulate CD49a and down regulate CD49b by gd 10.5, acquiring a CD49a<sup>+</sup> CD49b<sup>-</sup> Eomes<sup>+</sup> phenotype characteristic of uterine trNK cells (C57BL/6 dams n=4). Here, 2.5x10<sup>6</sup> CD45.2<sup>+</sup> CD3<sup>-</sup> CD19<sup>-</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> CD49b<sup>+</sup> splenic cNK cells were adoptively transferred into pregnant C57BL/6-CD45.1 dams at gd 0.5, and the receptor profile of these cells was subsequently assessed at gd 10.5. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2–PE-Cy7<sup>-</sup> CD45.2–PE<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

      (2) Figure 2A: The authors could include an isotype control or a staining in a genetic knockout as a control staining.

      Thank you for this suggestion. As suggested, we included staining in a genetic TGF-βRII<sup>Ncr1∆</sup> knockout as additional control staining. Please see the revised Figure 2A.

      “Representative histograms depicting TGF-β Receptor II expression on splenic NK cells from virgin TGF-βRII<sup>Ncr1∆</sup> and wildtype mice as well as splenic and uterine NK cell subsets from pregnant wildtype mice at gd 10.5 (virgin TGF-βRII<sup>Ncr1∆</sup> mice, n=2; virgin mice: C57BL/6, n=5; gd 10.5: C57BL/6 dams, n=8, implantation sites n=8). MFI, median fluorescent intensity. Gating strategy: Live, Single Cells; CD3<sup>-</sup> CD19<sup>-</sup> CD45.1<sup>-</sup> CD45.2<sup>+</sup> NK1.1<sup>+</sup> NKp46<sup>+</sup> cells.”

    1. eLife Assessment

      This manuscript provides important insights into how U2AF2-dependent intron retention regulates the localization and function of long noncoding RNAs, with evidence supported by multiple complementary approaches. The work is notable for linking intron retention to nuclear speckle localization and cellular phenotypes, including proliferation and migration, although the mechanistic basis remains incompletely resolved. Overall, the study presents a compelling dataset with clear biological implications but would benefit from additional analyses to strengthen mechanistic interpretation and generality.

    2. Reviewer #1 (Public review):

      Intron retention is observed in many long noncoding RNAs. The authors here used a powerful genome-wide screening strategy to identify proteins controlling intron retention in the long noncoding RNA PURPL. Surprisingly, one of the top hits across multiple cell lines was U2AF2, which is well known to bind the polypyrimidine tract close to the 3' splice site to promote splicing. Nonetheless, U2AF2 is working in the opposite direction here. Convincing follow-up RT-PCR experiments confirmed that knocking down U2AF2 does indeed lead to reduced intron retention of PURPL. The authors then show that this intron retention event is functionally important for both the nuclear retention of PURPL and its ability to enhance cell proliferation.

      The authors then used transcriptome-wide analyses to look for additional intron retention events affected by U2AF2. Among the ~250 genes with decreased intron retention (more splicing) upon U2AF2 knockdown was MALAT1, a well-established long noncoding RNA that normally localizes to nuclear speckles. Depletion of U2AF2 or removal of the MALAT1 2nd intron resulted in reduced speckle localization and cell migration, revealing a critical and fascinating role for this intron retention event. Overall, the authors have used a set of complementary approaches to clearly demonstrate a very intriguing role for U2AF2 in controlling intron retention and functionality of a set of long noncoding RNAs.

      I feel the current work has revealed an important role of intron retention in controlling the localization and functionality of long noncoding RNAs, which is likely broad in scope and is likely regulated by cell state.

      One experimental suggestion: The authors show that expressing intron 2-containing PURPL in PURPL-depleted cells is sufficient to induce faster proliferation, but a valuable comparison would be the phenotype of cells expressing the spliced PURPL transcript.

    3. Reviewer #2 (Public review):

      Summary:

      This study identified U2AF1/2 as a regulator of pre-mRNA splicing that either promotes or suppresses the splicing of introns in different genes. The authors then focused on two genes, PURPL and MALAT1, in which U2AF1/2 promotes the retention of specific introns, and characterized the biological implications of these U2AF1/2-regulated introns.

      Strengths:

      (1) The experiments in this manuscript are relatively rigorously designed and performed, often with validation checks such as verifying the knockout, verifying the treatment itself doesn't have an effect, etc.

      (2) The experiments provided comprehensive support for the claims that these specific introns are important for the stability or nuclear localization of the RNA, as well as that U2AF1/2 suppresses the splicing of these introns.

      (3) The writing of the manuscript is very clear and doesn't overstate the conclusions that can be drawn from the experiments.

      Weaknesses:

      I think one main weakness of this study is the lack of a deeper analysis of the mechanisms. Whether studying the mechanism is within the scope of this paper is probably debatable, but with the current experiment setup and data, I believe there are some analyses that can be relatively easily done to enhance the value or significance of this study. My detailed questions and suggestions are listed below:

      (1) Line 194-195 and Figure 2A: How many RBPs are included in "other RBPs" in line 194? Does "other RBPs" only include PTBP1, PRPF8 and SRSF1 in Figure 2A, or do they include all the ~100 RBPs with HepG2 eCLIP data available on ENCODE? If U2AF1/2 have the highest occupancy around the intron 2 region among the ~100 RBPs, it would be nice to visualize it.

      (2) Figure 2A and 2B: Why didn't U2AF2 show interaction with exon 2 and 3 in RNA-IP but showed enrichment over exon 2 and exon 3 regions in the eCLIP data?

      (3) Figure 3C - 3F: Maybe I misinterpreted the experiments, but to my understanding, these experiments showed that the exogenous PURPL with intron 2 promoted cell proliferation compared to when the exogenous PURPL wasn't induced, but didn't compare this to the effect of the same amount of PURPL with intron 2 removed. Wouldn't it be clearer to compare the effects of exogenous PURPL with intron 2 and exogenous PURPL without intron 2 to pinpoint whether the effect is related to intron 2? Without an intron 2-specific experiment, the current experiments don't seem to add much beyond "PURPL promotes cell proliferation".

      (4) It's not very clear what proportion of these introns are retained in the endogenous PURPL and MALAT1 in various tissues, cell types and conditions. I think it will be valuable to provide this background (either from previous research, public database or data from this study).

      (5) Since U2AF1/2 have a wide range of targets as demonstrated by Figure 4A, I think it would be valuable to have some experiments that directly disrupt the interaction between U2AF1/2 and PURPL and MALAT1 and test the effect on splicing outcomes, such as by mutating the sequence that U2AF1/2 bind to. The section on the weak py-tract of PURPL touched upon this topic but focused more on how the weak py-tract causes the intron 2 retention in the background rather than how U2AF1/2 binding and action were affected by sequence mutations. I think experiments on disrupting the direct binding between U2AF1/2 on targets can provide valuable mechanistic insights.

      (6) Across all the target genes of U2AF1/2, it might be feasible to do some systematic analysis to find what correlates with whether U2AF1/2 have a promoting or suppressing effect on intron splicing. For example, do genes with decreased IR after U2AF2 depletion systematically have a weak py-tract compared to genes with increased IR? This dataset can potentially provide many hypotheses for understanding the dual role of U2AF1/2.
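
      The systematic comparison suggested in point (6) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function names, the 20-nt window, the pyrimidine-fraction score, and the toy sequences are all hypothetical and are not taken from the manuscript or its data.

```python
from statistics import median

# Assumption: polypyrimidine-tract "strength" is approximated by the fraction
# of C/T(U) bases in a fixed window just upstream of the 3' splice-site AG.
PYRIMIDINES = set("CTU")


def py_tract_score(intron_seq: str, window: int = 20, skip: int = 2) -> float:
    """Return the pyrimidine fraction in the `window` nt preceding the final
    `skip` nt (the terminal AG dinucleotide) of an intron sequence."""
    seq = intron_seq.upper()
    end = len(seq) - skip
    tract = seq[max(0, end - window):end]
    if not tract:
        raise ValueError("intron sequence too short to score")
    return sum(base in PYRIMIDINES for base in tract) / len(tract)


def compare_groups(decreased_ir, increased_ir, window: int = 20):
    """Return the median score for introns whose retention decreased and for
    introns whose retention increased after U2AF2 depletion."""
    dec = [py_tract_score(s, window) for s in decreased_ir]
    inc = [py_tract_score(s, window) for s in increased_ir]
    return median(dec), median(inc)


# Toy, made-up intron sequences (hypothetical, for illustration only).
weak_tract_intron = "GTAAGT" + "TA" * 10 + "AG"
strong_tract_intron = "GTAAGT" + "T" * 20 + "AG"

dec_median, inc_median = compare_groups([weak_tract_intron], [strong_tract_intron])
print(f"median score, decreased IR: {dec_median:.2f}")
print(f"median score, increased IR: {inc_median:.2f}")
```

      In practice, a dedicated splice-site scoring model and a formal statistical test across all affected introns would likely be more appropriate than this simple fraction; the sketch only shows the shape of the comparison.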

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript characterized the splicing regulation of two long non-coding RNAs relevant to cancer, starting with a focus on PURPL and ending with insights into MALAT1. A CRISPR screen for the regulators of PURPL intron retention revealed a role for the U2AF heterodimer in inducing this retention, with U2AF2 as the actual hit. This is surprising, because the canonical function of U2AF is to recognize the polypyrimidine tract (PPT) and 3' splice site junction to induce splicing at the site. The brief mechanistic characterization of this phenomenon showed that this intron retention accounts for the nuclear localization and instability of the PURPL transcript, and seems to confer the enhanced cell proliferation feature. U2AF2 also induces retention of two introns in MALAT1, and one of them is essential for its nuclear speckle localization and enhanced cell migration.

      Strengths:

      These findings about PURPL and MALAT1 are clear and interesting.

      Weaknesses:

      The results are not sufficiently connected to each other, because one regulation is nuclear-speckle dependent but not the other.

      Here are my specific comments:

      Major comments:

      The main issue is the lack of focus because of the distinct and incomplete analysis pertaining to the two long noncoding RNAs, PURPL and MALAT1. The paper starts with a very good genetic screen on the former, and immunofluorescence and functional analysis on the latter, with U2AF2 as the main link to induce intron retention. The first one does not show clear localization while the second docks to nuclear speckles, apparently because of the retained intron. Hence the two mechanisms are related yet distinct. Here are some suggestions to enhance the characterization and connection between the two cases:

      (1) As the MALAT1 intron 2 retention contributes to its speckle localization but not the retained PURPL intron, the retained introns or their 3' splice site sequences should be swapped to see if they determine the localization.

      (2) In Figure 3, the rescue of the PURPL knockout by the intron-retained RNA to induce proliferation is a powerful experiment, but it lacks a rescue with the RNA without the intron as a control. This must be done and shown.

      (3) The weakness of the PPT of PURPL intron 2 appears as a clear feature of its retention dependent on U2AF2, which appears direct, as backed by CLIP data. It would be good to show direct binding by EMSA or equivalent techniques. Furthermore, the data is also consistent with other determinants. The exon and upstream intronic sequences, including the branch point, could also be involved, so mutations in these are also required.

      (4) In brief, what are the commonalities and differences between PURPL and MALAT1 with regard to their U2AF2-dependent intron retention?

    1. eLife Assessment

      This important study establishes the first vertebrate models of DeSanto-Shinawi Syndrome, revealing conserved craniofacial and social and behavioral phenotypes across mouse and zebrafish that mirror key clinical features. The convincing evidence is supported by behavioral, anatomical, and molecular analyses of Wac animal mutants. This study sets a baseline for future mechanistic studies and reports a platform to test approaches to reverse phenotypes.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      The authors generated mouse and zebrafish models for DeSanto-Shinawi Syndrome, caused by loss-of-function variants in the WAC gene. Using these vertebrate systems, they demonstrate conserved craniofacial and social-behavioral phenotypes that parallel human clinical features, along with deficits in GABAergic markers. They observe increased seizure susceptibility and male-biased brain volumetric changes in Wac mutant mice. Together, these findings begin to define the biological consequences of Wac haploinsufficiency and provide valuable resources for future mechanistic studies.

      Strengths:

      WAC is a high-confidence neurodevelopmental disorder gene and one of the genes identified by large-scale exome sequencing efforts, including the Satterstrom et al. (2020) autism spectrum disorder cohort. This study establishes the first vertebrate Wac models, addressing a major gap in the understanding of DeSanto-Shinawi Syndrome, and provides a framework for studying other syndromic forms of autism. The models generated will be impactful and useful to the community to study and understand DeSanto-Shinawi Syndrome.

      The cross-species analysis is important and well executed, and reveals both conserved and divergent phenotypes. The behavioral and anatomical assays are rigorously executed and well-controlled, and the inclusion of RNA-sequencing analyses adds valuable insights into the mechanisms underlying brain function in Wac mutants. Notably, the RNA-seq data reveal upregulation of several clustered protocadherins, genes central to neuronal identity and cell-cell interactions, which are known to be regulated by dynamic developmental regulation of chromatin architecture. This observation provides an intriguing hint that could link Wac function to higher-order chromatin organization and neuronal connectivity.

      Weaknesses:

      The evidence is solid, though the study remains incomplete in its mechanistic depth and molecular interpretation. The authors compellingly describe behavioral, anatomical, and transcriptomic phenotypes associated with WAC loss, yet do not explore how WAC mechanistically regulates chromatin or transcription. Given prior evidence that WAC interacts with the RNF20/40 ubiquitin ligase complex and promotes histone H2B ubiquitination and transcriptional elongation, the paper would benefit from a discussion of these functions as a potential link between Wac haploinsufficiency and the observed changes in neuronal gene expression. Similarly, the authors mention WAC's WW and coiled-coil domains but do not consider how these domains could mediate nuclear interactions or recruitment of transcriptional cofactors that shape gene regulation and chromatin organization in neurons.

      The transcriptomic analysis is rich but largely descriptive. Although the upregulation of clustered protocadherins is particularly intriguing, these findings are not validated or localized to specific neuronal populations. The study would be strengthened by independently validating the most significant RNA-seq changes, such as protocadherin gamma genes, using in situ hybridization methods to confirm the spatial and cellular specificity of expression changes.

    3. Reviewer #2 (Public review):

      The authors describe the first deep neurological characterization of WAC mutation in two vertebrate species (zebrafish and mouse). They examine these at various levels, guided by the work in humans that has associated a heterozygous WAC mutation with DeSanto-Shinawi Syndrome (DESSH). Therefore, they investigate the animals for a variety of phenotypes, following a template for what is seen when characterizing a new mouse/fish model of a developmental disability gene. Investigations include analysis of skull and jaw for abnormalities (both species), MRI of brain structure (in mice), electrophysiology (mice), assessment of signaling pathways (by Western blot, in mice), cell counts (both, more in mice), transcriptomics (mice), and behavior (both).

      Generally, this describes an important first characterization of the consequences of the mutation. Most of the studies appear well-conducted and reasonably powered, thus solid or convincing.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors generated mouse and zebrafish models for DeSanto-Shinawi Syndrome, caused by loss-of-function variants in the WAC gene. Using these vertebrate systems, they demonstrate conserved craniofacial and social-behavioral phenotypes that parallel human clinical features, along with deficits in GABAergic markers. They observe increased seizure susceptibility and male-biased brain volumetric changes in Wac mutant mice. Together, these findings begin to define the biological consequences of Wac haploinsufficiency and provide valuable resources for future mechanistic studies.

      Strengths:

      WAC is a high-confidence neurodevelopmental disorder gene and one of the genes identified by large-scale exome sequencing efforts, including the Satterstrom et al. (2020) autism spectrum disorder cohort. This study establishes the first vertebrate Wac models, addressing a major gap in the understanding of DeSanto-Shinawi Syndrome, and provides a framework for studying other syndromic forms of autism. The models generated will be impactful and useful to the community to study and understand DeSanto-Shinawi Syndrome.

      The cross-species analysis is important and well executed, and reveals both conserved and divergent phenotypes. The behavioral and anatomical assays are rigorously executed and well-controlled, and the inclusion of RNA-sequencing analyses adds valuable insights into the mechanisms underlying brain function in Wac mutants. Notably, the RNA-seq data reveal upregulation of several clustered protocadherins, genes central to neuronal identity and cell-cell interactions, which are known to be regulated by dynamic developmental regulation of chromatin architecture. This observation provides an intriguing hint that could link Wac function to higher-order chromatin organization and neuronal connectivity.

      Weaknesses:

      The evidence is solid, but the study remains incomplete in its mechanistic depth and molecular interpretation. The authors compellingly describe behavioral, anatomical, and transcriptomic phenotypes associated with WAC loss, yet do not explore how WAC mechanistically regulates chromatin or transcription. Given prior evidence that WAC interacts with the RNF20/40 ubiquitin ligase complex and promotes histone H2B ubiquitination and transcriptional elongation, the paper would benefit from a discussion of these functions as a potential link between Wac haploinsufficiency and the observed changes in neuronal gene expression. Similarly, the authors mention WAC's WW and coiled-coil domains but do not consider how these domains could mediate nuclear interactions or recruitment of transcriptional cofactors that shape gene regulation and chromatin organization in neurons.

      We agree that many of the mechanisms by which the Wac gene gives rise to both the animal model phenotypes and the human symptoms still need to be worked out. Because a great deal of data was needed to first describe these models in this manuscript, the mechanistic work will be expanded upon later; we plan to follow up with mechanistic papers to fully address the gap that remains. We have now added a paragraph in the discussion to bring up these important points regarding the roles of Wac during transcription and how its protein domains might be involved in these processes.

      The transcriptomic analysis is rich but largely descriptive. Although the upregulation of clustered protocadherins is particularly intriguing, these findings are not validated or localized to specific neuronal populations. The study would be strengthened by independently validating the most significant RNA-seq changes, such as protocadherin gamma genes, using in situ hybridization methods to confirm the spatial and cellular specificity of expression changes.

      We have greatly expanded the analyses of the bulk RNA-seq data, including a more rigorous look into the differences in gene expression between sexes, which has additionally revealed males to be more impacted by Wac loss of function. We have also added new western blot data for pan protocadherin alpha, which is now validated to be upregulated in the cortex (new Figure 7I and 7J). We are holding back any additional data from this report as we have single nucleus RNA-seq data that will be reported on in follow-up papers with targeted conditional deletion models.

      Finally, while the behavioral and MRI results add valuable breadth, their interpretation would be improved by clearer reporting of sample sizes, statistical corrections, and effect sizes to support claims of sex-specific and regional brain volume differences.

      Some additional details have been added to the methods section. In addition, we have now provided sample sizes assessed in each figure legend.

      Reviewer #2 (Public review):

      The authors describe the first deep neurological characterization of WAC mutation in two vertebrate species (zebrafish and mouse). They examine these at various levels, guided by the work in humans that has associated a heterozygous WAC mutation with DeSanto-Shinawi Syndrome (DESSH). Therefore, they investigate the animals for a variety of phenotypes, following a template for what is seen when characterizing a new mouse/fish model of a developmental disability gene. Investigations include analysis of skull and jaw for abnormalities (both species), MRI of brain structure (in mice), electrophysiology (mice), assessment of signaling pathways (by Western blot, in mice), cell counts (both, more in mice), transcriptomics (mice), and behavior (both).

      Generally, this describes an important first characterization of the consequences of the mutation. Most of the studies appear well-conducted and reasonably powered, thus solid or convincing. However, there are a few places where the data presentation could be improved for clarity, and a few concerns about some choices in analytical approach for a couple of the experiments, where improved statistical approaches could improve their sensitivity and/or better rule out false positives, and thus the support of some of these claims is currently incomplete. There is also some lack of clarity about the rationale for some decisions regarding the fish genetics. Nonetheless, this is an important and useful first characterization of many phenotypes of these lines. Such experiments form a baseline for future mechanistic studies in the same lines and a platform to test approaches to reverse phenotypes.

      Individual claims and their strength & weaknesses:

      (1) The authors developed mouse and zebrafish models of WAC deletion

      They used the existing KOMP floxed WAC line to generate a null allele. For the mouse, there is a Western showing that it is indeed null for the protein. The fish data is less robustly validated - they don't confirm the allele in null at the protein or RNA level, and fish have two paralogs (waca and wacb), and this paper only characterizes one of these. So this evidence is less clear. The evaluated mice are heterozygous (Het), similar to patients, while the fish appear to be evaluated as homozygous mutants.

      We agree with the reviewer’s comments on zebrafish genetics. Since antibodies against zebrafish Wac proteins are not available, we could not examine protein levels in zebrafish. We predicted frameshift mutations based on DNA analyses in waca and wacb KO zebrafish. We made waca KO, wacb KO, and waca/wacb double KO zebrafish. The waca/wacb double KO zebrafish showed a lethal phenotype, similar to homozygous mouse mutants. Since wacb KO zebrafish did not show any detectable phenotype, we do not report those here. However, we now show examples of the wacb and dKO zebrafish in Figure S1. Since waca KO zebrafish showed craniofacial and behavioral phenotypes that are comparable to those of Het mice and human patients, they are the focus of this report.

      (2) The authors show that both species show altered craniofacial features

      These data appear well powered, and the findings are robust.

      We appreciate this confirmation.

      (3) Each model altered GABAergic neurons

      In mice, the authors stained with PV antibodies and saw a decrease in cells positive for this staining. A second marker, Lhx6, does not show a difference, suggesting this might be a change in PV expression rather than cell number. They could maybe look into the literature to see if this loss of just the protein also occurs in other models. Overall, the sample size here is a bit smaller than other parts of the paper (n=3), and the methods on the cell counts were less clear, so it is not as clear that this finding is as robust. The authors counted several other broad classes of cells, and those appear normal. Interestingly, there might also be some TBR1 mislocalization in layer 6 that might be significant with added power.

      Thank you for these suggestions. Yes, other models also show this lack of PV expression even when MGE-lineage interneurons are present at normal levels. We mention in the discussion a previous study on the ASD gene CNTNAP2 that showed this. We also agree that there is a trend in the Tbr1 population. We assessed another WT and Het pair for Tbr1 laminar distribution and were able to determine that these changes held up and are now significantly different; the person counting was blind to the genotypes. Finally, we added more details to the methods to describe how the counting was performed.

      The fish data is based on an in situ hybridization for GAD. The measure shown is the width of the positive area in the forebrain. This measure is not one I have seen much before, and has potential to be driven by something unrelated to GABA (e.g., if the whole forebrain were simply a bit smaller). So this analysis could use a couple of other approaches (density of signal?) and/or a control probe for some other brain gene showing the measure is normal, and thus it is not just a size issue.

      To compare altered GABAergic neurons in mice and zebrafish, we tried to isolate zebrafish PV genes and examined their expression by whole-mount in situ hybridization (now included in Figure S3), but found no differences, and we could not find any zebrafish PV gene useful as a marker for GABAergic neurons. We therefore chose to examine gad1b expression in the positive area of the forebrain in WT and waca KO zebrafish, and found differences in the brain area with gad1b expression. Since WT and waca KO brain sizes are generally the same, we believe this measurement is a reasonable basis for this conclusion and have added text to the results section to justify it.

      (4) Mice were more susceptible to the seizure-inducing agent PTZ

      These data appear well powered, and the findings are robust. The authors also did a fair amount of useful electrophysiology that was all normal, but appeared to be well executed.

      Thank you, we appreciate this confirmation.

      (5) Mice had changes in brain volume that interact with sex

      The authors conducted an MRI on a good number of mice and reported a slight increase in global volume just in males. Sample size is fair, but the statistical approach here may be better if it puts males and females in the same model (to boost power and explicitly test for sex by genotype interaction that they report), and there is some chance that the brain region level differences that they report could include some false positives. They tested many regions, and it is not clear whether or not they corrected for the number of tests. Often, an FDR correction would be used in such imaging studies. It may be that only the most robust regional findings will survive those corrections. It is interesting data either way, but the analysis could be improved.

      Given the 80 regions (bilaterally) that we used and the number of mice, i.e., 6-7, we are underpowered to robustly undertake FDR-type corrections. In the data presented we used t-tests between sex and regions to illuminate putative regional changes. However, we did revisit our MRI data and found three data sets where the results were not normally distributed. We thus changed our statistical test to Mann-Whitney for male retrosplenial cortex, male parietal cortex and female corpus callosum; these changes are now reflected in the figures, and the different statistical tests are noted in the figure legends.

      (6) Several behaviors are altered in the mice as well

      These studies were fairly well-powered (n=15,16), and they found several positive and negative results, including alterations in memory and sociability in both species. There is a minor statistical flaw in the three-chamber analysis (they don't actually compare the Hets directly to the wildtypes in their statistical testing), a common mistake in neuroscience that should be addressed. But the data look like they will probably still be significant when correctly analyzed. In the supplement, the authors could do a bit more with the data they have to look at hyperactivity (i.e., show total motion in open field, not just time in center vs. periphery), and adding sex to their model might improve sensitivity for genotype effects.

      Thank you for these suggestions. We have done several things to address this behavioral paradigm. First, we increased the sample size and switched from comparing time with the mouse vs. the object within each genotype to comparing the genotypes directly. In addition, we switched to quantifying a discrimination index, described in Phillips et al., 2019 (PMID: 31112129), for our measurement. These new data are shown in Figure 3A. Open field total distance traveled has now been added to Figure S2A. For all other measurements, we did first assess for sex differences but found none and thus pooled both sexes for the graphs.

      (7) Some biochemical signaling pathways are altered in the brain

      These are n=4 immunoblots, and show altered phospho ERK, but no changes in other signaling events predicted from prior WAC literature like H2B ubiquitination. They appear well done, and the authors share the full blots in the supplement.

      Thank you, we appreciate this confirmation. Since Wac is an adaptor protein, we needed to test in neurons these molecular changes that were previously only reported in cell lines and Drosophila. We were not surprised that some of these previously reported changes would not be the same in brain cells. However, it is possible that these changes might arise in more discrete brain regions or at different times during development, which will be tested in our future conditional knockout models.

      (8) WAC deletion also alters gene expression in the brain

      These studies were well-powered for RNAseq, with 10 and 14 samples, using neonates (P2), just the forebrain. The sequencing quality metrics all looked good, and the approach to analysis was okay. It would be stronger to again include sex in the model, rather than separate by sex. There were some typos in this part of the paper that made part of the conclusions unclear, but the RNAseq nicely confirmed the mutation of the mice, and discovered many differentially expressed genes, consistent with the role of this gene as a regulator of transcription. The presentation could be expanded to make more use of the data. Overall, though, this is a useful first characterization of the transcriptome in the line.

      Thank you for the suggestions. We have greatly expanded our assessments of the RNA-seq data. Upon analysis of the data we found many differences between males and females and now show combined and sex-separated data. Our new data isolate several more extreme and some unique changes in males that are better shown as stand-alone figure panels. In addition to these edits, we have also reworked all the text in this section of the results for better readability.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The cause and timing of lethality in the homozygous Wac knockout should be reported or discussed. Investigating Wac homozygous knockout embryos, if viable at early stages, could provide valuable insight into the developmental origins of the neuroanatomical and behavioral phenotypes described in the heterozygous animals. Even a brief histological or transcriptomic characterization of embryonic brains would strengthen the mechanistic understanding of Wac function during neurodevelopment.

      We agree and have collected embryos as early as embryonic day 12.5 from multiple litters but never detected a knockout. We have added this text to the animal methods section so that readers understand that an effort was made to determine when death occurs. While we do not currently explore this further in mice, we now include zebrafish waca; wacb double knockouts. Notably, while we were able to generate a few of these mutants, most died. However, some zebrafish were aged long enough to observe lethal deficits in heart formation and swim bladder development, suggesting that early loss of Wac could impact these critical organs and lead to death.

      (2) A better description of the data reported in Supplementary Tables 3 through 5 is needed. Supplementary Table 3 does not report any statistically significantly differentially expressed genes in the FDR column, and Supplementary Table 5 reports only two, and the reader should understand what the columns are indicating.

      We have now added figure legend text to the supplementary file to explain each Table mentioned here.

      Reviewer #2 (Recommendations for the authors):

      (1) Page 3, last paragraph. The description of wacb is confusing. I recommend that the authors provide the unshown data they mention and also further explanation of the breeding scheme and result. Indeed, if wacb is homozygous lethal, does that make it more like the mouse WAC gene, and thus potentially the more relevant paralogue to study? Are both waca and wacb expressed in the same tissues? How does that compare to mouse and human WAC expression? Such figures about gene expression (even when adapted with permission from public resources like Allen brain atlas or GTEX) are common in this sort of paper, as they can be helpful to understand when and where the gene is thought to act. For waca vs. wacb, they may help determine which gene is more relevant to the brain (for example, if only one is expressed in the brain).

      First, this is a great question and we have now added whole-mount in situ hybridization for the waca and wacb genes as Figure S1. These data show low to no wacb expression in brain regions while waca is highly expressed there. Since the waca mutants showed phenotypes relevant to DESSH but wacb mutants did not, this correlates with the observed expression patterns without fully excluding wacb from any role. Thus, we also made waca/wacb double KO zebrafish, which showed a lethal phenotype, similar to homozygous mouse mutants. Only a few waca; wacb double knockouts survived for a short time during development, and these are now shown in Figure S1. Since wacb KO zebrafish did not show any detectable phenotype on their own, we did not include the data, as there are already several figures/tables in this manuscript. However, the waca KO zebrafish did show phenotypes similar to humans with DESSH and are the ones we focused on.

      (2) Why did the authors cross the mice into the outbred CD1 background? Usually, most labs keep the lines on an inbred background. Was there a particular rationale here? I am not saying that they could not outcross them. It is just a bit puzzling why. Perhaps a sentence of explanation in the methods section would be warranted.

      This is a great question and we have now added text to the animal methods section. Many labs that study development, especially on genes critical for survival/life like the Wac gene, use a more robust strain like CD-1. By doing this, we have a better chance of evaluating mutants at more mature ages and getting enough progeny to do more reproducible studies.

      (3) A typical first experiment in a new knockout (fish or mouse) is to establish that the deletion does indeed result in a loss of RNA and protein. In the absence of this, the rest of the paper cannot be as confidently interpreted.

      We did this for the mouse model and found reduced protein expression in the constitutive Het; however, this datum is part of the western blots in Figure 5. We now mention in the early results section that protein levels were reduced in the Hets but maintain that the presentation of the western blot is better suited to Figure 5, where it can be compared to the other western blots. For zebrafish this was attempted but was more difficult. Available antibodies don’t work in zebrafish. RNA expression was attempted in both models and due to Wac being a critical gene for life, there are checks in place to upregulate faulty and normal RNA in the waca model. We screened for frameshift mutations in multiple KO lines and confirmed them by genomic DNA sequencing. In making many KOs and large-scale mutagenesis in zebrafish, we usually depend on phenotype-genotype segregation in Mendelian inheritance for many generations.

      (4) Are these new lines indeed knockouts? I did find a WAC western as part of a later figure for the mouse. The authors may want to mention that earlier, or present at least that data right away. What about in the fish? Is there a way to confirm at the RNA or protein level that it is indeed a null allele?

      Yes, as mentioned in the above response we have now mentioned our Wac western blot results early when introducing the mouse mutants and the issues with doing this in fish are presented above as well.

      (5) Why are fish used that are KO while mice are Hets? Are WAC homozygous mice not viable? This should be mentioned. Regardless, the rationale for examining heterozygous mice and homozygous mutant fish should be provided. Each kind of experiment is useful, but they are interpreted in different ways. Hets will genocopy the patients, who are generally hets, while KOs are often useful for a study of the essential roles of the genes, even if they are not really modeling the patient gene dose.

      Wac homozygous mice in our hands are embryonic lethal, now mentioned in the animal methods section, but we found early on that the Hets mimic several features of human DESSH patients. In zebrafish it is more complicated. We analyzed waca and wacb hets in zebrafish but found no phenotypes. This could be in part due to some complementation between the waca and wacb genes. It is also possible that a full waca KO could resemble a human DESSH individual since wacb may complement somewhat, even though deleting wacb entirely does not have a measurable phenotype. We have added more text to the discussion to explore these complexities. We also made waca/wacb double KO (dKO) zebrafish, but they showed a lethal phenotype, similar to homozygous mouse mutants, suggesting some complementation by the wacb gene even though its loss alone did not produce phenotypes.

      (6) Figure 3A: It does not appear that the authors are directly statistically comparing the two groups (genotypes) that they are drawing conclusions about. This is an unfortunately common mistake in the neuroscience literature across papers. There is a nice older review about it here: https://pubmed.ncbi.nlm.nih.gov/21878926/. To draw conclusions about the differences between the mouse genotypes, they need to compare the two genotypes directly with a statistical test. See Nygard et al. for a recommended approach, like comparing social preference indexes (https://onlinelibrary.wiley.com/doi/abs/10.1002/aur.2154).

      Thank you for this information. Previous reviewers at a different journal asked for this particular evaluation. We have now made changes to address the assessment, and graphs now reflect comparisons of genotypes instead of a single genotype between time with a mouse or object. We have also moved to using a social discrimination index to compare the genotypes, similar to the study mentioned.
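      The direct genotype comparison discussed in this exchange can be sketched as follows. This is an illustration only, not the authors' analysis: the index formula (time with mouse minus time with object, divided by their sum) is one commonly used definition and is assumed here rather than taken from Phillips et al. or Nygard et al., and the function names `discrimination_index` and `permutation_p_value` are hypothetical. A permutation test on the difference in group means stands in for whatever test the authors actually applied.

```python
import random
from statistics import mean


def discrimination_index(t_social, t_object):
    """Per-animal index in [-1, 1]; assumed form (social - object) / (social + object)."""
    total = t_social + t_object
    if total == 0:
        raise ValueError("animal spent no time with either stimulus")
    return (t_social - t_object) / total


def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in mean index between genotypes."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffle genotype labels and recompute the group difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)
```

      The point of the design is that one number per animal is computed first and the two genotypes are then compared with a single test, rather than testing mouse vs. object separately within each genotype.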

      (7) MRI - it is a bit weird to separate the male and female brains just for the MRI. Was there a premise from human data to do so? If not, the authors should probably pool them. If they are concerned there are sex effects (or, more likely, a sex by genotype interaction) I recommend that they use a two-factor ANOVA and simply put both sex and genotype into the model. This will also have the advantage of increasing their statistical power for genotype effects a bit. If their current results are robust, they will still show up as a significant sex x genotype interaction.

      All data in the manuscript were initially compared between the sexes. We have now added this text to the animal section of the methods: for MRI, some zebrafish behaviors and now the RNA-seq data, the sexes differed, and because of this observation sex was (or now is) presented independently for these measurements. We now state that if no sex differences were observed the data were pooled.

      (8) Also, did the authors correct for multiple testing in the MRI analysis? Since they are testing many regions, there is a risk of false positives if they do not. This could be confounded further by their splitting the data by sex, thus doubling the number of tests.

      As noted above, we did not apply multiple-testing corrections given the large number of regions and low number of replicates.
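      For readers unfamiliar with the FDR correction raised by the reviewer, the Benjamini-Hochberg procedure can be sketched as below. This is a generic illustration under stated assumptions, not something the authors ran: the function name `benjamini_hochberg` is hypothetical, and the input would be one p-value per tested region (for example, the 80 regions mentioned in the response).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

      A region would then be reported as significant if its adjusted value falls below the chosen FDR threshold.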

      (9) How many images per animal were analyzed for the cell counts? This detail is absent from the methods and would help with evaluating the robustness of these findings. What other approaches were used to make sure the counting was unbiased?

      We analyzed 3-4 images per animal for counts and counted hundreds of cells per image. In addition, the person counting was blinded to avoid any bias. These details have now been updated in the methods.

      (10) As with the MRI, for the DEG analysis, I recommend the authors simply put sex and genotype into the same model as two factors (with an interaction), to increase their sensitivity to genotype effects, as well as be able to report on robust genotype x sex differences, if there are any. They may also consider testing the model with and without excluding the three outlier animals on their PCA. It may be that the noise of those outliers is detracting from their sensitivity for DEGs somewhat.

      We greatly expanded our analyses and found more robust and unique changes in males that are now added to Figure 7 and supplemental files. After considering the data, we decided to highlight the sex differences separately.

      (11) A few more relatively simple things could readily be done with the RNAseq data to add some depth and interpretation. For example, do the hits here overlap other published IDD/autism DEG lists from mouse knockouts studies of genes like FoxP2, Chd8, Dnmt3a, Myt1l, Tcf4, etc? Do autism genes show up in the lists of hits here? And if so, more than expected by chance? Can they provide some visualization of their GO results in the main figure?

      When we looked into the sex differences more closely, we found that only the males showed significant upregulation of other autism risk genes, an increase that was previously unappreciated when the sexes were assessed together. Yes, several autism genes do show up, but the effect is heavily biased to males. Our main Figure 7 and new supplemental files show new GO term analyses and provide additional data looking not only at autism but also at other factors.
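
      The question in point (11) of whether autism genes appear "more than expected by chance" is usually answered with a hypergeometric (one-sided Fisher) test. The sketch below assumes that approach; the function name and the gene counts are hypothetical and do not come from the manuscript.

```python
from math import comb

def overlap_p_value(n_background, n_deg, n_reference, n_overlap):
    """P(overlap >= n_overlap) when n_deg genes are drawn at random from
    n_background expressed genes, n_reference of which are on the reference list."""
    total = comb(n_background, n_deg)
    upper = min(n_deg, n_reference)
    tail = sum(
        comb(n_reference, k) * comb(n_background - n_reference, n_deg - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 10 background genes, 3 DEGs, 4 reference genes, 2 shared.
p = overlap_p_value(n_background=10, n_deg=3, n_reference=4, n_overlap=2)
```

      The background should be the set of genes detectably expressed in the experiment rather than the whole genome; otherwise the enrichment is overstated.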

      (12) It appears the IMPC has phenotyped this mouse somewhat, including craniofacial abnormalities. They also report on some blood cell differences. Anyway, if no one has written about that data yet (as it was generated in the context of a big consortium effort), their guidelines may allow you to include some of their data as Supplementary Figures here with proper attribution. It might help to at least summarize useful findings from there in your discussion.

      Due to the large number of figures/tables already in this report we don’t think this will be helpful. However, we do refer readers to the consortium in the animal methods section so they can explore data already generated by the IMPC.

      (13) Minor/Typos:

      (a) Figure 2K: I am confused by the description of three genotypes in the legend, but only two in the panel?

      Corrected.

      (b) I found it a little distracting that some results figures were embedded in the introduction.

      We have moved the figures further in the manuscript to start in the results section.

      (c) I don't understand this sentence: "Due to reduced sample size, sex-stratified DE was performed without model corrections at FDR < 0.1, 7 and found genes significantly upregulated and downregulated, respectively;" The sample size here seemed robust, so I am not sure what they were referring to? Are there missing numbers from this sentence? What is the 7? I think there are enough typos here that I am not sure how to evaluate this claim. Thus, the writing and clarity of this part could be improved.

      This section had several typos that have now been corrected.

      (d) "Marwan Shinawi, (unpublished results)" is a bit atypical of a citation. Are these results being reported with his permission? If so, then it should say 'personal communication' (if the journal permits this - some do not). If not, they should not report someone else's unpublished results without their explicit permission. It might upset some people to have their results presented this way.

      We have changed "unpublished results" to "personal communication". Marwan Shinawi is an author on this manuscript and has approved of everything we have reported.

      (e) In all figures, consider shape or color coding for sex, even when pooling the data (e.g, the data points in the behavior figures).

      This is a good idea but since we found no difference when analyzing the data we don’t see how this extra work will make a difference. Since we now mention that sex differences were only presented as separate graphs when observed in the methods we think this should be acceptable.

    1. eLife Assessment

      This important study reports that an oncogenic population in an epithelium can either be repressed or spread, depending on the tissues. This work provides convincing evidence, supported by pharmacological perturbations and numerical simulations using the vertex model, that the principle of "high heterotypic interfacial tension" that appears to drive cell sorting and tissue segregation in embryonic models similarly applies to cancer cell behaviour.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      The behaviour of cells expressing constitutively active HRas is examined in mosaic monolayers, both in MCF10a breast epithelial and Beas2b bronchial epithelial cell lines, mimicking the potential initial phase of development of carcinoma. Single HRas-positive cells are excluded from MCF10a but not Beas2b monolayers. Most interestingly, however, when in groups, these cells are not excluded, but rather sharply segregated within an MCF10a monolayer. In contrast, they freely mix with wt Beas2b cells. Biophysical analysis identifies high tension at heterotypic interfaces between HRas and wild-type cells as the likely reason for segregation of MCF10a cells. The hypothesis is supported experimentally, as myosin inhibition abolishes segregation. The probable reason for lack of segregation in the bronchial epithelium is to be found in the different intrinsic properties of these cells, which form a looser tissue with lower basal actomyosin activity. The behaviour of single cells and groups is recapitulated in a vertex model based on the principle of differential interfacial tension, under the condition of high heterotypic interfacial tension.

      Strengths:

      Despite being long recognized as a crucial event during cancer development, segregation of oncogenic cells has been a largely understudied question. This nice work addresses the mechanics of this phenomenon through a straightforward experimental design, applying the biophysical analytical approaches established in the field of morphogenesis. Comparison between two cell types provides some preliminary clues on the diversity of effects in various cancers.

    3. Reviewer #2 (Public review):

      Summary:

      The authors investigate the behavior of oncogenic cells in mammary and bronchial epithelia. They observe that individual oncogenic cells are preferentially excluded from the mammary epithelium, but they remain integrated in the bronchial epithelium. They also observe that clusters of oncogenic cells form a compact cluster in mammary epithelium, but they disperse in the bronchial epithelium. The authors demonstrate experimentally and in the vertex model simulations that the difference in observed behavior is due to the differential tension between the mutant and wild-type cells due to a differential expression of actin and myosin.

      Strengths:

      (1) Very detailed analysis of experiments to systematically characterize and quantify differences between mammary and bronchial epithelia.

      (2) Detailed comparison between the experiments and vertex model simulations to identify the differential cell line tension between the oncogenic and wild-type cells as one of the key parameters that are responsible for the different behavior of oncogenic cells in mammary and bronchial epithelia.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The behaviour of cells expressing constitutively active HRas is examined in mosaic monolayers, both in MCF10a breast epithelial and Beas2b bronchial epithelial cell lines, mimicking the potential initial phase of development of carcinoma. Single HRas-positive cells are excluded from MCF10a but not Beas2b monolayers. Most interestingly, however, when in groups, these cells are not excluded, but rather sharply segregated within an MCF10a monolayer. In contrast, they freely mix with wt Beas2b cells. Biophysical analysis identifies high tension at heterotypic interfaces between HRas and wild-type cells as the likely reason for segregation of MCF10a cells. The hypothesis is supported experimentally, as myosin inhibition abolishes segregation. The probable reason for lack of segregation in the bronchial epithelium is to be found in the different intrinsic properties of these cells, which form a looser tissue with lower basal actomyosin activity. The behaviour of single cells and groups is recapitulated in a vertex model based on the principle of differential interfacial tension, under the condition of high heterotypic interfacial tension.

      Strengths:

      Despite being long recognized as a crucial event during cancer development, segregation of oncogenic cells has been a largely understudied question. This nice work addresses the mechanics of this phenomenon through a straightforward experimental design, applying the biophysical analytical approaches established in the field of morphogenesis. Comparison between two cell types provides some preliminary clues on the diversity of effects in various cancers.

      Weaknesses:

      Although not calling into question the main message of this study, there are a few issues that one may want to address:

      (1) One may be careful in interpreting the comparison between MCF10a and Beas2b cells as used in this study. The conditions may not necessarily be representative of the actual properties of breast and bronchial epithelia. How much of the epithelial organization is reconstituted under these experimental conditions remains to be established. This is particularly obvious for bronchial cells, which would need quite specific culture conditions to build a proper bronchial layer. In this study, they seemed to be on the verge of a mesenchymal phenotype (large gaps, huge protrusions, cells growing on top of each other, as mentioned in the manuscript).

      As an alternative to Beas2b, comparison of MCF10a with another cell line capable of more robust in vitro epithelial organization, but ideally with different adhesive and/or tensile properties, would be highly interesting, as it may narrow down the parameters involved in segregation of oncogenic cells.

      (2) While the seminal description of tissue properties based on interfacial tensions (Brodland 2002) is clearly key to interpreting these data, the actual "Differential Interfacial Tension Hypothesis" poses that segregation results from global differences, i.e., juxtaposition of two tissues displaying different intrinsic tensions. On the contrary, the results of the present work support a different scenario, where what counts is the actual difference in tension ALONG the tissue boundary, in other words, that segregation is driven by high HETEROTYPIC interfacial tension. This is an important distinction that should be clarified.

      (3) Related: The fact that actomyosin accumulates at the heterotypic interface is key here. It would be quite informative to better document the pattern of this accumulation, which is not clear enough from the images of the current manuscript: Are we talking about the actual interface between mutant and wt cells (membrane/cortex of heterotypic contacts)? Or is it more globally overactivated in the whole cell layer along the border? Some better images and some quantification would help.

      (4) In the case of Beas2b cells, mutant cells show higher actin than wt cells, while actin is, on the contrary, lower in mutant MCF10a cells (Figure 2b). Has this been taken into account in the model? It may be in line with the idea that HRas may have a different action on the two cell types, a possibility that would certainly be worth considering and discussing.

      Comments on revisions:

      There is still one last point that should be made even clearer:

      The system is being modelled based on the principle of INTERFACIAL TENSION, a description pioneered by the works of Steinberg and of Harris, and nicely conceptualized by Brodland (2002). Now the observed behaviour is a perfect case of sorting based on higher interfacial tension AT the boundary between cell types (with nice additional documentation of local actin and myosin enrichment in the revised manuscript). What needs to be made crystal clear is that this is NOT equivalent to the model of DITH ("DIFFERENTIAL INTERFACIAL TENSION HYPOTHESIS") (Brodland 2002, Krieg et al 2008). It is important to stop using DITH in this context, as it leads to confusion and misinterpretations. Indeed, DITH predicts cell/tissue sorting based on differences in interfacial tension WITHIN the two cell types. While DITH accounts for relative POSITIONING (one tissue engulfing the other), it is now established that this is not the motor for cell sorting and tissue segregation, the key parameter being heterotypic tension at the heterotypic interface. I thus invite the authors to avoid the terms "differential"/DITH, and rather use either "interfacial tension", or specifically "HIGH HETEROTYPIC INTERFACIAL TENSION".

      Related: the authors correctly cite Canty et al NatComm2017 when discussing this phenomenon. I suggest adding a key supporting reference: "D.M. Sussman, J.M. Schwarz, M.C. Marchetti, M.L. Manning, Soft yet sharp interfaces in a vertex model of confluent tissue, Phys. Rev. Letters 120 (2018) 058001". Another pioneering work, in Drosophila, that one may include is "M. Aliee, J.C. Roper, K.P. Landsberg, C. Pentzold, T.J. Widmann, F. Julicher, C. Dahmann, Physical mechanisms shaping the Drosophila dorsoventral compartment boundary, Curr. Biol. 22 (2012) 967-976."

      We thank the reviewer for this important clarification. We fully agree that the mechanism underlying the observed segregation in our system is best described in terms of elevated heterotypic interfacial tension, rather than the classical Differential Interfacial Tension Hypothesis (DITH). As the reviewer correctly points out, DITH in its original formulation refers to differences in intrinsic interfacial tensions within each cell population, which primarily governs relative positioning (e.g., tissue engulfment), rather than the local sorting dynamics we observe here.

      In contrast, our experimental and modeling results support a scenario in which segregation is driven by increased tension specifically at heterotypic interfaces between HRasV12 and wild-type cells. We agree that continued use of the term “Differential interfacial tension” in this context may lead to conceptual ambiguity.

      Accordingly, we have revised the manuscript throughout to replace references to “differential interfacial tension” with more precise terminology, namely “interfacial tension” or “heterotypic interfacial tension”, wherever appropriate. We have also updated the Discussion to explicitly clarify this distinction and its implications for interpreting our results.

      We thank the reviewer for suggesting additional relevant literature, which we have now included.

      Reviewer #2 (Public review):

      Summary:

      The authors investigate the behavior of oncogenic cells in mammary and bronchial epithelia. They observe that individual oncogenic cells are preferentially excluded from the mammary epithelium, but they remain integrated in the bronchial epithelium. They also observe that clusters of oncogenic cells form a compact cluster in mammary epithelium, but they disperse in the bronchial epithelium. The authors demonstrate experimentally and in the vertex model simulations that the difference in observed behavior is due to the differential tension between the mutant and wild-type cells due to a differential expression of actin and myosin.

      Strengths:

      Very detailed analysis of experiments to systematically characterize and quantify differences between mammary and bronchial epithelia

      Detailed comparison between the experiments and vertex model simulations to identify the differential cell line tension between the oncogenic and wild-type cells as one of the key parameters that are responsible for the different behavior of oncogenic cells in mammary and bronchial epithelia

      Weaknesses:

      It is unclear what is the mechanistic origin of the shape-tension coupling, which is used in the vertex model, and how important that coupling is for the presented results. Authors claim that the shape-tension coupling is due to the anisotropic distribution of stress fibers when cells are under external stress. It is unclear why the stress fibers should affect an effective line tension on the cell boundaries and why the stress fibers should be sensitive to the magnitude of the internal isotropic cell pressure. In experiments, it makes sense that stress fibers form when cells are stretched. Similar stress fibers form when cytoskeleton or polymer networks are stretched. It is unclear why the stress fibers should be sensitive to the magnitude of internal isotropic cell pressure. If all the surrounding cells have the same internal pressure, then the cell would not be significantly deformed due to that pressure and stress fibers would not form. Authors should better justify the use of the shape-tension coupling in the model, since most of the observed behavior is already captured by the differential tension even if there is no shape-tension coupling.

      We thank the reviewer for this comment. We agree that we did not provide a mechanistic origin for the shape-tension coupling. In our model, stress fiber formation, along with actin ring formation, indicated that cells at the interface were elongated. Hence, we hypothesised that an interfacial force could induce nematic alignment at the interface. However, such an activity would only be feasible if the interface interaction were sufficiently high. Thus, the isotropic pressure at the heterotypic interface served as a proxy for cell-cell interactions in our model. However, inspired by recent work [1], we have tested whether activation of cells at the interface by shear stress would produce similar results. Exploring this aspect will require additional simulations.

      (1) Pérez-Verdugo, F., Maniou, E., Galea, G. L., & Banerjee, S. (2026). Mechanosensitive feedback organizes cell shape and motion during hindbrain neuropore morphogenesis. Current Biology.

      The observed difference of shape indices between the interfacial and bulk cells in simulations in the absence of differential line tension is concerning. This suggests that either there are not enough statistics from the simulations or that something is wrong with the simulations. For all presented simulation results, the authors should repeat multiple simulations and then present both averages and standard deviations. This way it would be easier to determine whether the observed differences in simulations are statistically significant.

      The observed differences in shape indices between interfacial and bulk cells in simulations in the zero-line-tension case (Lambda=0) remain non-zero at the zero-stress threshold because the interface cells are still subject to the shape-dependent contribution gamma_ij, since the current model treats gamma_ij as independent of Lambda. We are exploring the possible relationship between Lambda and gamma_ij, and we will update this in the next version of the manuscript.
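
      For readers unfamiliar with the quantity under discussion, the cell shape index is conventionally defined in vertex models as perimeter divided by the square root of area. The sketch below assumes that conventional definition (the manuscript's exact definition is not quoted here) and computes it for a polygonal cell; the function name is hypothetical.

```python
from math import cos, sin, pi, sqrt, hypot

def shape_index(vertices):
    """Perimeter / sqrt(area) for a simple polygon given as ordered (x, y) vertices."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0          # shoelace formula term
        perimeter += hypot(x1 - x0, y1 - y0)     # edge length
    return perimeter / sqrt(abs(twice_area) / 2.0)

# Reference shapes: a regular hexagon and a unit square.
hexagon = [(cos(k * pi / 3), sin(k * pi / 3)) for k in range(6)]
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
q_hexagon = shape_index(hexagon)
q_square = shape_index(square)
```

      A regular hexagon gives roughly 3.72 and more elongated or irregular cells give larger values, so comparing the distributions for interfacial and bulk cells across repeated simulations is a direct way to test the reviewer's concern.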

      Recommendations for the authors:

      The editor recommends considering the new comment made by reviewer #1 in his/her report:

      "There is still one last point that should be made even more clear:

      The system is being modelled based on the principle of INTERFACIAL TENSION, a description pioneered by the works of Steinberg and of Harris, and nicely conceptualized by Brodland (2002). Now the observed behaviour is a perfect case of sorting based on higher interfacial tension AT the boundary between cell types (with nice additional documentation of local actin and myosin enrichment in the revised manuscript). What needs to be made crystal clear is that this is NOT equivalent to the model of DITH ("DIFFERENTIAL INTERFACIAL TENSION HYPOTHESIS") (Brodland 2002, Krieg et al 2008). It is important to stop using DITH in this context, as it leads to confusion and misinterpretations. Indeed, DITH predicts cell/tissue sorting based on differences in interfacial tension WITHIN the two cell types. While DITH accounts for relative POSITIONING (one tissue engulfing the other), it is now established that this is not the motor for cell sorting and tissue segregation, the key parameter being heterotypic tension at the heterotypic interface. I thus invite the authors to avoid the terms "differential"/DITH, and rather use either "interfacial tension", or specifically "HIGH HETEROTYPIC INTERFACIAL TENSION".

      Related: the authors correctly cite Canty et al NatComm2017 when discussing this phenomenon. I suggest adding a key supporting reference: "D.M. Sussman, J.M. Schwarz, M.C. Marchetti, M.L. Manning, Soft yet sharp interfaces in a vertex model of confluent tissue, Phys. Rev. Letters 120 (2018) 058001". Another pioneering work, in Drosophila, that one may include is "M. Aliee, J.C. Roper, K.P. Landsberg, C. Pentzold, T.J. Widmann, F. Julicher, C. Dahmann, Physical mechanisms shaping the Drosophila dorsoventral compartment boundary, Curr. Biol. 22 (2012) 967-976."

      Please see response to Reviewer 1

      Reviewer #2 (Recommendations for the authors):

      The authors have improved the manuscript and addressed some of my concerns. However, some of the questions were not adequately addressed.

      (1) I appreciate additional justification regarding the need for the shape-tension coupling in the vertex model. However, the authors have not answered my question regarding why the shape-tension coupling model should be sensitive to the magnitude of the internal isotropic cell pressure. In experiments, it makes sense that stress fibers form when cells are stretched, but it is unclear why the stress fibers should be sensitive to the magnitude of internal isotropic cell pressure. If all the surrounding cells have the same internal pressure, then the cell would not be significantly deformed due to that pressure, and stress fibers would not form.

      We thank the reviewer for pointing this out. We agree that we did not provide a mechanistic origin for the shape-tension coupling. In our model, stress fiber formation, along with actin ring formation, indicated that cells at the interface were elongated. Hence, we hypothesized that an interfacial force could induce nematic alignment at the interface. However, such an activity would only be feasible if the interface interaction were sufficiently high. Thus, the isotropic pressure at the heterotypic interface served as a proxy for cell-cell interactions in our model.

      However, inspired by recent work [1], we have tested whether activation of cells at the interface by shear stress would produce similar results. Exploring this aspect will require additional simulations.

      (1) Pérez-Verdugo, F., Maniou, E., Galea, G. L., & Banerjee, S. (2026). Mechanosensitive feedback organizes cell shape and motion during hindbrain neuropore morphogenesis. Current Biology.

      (2) I appreciate that the authors provided additional statistics related to simulations. I am still very concerned about the observed difference in the shape indices between the cells at the interface and the bulk, when the interfacial line tension is exactly zero (Lambda=0). In that case, the cells at the interface and in the bulk are identical, and there should be no difference in the shape indices. Are cells at the interface for the zero-line tension case (Lambda=0) still subject to the shape dependent contribution gamma_ij? If that contribution is still included for the cells at the interface, then this could explain why cells at the interface are still different from cells in the bulk even when Lambda=0.

      The observed differences in shape indices between interfacial and bulk cells in simulations in the zero-line-tension case (Lambda=0) remain non-zero at the zero-stress threshold because the interface cells are still subject to the shape-dependent contribution gamma_ij, since the current model treats gamma_ij as independent of Lambda. We are exploring the possible relationship between Lambda and gamma_ij, and we will update this in the next version of the manuscript.

      (3) Authors included several additional supplemental figures (Figs. S4, S5, S6, S7), but they are not discussed in the manuscript text. These new supplemental figures were only discussed in the rebuttal letter. These figures should also be discussed in the manuscript text.

      We have cited the new supplementary figures in the main text.

      (4) Authors have answered in the rebuttal letter what experimental data was used in Fig. 4c. This information also needs to be provided in the manuscript text.

      We have added this information in the caption of Figure 4

      (5) Supplementary Figure 3 is missing. That figure got moved to the appendix.

      This has been rectified in the Supplementary file and the citations have been updated accordingly in the main text.

      (6) At the end of section 4 in the main text, the authors introduced a new sentence regarding simulations of the vertex model with interfacial tension and mechanochemical feedback. The details of that model are described in the appendix, but it would be helpful to add a sentence or two already in the main text describing what is the mechanism of the mechanochemical feedback.

      We have added a line describing the mechanism of mechanochemical feedback.

      (7) In the definition of the eccentricity, 'a' should be the minor axis and 'b' the major axis, i.e., 'a' and 'b' should be swapped.

      We have corrected this.
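
      The correction in point (7) matters because the usual eccentricity formula only yields a real number when the minor axis sits in the numerator. The sketch below assumes the standard ellipse definition, e = sqrt(1 - a^2/b^2) with 'a' the minor and 'b' the major axis; the manuscript's formula is not quoted here, so this is an illustration rather than the authors' code.

```python
from math import sqrt

def eccentricity(minor_axis, major_axis):
    """Ellipse eccentricity: 0 for a circle, approaching 1 for an elongated shape."""
    if not 0 < minor_axis <= major_axis:
        # Swapped axes would put a negative number under the square root.
        raise ValueError("expected 0 < minor_axis <= major_axis")
    return sqrt(1.0 - (minor_axis / major_axis) ** 2)

e_circle = eccentricity(2.0, 2.0)
e_elongated = eccentricity(3.0, 5.0)
```

      With the axes swapped, the ratio exceeds one and the square root is undefined, which is why the guard raises instead of returning a value.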

      (8) There is a typo at the end of the vertex model description in the methods section. "The details of the shape-tension coupling is described in the interface." The word "interface" should be "appendix".

      We have fixed the typo.

      (9) In the appendix section describing the shape-tension coupling, the authors should explain how the cell's director n is defined.

      We have added a line in the appendix section describing shape-tension coupling explaining how the cell’s director n is defined.
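
      The authors' definition of the director is not reproduced in this response. As an illustration only, one common choice is the principal axis of the cell's shape (gyration) tensor built from its vertex positions; the sketch below assumes that choice, and the function name is hypothetical.

```python
from math import atan2, cos, sin

def cell_director(vertices):
    """Unit vector along the long axis of a polygonal cell.

    Uses the 2x2 second-moment tensor of the vertex positions about their
    centroid; the director is its eigenvector with the largest eigenvalue.
    The director is nematic, so n and -n describe the same orientation.
    """
    n = len(vertices)
    cx = sum(x for x, _ in vertices) / n
    cy = sum(y for _, y in vertices) / n
    sxx = sum((x - cx) ** 2 for x, _ in vertices) / n
    syy = sum((y - cy) ** 2 for _, y in vertices) / n
    sxy = sum((x - cx) * (y - cy) for x, y in vertices) / n
    # Orientation of the principal axis of a symmetric 2x2 tensor.
    theta = 0.5 * atan2(2.0 * sxy, sxx - syy)
    return cos(theta), sin(theta)

wide_cell = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
tall_cell = [(0.0, 0.0), (1.0, 0.0), (1.0, 4.0), (0.0, 4.0)]
n_wide = cell_director(wide_cell)
n_tall = cell_director(tall_cell)
```

      For a perfectly isotropic cell the tensor is degenerate and the direction returned is arbitrary, so any shape-tension coupling built on it should vanish in that limit.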

      (10) In Appendix Fig. 1, the two angles are defined as theta and theta' but the figure caption is defining angles theta_1 and theta_2. These angles need to be consistent.

      This has been fixed.

    1. eLife Assessment

      This study presents valuable findings on phase-separated condensate formation by the MUT-16 protein, which plays a key role in small RNA biogenesis. A detailed analysis of the interactions governing condensate formation was carried out using coarse-grained and all-atom molecular dynamics simulations, complemented by in vitro phase separation experiments. While many of the results appear solid, a number of technical details are lacking, the computational part appears incomplete and would benefit from additional analyses and clarifications, and the novelty of the study should also be clarified, particularly in comparison with the authors' previous work on MUT-16. Overall, the work will be of interest to biophysicists and molecular biologists studying phase separation and biomolecular condensates.

    2. Reviewer #1 (Public review):

      In this work, Gaurav et al. present an extensive study of phase-separated condensates formed by the foci-forming region (FFR) of the MUT-16 protein. The authors first report in vitro experiments showing that these condensates exhibit upper critical solution temperature (UCST) behavior. They then provide a detailed analysis based on atomistic simulations of MUT-16 FFR condensates, identifying key interactions responsible for LLPS, including salt bridges, cation-π interactions, and the role of Na⁺ ions.

      Overall, the manuscript is well written. However, there are several concerns that should be addressed.

      Major Concerns:

      (1) I have several questions regarding the system preparation that require clarification. The authors state that "65 copies of the coarse-grained MUT-16 FFR were embedded in a slab-shaped simulation," but it is not clear how this initial configuration was generated. Were the molecules randomly distributed in the simulation box, or were they initially arranged in a preformed condensate? Alternatively, were they randomly inserted and allowed to self-assemble into a condensate during NpT simulations?

      In Figure 1, the atomistic snapshot appears to show a well-defined condensate at the center of the simulation box. It would be important to clarify how this configuration was obtained: Was it generated from coarse-grained simulations starting from random initial conditions? Or was a preassembled condensate used as input?

      Related to this, how do the authors ensure that the simulations are equilibrated? While 20 μs appears to be a reasonably long simulation time for coarse-grained simulations, it would be useful to demonstrate equilibration explicitly. For example, the authors could plot the center-of-mass positions (in the long axis of the simulation box) of individual proteins over time to show that all molecules reach a steady state and remain within the condensate without systematic drift.
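
      The equilibration check suggested above can be sketched as a drift test on each protein's center-of-mass coordinate along the long box axis. This is a minimal illustration with hypothetical numbers; it assumes coordinates have already been unwrapped across periodic boundaries, and the function names are not from any simulation package.

```python
def center_of_mass_z(masses, z_coords):
    """Mass-weighted mean position along the long axis for one protein chain."""
    total = sum(masses)
    return sum(m * z for m, z in zip(masses, z_coords)) / total

def drift_slope(times, values):
    """Least-squares slope of values against time; near zero means no systematic drift."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var

# Hypothetical z center-of-mass trace (nm) for one chain over a 20 microsecond run.
times_us = [0.0, 5.0, 10.0, 15.0, 20.0]
z_com_nm = [12.0, 12.4, 11.8, 12.1, 11.9]
slope = drift_slope(times_us, z_com_nm)
```

      Repeating this for all 65 chains and confirming that the slopes scatter around zero, with no chain leaving the dense slab, would give the explicit evidence of steady state the reviewer asks for.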

      (2) The authors experimentally observe UCST behavior for these condensates. Do the coarse-grained or atomistic simulations reproduce this behavior?

      While atomistic simulations may be too computationally demanding to systematically explore temperature dependence, coarse-grained simulations could be used to test whether condensates are stable at lower temperatures and dissolve at higher temperatures. Such an analysis would provide valuable support for the experimental observations.

      (3) Regarding the analysis of ions, several points could be clarified and extended:

      a) It would be helpful to report the total number of ions and quantify how many are located inside vs. outside the condensate. While qualitative trends can be inferred from density profiles, quantitative analysis would strengthen the conclusions.

      b) It would also be interesting to analyze the number of contact ion pairs (e.g., Na⁺-Cl⁻ pairs), as described in J. Chem. Phys. 156, 044505 (2022). It is known that some ion models tend to overestimate ion pairing and underestimate solubility (e.g., J. Chem. Phys. 153, 010903 (2020)).

      c) In this context, the use of scaled-charge models has been shown to improve the description of ionic solutions and biomolecular systems (e.g., J. Phys. Chem. Lett. 2019, 10, 23, 7531-7536). I would suggest that, at least for one trajectory, the authors perform a test simulation using scaled charges (e.g., scaling by ~0.8) to evaluate whether ion distributions and protein-ion interactions are significantly affected.

      d) Finally, while the selected water model is known to be accurate, it would be useful to assess its performance for concentrated salt solutions. For example, the authors could estimate the density of a 6 m salt solution and compare it with experimental data or validated models (e.g., J. Chem. Phys. 151, 134504 (2019)). This would help clarify to what extent the conclusions depend on the chosen force field.

      Minor Concerns

      (1) In the Introduction, it would be helpful to elaborate further on the possible driving forces of LLPS in this region. Are there prior hypotheses or evidence pointing to specific interactions (e.g., cation-π, π-π, electrostatic interactions)? While this work addresses these questions, a brief discussion of previous experimental or theoretical insights would provide useful context.

      (2) On page 18, the authors state: "MUT-16 FFR satisfies the length (172 residues), aromatic content (20.35%), and Arg enrichment (85.71%) criteria. Its charge content (10.47%) and charge balance (38.89% positive charge fraction) are slightly below the nominal thresholds." It would be very helpful to include a schematic representation of the protein sequence highlighting these features (aromatic residues, charge distribution, etc.) in the corresponding figure, to provide a more intuitive understanding.

      (3) A question regarding ion hydration: What is the coordination environment of the ions that bridge proteins? Are they still hydrated by water molecules, or does the reduced water content inside the condensate significantly affect their solvation? Typically, Na⁺ and Cl⁻ ions have coordination numbers around 5-6 in aqueous solution. Do protein interactions and reduced solvent conditions within the condensate alter this coordination? A brief analysis or discussion would be valuable.

    3. Reviewer #2 (Public review):

      Summary:

      Gaurav et al. investigate residue-level interactions within the MUT-16 FFR condensate using all-atom molecular dynamics simulations. The authors first argue, based on sequence analysis, that MUT-16 FFR is more representative than the widely studied FUS LCD. They then characterize the UCST phase behavior of MUT-16 FFR experimentally, followed by a detailed analysis of residue-level contact frequencies and lifetimes. In addition, the manuscript examines ion-residue interactions and water-mediated interactions. Overall, this work provides a comprehensive view of the dynamic interactions within the MUT-16 FFR condensate.

      Strengths:

      Large-scale all-atom molecular dynamics simulations have been performed to investigate dynamical interactions within condensates. The analysis is comprehensive and rigorous, and the claims are strongly justified by the data.

      Weaknesses:

      The large amount of detail in the results section sometimes makes it difficult to identify the central take-home messages. I encourage the authors to more clearly highlight the principal findings and the physical insights that may generalize to other condensate-forming systems. The authors may also consider streamlining parts of the Results section to improve focus and readability.

    4. Reviewer #3 (Public review):

      Summary:

      The authors aim to characterize the molecular interaction network inside phase-separated condensates formed by the MUT-16 foci-forming region (FFR), using atomistic simulations combined with residue-resolved analyses of contact frequencies, contact lifetimes, specific non-covalent interactions, ions, and water.

      Strengths:

      The work addresses an interesting and biologically relevant system, and the combination of large-scale atomistic simulations with an extensive contact analysis has clear potential value for the broader condensate field.

      Weaknesses:

      In its current form, several technical issues need to be addressed before the main conclusions can be considered robust. Most importantly, the simulated sequence is 172 residues long, while the atomistic slab has box dimensions of only 12 nm in two directions. This length scale is comparable to the expected end-to-end distances of a disordered 172-residue chain. It is therefore not clear whether individual protein chains interact with their own periodic images, which could substantially affect overall chain dynamics and subsequently bias contact lifetimes, residue-residue interaction statistics, and the inferred condensate dynamics. The authors should check, for each chain, histograms of end-to-end distances. For chains for which more than ~2-3% of the end-to-end distances exceed ~11 nm, the authors should explicitly check for self-image interactions (for example, using "gmx mindist -pi") and report whether such interactions occur and for what fraction of the trajectory. Without this control, at least in the Supporting Information, I do not think the simulation-derived contact dynamics are sufficiently trustworthy.
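
      The screening step described above could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the terminal-residue coordinates are assumed to have been made whole (unwrapped across periodic boundaries) before the distance is computed, and the 11 nm threshold and 2% tolerance simply mirror the approximate values suggested in this review. Chains flagged here would then be passed to an explicit periodic-image check such as "gmx mindist -pi".

```python
import numpy as np

def fraction_exceeding(first_xyz, last_xyz, threshold_nm=11.0):
    """Fraction of frames whose end-to-end distance exceeds threshold_nm.

    first_xyz, last_xyz: (n_frames, 3) arrays in nm holding the
    positions of the first and last residue of one chain. The chain
    is assumed to be whole (not split by periodic wrapping).
    """
    first_xyz = np.asarray(first_xyz, dtype=float)
    last_xyz = np.asarray(last_xyz, dtype=float)
    dist = np.linalg.norm(last_xyz - first_xyz, axis=1)
    return float((dist > threshold_nm).mean())

def chains_to_check(per_chain_fraction, max_fraction=0.02):
    """Indices of chains whose exceedance fraction is above max_fraction."""
    return [i for i, f in enumerate(per_chain_fraction) if f > max_fraction]
```

      The per-chain distance arrays computed in `fraction_exceeding` are also what would be histogrammed for the Supporting Information.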

      A second major concern is the treatment of ions. The manuscript makes important conclusions about Na⁺ association and Na⁺-mediated bridging, but the atomistic ion model is not explicitly stated. This is a reproducibility problem and also affects interpretation - for example, standard Amber ions are known to bind too strongly to the oppositely charged residues. In their results, one acidic residue appears to interact on average with roughly two Na⁺ ions, which is not obviously expected from charge balance alone. The authors should state the exact Na⁺/Cl⁻ parameters used, justify their compatibility with TIP4P-D and the protein force field, and explicitly interpret why such a strong Na⁺ association with acidic residues is observed.

      More generally, because the manuscript is centered on contact lifetimes, the choice of the atomistic force field needs stronger justification. Salt bridges, cation-pi contacts, pi-pi stacking, ion coordination, and water-mediated interactions are all force-field-sensitive. Since there is no direct experimental observable used here to validate the simulations, the authors should discuss the expected limitations of the chosen force field (while I do acknowledge that testing different force fields would be computationally too demanding).

      I also find the sequence-comparison section somewhat confusing. The authors compare one specific IDR, MUT-16 FFR, with the average properties of human IDRs and then frame it as more representative than FUS LCD. It is not clear how informative this is because IDR behavior depends strongly on sequence-specific patterning, molecular connectivity, and the particular interaction network of each protein. Averages over human IDRs may provide a broad context, but they do not necessarily define what is physically or biologically representative for phase separation. In addition, FUS LCD is not intended to be a representative human IDR; it is an unusually low-complexity, phase-separating domain. Therefore, the "more representative than FUS" framing should be toned down. At most, this analysis shows that MUT-16 FFR is compositionally less extreme than FUS LCD.

      The ion- and water-bridging analyses are also potentially overinterpreted. A distance-based simultaneous contact with two residues does not by itself establish functional mediation or regulation of condensate dynamics. The authors should either add appropriate controls, such as local-density-normalized baselines or randomized-contact expectations, or soften the language to describe these as geometrically defined co-contact events rather than mechanistic bridging interactions.
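
      One simple form such a baseline could take is sketched below: the observed frequency with which an ion (or water molecule) contacts two residues simultaneously is compared with the frequency expected if the two contacts occurred independently. This is an assumed, deliberately minimal control and not a prescribed method. The function name and the per-frame boolean contact arrays are hypothetical inputs, and a local-density-normalized baseline would require additional information not shown here.

```python
import numpy as np

def cocontact_enrichment(contact_a, contact_b):
    """Compare observed co-contact frequency with an independence baseline.

    contact_a, contact_b: per-frame booleans stating whether the ion
    is within the contact cutoff of residue A and of residue B.
    Returns (observed, expected, ratio), where expected = P(A) * P(B)
    and ratio = observed / expected (NaN if expected is zero).
    """
    a = np.asarray(contact_a, dtype=bool)
    b = np.asarray(contact_b, dtype=bool)
    observed = float((a & b).mean())
    expected = float(a.mean() * b.mean())
    ratio = observed / expected if expected > 0 else float("nan")
    return observed, expected, ratio
```

      A ratio close to 1 would indicate that the co-contact events are no more frequent than expected from the two individual contact probabilities, which would support the more cautious "geometrically defined co-contact" wording.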

      Finally, the independence of the atomistic replicas is unclear. The manuscript should state whether all ten all-atom simulations were initiated from the same coarse-grained condensate configuration or from distinct CG frames. If the starting structures came from one CG trajectory, the authors should report how far apart those frames were in simulation time and provide evidence that the initial atomistic configurations are structurally independent. If only velocities differ, the simulations should not be described as fully independent structural replicas.

    1. eLife Assessment

      Overall, this is a manuscript with solid evidence that delivers an important community resource for those performing experimental research in amyotrophic lateral sclerosis. The authors address the lack of validated tools for the detection and quantification of proteins associated with amyotrophic lateral sclerosis (ALS) through an extensive screening of 303 commercially available antibodies to 33 protein targets. The effort invested in generating the knockout lines for validation experiments is a clear strength of the study.

    2. Reviewer #1 (Public review):

      Summary:

      The authors address the lack of validated tools for the detection and quantification of proteins associated with amyotrophic lateral sclerosis (ALS) through an extensive screening of 303 commercially available antibodies to 33 protein targets. Their ALS-Reproducible Antibody Platform (ALS-RAP) delivers a validated antibody toolbox for ALS research, which will provide an advantageous starting point for researchers in this field. Ayoubi R. et al. showcase the characterization workflow, presenting as an example the characterization of antibodies targeting Galectin-1, encoded by the LGALS1 gene. A selection of these antibodies was also used to profile protein levels across human induced pluripotent stem cell (iPSC)-derived and primary neurological cell types, and the findings support that the ALS disease mechanism involves both neuronal and glial cells.

      Strengths:

      The knockout (KO)-based approach is definitely the major strength of this study, providing a high level of confidence in the data collected in human induced pluripotent stem cell (iPSC)-derived and primary neurological cell types. The focus on renewable reagents (monoclonal and recombinant antibodies) is also important. The extensive characterization of this set of antibodies will benefit any scientist interested in any of the 33 target proteins, even in fields other than neuroscience.

      The authors perform an interesting protein profiling study assessing 27 proteins, comparing RNA and protein expression data, and using two independent WB preparations of the same cell types.

      The conclusions that can be drawn from this first assessment might not be final, but the data are compelling because they have been collected with reliable and validated antibodies.

      Another strength of this work is the data dissemination strategy, which includes the Only Good Antibodies (OGA) platform, where YCharOS data are curated and presented in an easy and intuitive manner that facilitates antibody selection by the end user for WB, IP and IF applications.

      Weaknesses:

      The authors mention the development of single-chain variable fragment (scFv) recombinant antibodies raised by the SGC against the six proteins (ANXA11, OPTN, MATR3, PFN1, UBQLN2 and VCP) for which few renewable antibodies are commercially available. The development was optimized to generate antibodies particularly suitable for IP, and the clone selection process was carried out using IP coupled to mass spectrometry. Even though the generation of these novel reagents is not the focus of this work, the authors do not provide any data on this aspect.

      The protein profiling study is limited to WB data, and the authors do not explain why IP and IF data were not integrated, not even for those targets that have validated antibodies. Also, not all the cell types have been screened by both chemiluminescence-based detection and fluorescence-based WB, and the authors do not elaborate on the reason for this choice.

    3. Reviewer #2 (Public review):

      Overall, this is a solid manuscript that delivers an important community resource. The execution is relatively simple, but the value is real, the work is rigorously performed, and the open dissemination through Zenodo, the F1000Research YCharOS Gateway and OGA is well executed. The effort invested in generating the knockout lines for validation experiments is a clear strength of the study. I have a number of comments that I think would strengthen the resource and the conclusions drawn from it.

      Below, I list specific points.

      (1) The rationale for the selection of these 33 genes is insufficient. The authors lean on the Nijs & Van Damme classification and on PubMed entry counts, but the number of PubMed entries is not a meaningful criterion for what constitutes an important ALS protein - some of the most disease-relevant genes are precisely those with fewer publications, while heavily cited genes such as CAV1 carry weak ALS-specific evidence. The authors should provide a more transparent and biologically motivated rationale for inclusion and exclusion (ClinGen evidence tier, replicated GWAS signals, large meta-analyses, ALSoD) and explain why specific risk genes outside this list were not part of ALS-RAP.

      (2) "107 of 231 (46%) demonstrated specific target staining in IF." The criteria used to define "specific target staining" at the IF level are not stated. From the Galectin-1 example, the mosaic WT/KO strategy provides a binary readout, but for proteins with low expression, weak punctate staining or unusual subcellular distributions, a single threshold is unlikely to capture specificity uniformly across 231 antibodies.

      (3) Several claims in the manuscript depend on differential protein abundance across cell types. As presented, these claims are supported by qualitative Western blot images only. They should be substantiated by quantification across multiple biological replicates.

      (4) This manuscript represents a unique opportunity to address antibody recognition of splicing variants, which is of considerable value to the community. For each target, the predicted isoforms in Ensembl could be cross-referenced against the observed bands, and the pattern of bands compared across cell types could be informative about which isoforms each antibody captures. This would convert ambiguous "extra bands" into useful biological information and would substantially increase the value of the resource. I strongly encourage the authors to include this analysis.

      (5) The iPSC-derived microglia receive a comprehensive QC panel (IBA1/PU.1 IF, CD45/CD11b flow, qRT-PCR for nine canonical markers; Figure S4), which allows the reader to assess culture purity. The other iPSC-derived lineages - motor neurons, dopaminergic neurons, oligodendrocytes and astrocytes - are validated by a single marker each in WB (Figure S3) without purity quantification. Given that several conclusions of the manuscript rest on the cell-type-specific detection of ALS-associated proteins, equivalent quality control should be performed for the other lineages so that the reader can evaluate the purity of each preparation.

      (6) The robustness of the resource would be substantially increased by validating at least a subset of the targets in a second iPSC background, in at least some of the cell types analysed.

      (7) The newly developed SGC scFv antibodies are arguably the most novel reagent contribution of this manuscript, yet they receive a single sentence in the body of the paper. A more thorough description is warranted.

      (8) Accessibility of the resource through Zenodo is not straightforward - the reader currently has to navigate to individual antibody characterization reports one by one to extract recommendations for a given target. While the use of an established public repository is important for permanence, a dedicated ALS-RAP website with an interactive, searchable interface - filterable by target, application, host species and clonality - would meaningfully improve uptake. The relationship between such a portal and the existing OGA platform should also be clarified.

    1. eLife Assessment

      Non-essential amino acids such as glutamine are known to be required for general T cell activation by sustaining basic biosynthetic processes, including nucleotide biosynthesis, ATP generation, and protein synthesis. In this important study, the authors found that extracellular asparagine (Asn) is required not only for T cells to fuel metabolic reprogramming in general, but also to produce helper T cell lineage-specific cytokines, for instance IL17. In particular, the importance of Asn in IL17 production was convincingly demonstrated in the mouse experimental autoimmune encephalomyelitis (EAE) model, which mimics human multiple sclerosis.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors reveal that the availability of extracellular asparagine (Asn) represents a metabolic vulnerability for the activation and differentiation of naive CD4+ T cells. To deplete extracellular Asn, they employed two orthogonal approaches: activating naive CD4+ T cells in either PEGylated asparaginase (PEG-AsnASE)-treated medium or custom-formulated RPMI medium specifically lacking Asn. Importantly, they demonstrate that Asn depletion not only impaired metabolic reprogramming associated with CD4+ T cell activation but also reduced CD4+ helper T cell lineage-specific cytokine production, thereby ameliorating the severity of experimental autoimmune encephalomyelitis.

      The experiments presented here are comprehensive and well-designed, providing compelling evidence for the conclusions. The conclusions will be important to the field.

      Comments on revised version:

      The authors have sufficiently addressed my previous comments. The manuscript represents an excellent contribution to the field.