  1. Mar 2025
    1. Reviewer #1 (Public review):

      Summary:

      This study builds on the emerging knowledge of trained immunity, in which innate immune cells exhibit enhanced inflammatory responses to a challenge after a prior insult. Trained immunity is now a very fast-evolving field and has been explored in diverse disease conditions and immune cell types. Earhart and the team approached the topic from a novel angle and were the first to explore a potential link to the complement system.

      The study focused on the central complement protein C3 and investigated how its signalling may modulate immune training in alveolar macrophages. The authors first performed in vivo experiments in C57BL mouse models to observe the presence of enhanced inflammation and C3a in BAL fluid following immune training. These changes were then compared with those from C3-deficient mice, which confirmed the involvement of C3a. This trained immunity was further validated in ex vivo experiments using primary alveolar macrophages, in which it was blunted by C3 deficiency and, intriguingly, rescued by adding exogenous C3 protein, but not C3a. The genetic-based findings were supported by pharmacological experiments using the C3aR antagonist SB290157. Mechanistically, transcriptomic analyses suggested the involvement of metabolism-linked, particularly glycolytic, genes, which was in agreement with an upregulation of glycolytic flux in WT but not C3-deficient macrophages.

      Collectively, these data suggest that C3, possibly through engaging with C3aR, contributes to trained immunity in alveolar macrophages.

      Strengths:

      The conclusions reached were well supported by in vivo and ex vivo experiments, encompassing both genetic-knockout animal models and pharmacological tools.

      The transcriptomic and cell metabolism studies provided valuable mechanistic insights.

      Weaknesses:

      For the in vivo experiments, the histopathological and other inflammatory markers (Figure 1) were not directly linked to alveolar macrophages by experimental evidence. Other innate immune cells (e.g., dendritic cells, neutrophils) and endothelial cells could also be involved in immune training and contribute to the pathological outcomes. These cells were not examined or mentioned in the study.

      For the ex vivo experiments assessing immune training in alveolar macrophages, only the release of selected inflammatory factors was measured. Macrophage activity encompasses multiple aspects (e.g., phagocytosis, ROS production, microbe killing), which should also be considered to better depict the effect of trained immunity.

      The proposed mechanism of C3 getting cleaved intracellularly and then binding to lysosomal C3aR needs to be further supported by experimental evidence.

      There was an absence of any validation in human-based models.

    2. Reviewer #2 (Public review):

      Earhart et al. investigated the role of the complement system in trained innate immunity (TII) in alveolar macrophages (AM). They used a WT and C3 knockout murine model primed with locally administered heat-killed P. aeruginosa (HKPA). Additionally, they employed ex vivo AM training models using C3 knockout mice, where reconstitution of C3 and blockade of C3R were performed. The study concluded that the C3-C3R axis is essential for inducing TII in macrophages in the ex vivo model. The manuscript is well-written and easy to follow. However, I have the following major concerns.

      (1) The secondary challenge to assess the reprogramming of innate cells in the BAL was conducted 14 days after the initial exposure to HKPA. However, no evidence is provided to confirm that homeostasis was re-established following the primary exposure. Demonstrating the resolution of acute inflammation is essential to ensure that the observed responses to the secondary challenge are not confounded by persistent inflammation from the initial exposure.

      (2) In Figure 1D, cytokine production by BAL cells from WT and C3KO mice after HKPA exposure and LPS challenge is shown. However, it is unclear whether the reduced response in trained C3KO mice is due to a defect in trained immunity or an intrinsic inability of C3KO cells to respond to LPS. To clarify this, the response of trained C3KO cells should also be compared to untrained C3KO controls after the LPS challenge. This comparison is necessary to determine if the reduction is specifically related to innate immune memory or a broader impairment in LPS responsiveness. Such control should be included in all ex vivo training and LPS stimulation experiments as well.

      (3) The data presented provide evidence of alterations in the functional and metabolic activities of innate cells in the lung, indicating the induction of innate immune memory in a C3-C3R axis-dependent pathway. However, it remains to be established whether such changes can lead to altered disease outcomes. Therefore, the impact of these changes should be demonstrated, for instance, through an infection model to support the claim made in the study that C3 modulates trained immunity in AMs through C3aR signalling.

      (4) Figure 3, panels B and C - stats should be shown for comparing WT-HKPA-trained and C3KO HKPA-trained.

      (5) In Figure 4, where the proper untrained C3KO is included, the data presented in Figure 4C show an increase in basal and maximum glycolysis in trained C3KO compared to their untrained control counterparts. Statistical analysis should be provided for this comparison. Based on these data, it appears that metabolic reprogramming occurs even in the absence of C3. Furthermore, C3KO cells intrinsically exhibit reduced glycolytic capacity compared to WT. These observations challenge the conclusions made in the manuscript. Therefore, without the proper control (untrained C3KO) included in all experimental approaches, it is impossible to draw an evidence-based conclusion that the C3-C3R axis plays a role in the induction of innate immune memory.

      (6) The Results and Discussion sections should be separated, and the results should be thoroughly analyzed in the context of published literature. Separating these sections will allow for a clearer presentation of findings and ensure that the discussion provides a comprehensive interpretation of the data.

    3. Author response:

      We thank both reviewers for their suggestions on improving our manuscript, which is focused on demonstrating that the C3a-C3aR axis modulates trained immune responses in alveolar macrophages. The Short Report format precludes separating the Results and Discussion sections. However, we will work towards a clearer presentation of findings and providing a more comprehensive interpretation of the data in the Revision, by addressing the points brought up by both Reviewers.

      We agree with the suggestions from Reviewer 1 that (1) other cell types such as dendritic cells, neutrophils, and endothelial cells can also be involved in immune training, and (2) macrophages have other activities beyond releasing inflammatory cytokines, and will clarify both these points in the Revision. The mechanism of C3 being cleaved intracellularly and binding to lysosomal C3aR involves cathepsin-dependent cleavage of C3 to C3a and has been experimentally proven (Liszewski et al. Immunity 2013). However, we will clarify this mechanism in the revision. We also acknowledge that the observations need to be validated in human-based models. Currently, we do not have access to an adequate representation of human alveolar macrophages for our ex vivo testing to account for individual-level variation in immune responses. However, we anticipate this work will form the basis of these future studies.

      We also appreciate Reviewer 2’s suggestions regarding demonstrating the resolution of acute inflammation after the initial exposure to heat-killed Pseudomonas. We will address this critique by performing additional experiments, which will be included in the Revision. We also agree that the responses of trained C3-deficient cells should be compared to untrained C3-deficient controls after the LPS challenge. We will include this data in the Revision, in addition to the requested data for Figures 3 and 4. We would like to clarify that we do not observe baseline differences between untrained C3-sufficient (wildtype) and C3-deficient alveolar macrophages, even in their glycolytic capacity, and thus, anticipate that our revised data will strengthen the conclusions from the original manuscript.

    1. eLife Assessment

      This manuscript provides valuable novel insights into the role of interpersonal guilt in social decision-making by showing that responsibility for a partner's bad lottery outcomes influences happiness. Through the integration of neuroimaging and computational modelling methods, and by combining findings from two studies, the authors provide solid support for their claims.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aimed to characterize neurocomputational signals underlying interpersonal guilt and responsibility. Across two studies, one behavioral and one fMRI, participants made risky economic decisions for themselves or for themselves and a partner; they also experienced a condition in which the partners made decisions for themselves and the participant. The authors also assessed momentary happiness intermittently between choices in the task. Briefly, results demonstrated that participants' self-reported happiness decreased after disadvantageous outcomes for themselves and when both they and their partner were affected; this effect was exacerbated when participants were responsible for their partner's low outcome, rather than the opposite, reflecting experienced guilt. Consistent with previous work, BOLD signals in the insula correlated with experienced guilt, and insula-right IFG connectivity was enhanced when participants made risky choices for themselves and safe choices for themselves and a partner.

      Strengths:

      This study implements an interesting approach to investigating guilt and responsibility; the paradigm in particular is well-suited to approach this question, offering participants the chance to make risky v. safe choices that affect both themselves and others. I appreciate the assessment of happiness as a metric for assessing guilt across the different task/outcome conditions, as well as the implementation of both computational models and fMRI.

      Weaknesses:

      In spite of the overall strengths of the study, I think there are a few areas in which the paper fell a bit short and could be improved.

      (1) While the framing and goal of this study were to investigate guilt and felt responsibility, the task implemented - a risky choice task with social conditions - has been conducted in similar ways in past research that was not addressed here. The novelty of this study would appear to be the additional happiness assessments, but it would be helpful to consider the changes noted in risk-taking behavior in the context of additional studies that have investigated changes in risky economic choice in social contexts (e.g., Arioli et al., 2023 Cerebral Cortex; Fareri et al., 2022 Scientific Reports).

      (2) The authors note they assessed changes in risk preferences between social and solo conditions in two ways - by calculating a 'risk premium' and then by estimating rho from an expected utility model. I am curious why the authors took both approaches (this did not seem clearly justified, though I apologize if I missed it). Relatedly, in the expected utility approach, the authors report that since 'the number of these types of trials varied across participants', they 'only obtained reliable estimates for [gain and loss] trials in some participants' - in study 1, 22 participants had unreliable estimates and in study 2, 28 participants had unreliable estimates. Because of this, and because the task itself only had 20 gains, 20 losses, and 20 mixed gambles per condition, I wonder if the authors can comment on how interpretable these findings are in the Discussion. Other work investigating loss aversion has implemented larger numbers of trials to mitigate the potential for unreliable estimates (e.g., Sokol-Hessner et al., 2009).

      (3) One thing seemingly not addressed in the Discussion is the fact that the behavioral effect did not replicate significantly in study 2.

      (4) Regarding the computational models, the authors suggest that the Responsibility and Responsibility Redux models provided the best fit, but they are claiming this based on separate metrics (e.g., in study 1, the redux model had the lowest AIC, but the responsibility only model had the highest R^2; additionally, the basic model had the lowest BIC). I am wondering if the authors considered conducting a direct model comparison to statistically compare model fits.

      (5) In the reporting of imaging results, the authors report in a univariate analysis that a small cluster in the left anterior insula showed a stronger response to low outcomes for the partner as a result of participant choice rather than from partner choice. It then seems as though the authors performed small volume correction on this cluster to see whether it survived. If that is accurate, then I would suggest that this result be removed because it is not recommended to perform SVC where the volume is defined based on a result from the same whole-brain analysis (i.e., it should be done a priori).
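
      Point (2) above turns on two quantities, a risk premium and the exponent rho of an expected utility model. The review does not reproduce the manuscript's definitions, so what follows is only a minimal sketch of the textbook versions, assuming a power utility u(x) = x^rho for non-negative amounts. The function names and the example gamble are hypothetical and are not taken from the authors' code.

```python
def utility(x, rho):
    """Power utility for a non-negative amount x (assumed form: x ** rho)."""
    return x ** rho


def expected_value(outcomes, probs):
    """Probability-weighted mean of the outcomes."""
    return sum(p * x for x, p in zip(outcomes, probs))


def certainty_equivalent(outcomes, probs, rho):
    """Sure amount whose utility equals the gamble's expected utility."""
    expected_utility = sum(p * utility(x, rho) for x, p in zip(outcomes, probs))
    return expected_utility ** (1.0 / rho)


def risk_premium(outcomes, probs, rho):
    """Expected value minus certainty equivalent (positive if risk averse)."""
    return expected_value(outcomes, probs) - certainty_equivalent(outcomes, probs, rho)


# Hypothetical 50/50 gamble between 0 and 100.
gamble = ([0.0, 100.0], [0.5, 0.5])
print(risk_premium(*gamble, rho=0.5))  # risk averse: premium of about 25
print(risk_premium(*gamble, rho=1.0))  # risk neutral: premium of about 0
```

      Under these assumptions, rho < 1 yields a positive premium for gain gambles and rho = 1 yields none, which is why the two measures can be read as alternative summaries of the same preference. Estimating rho from choices needs enough trials per gamble type, which is the reliability concern the reviewer raises.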

    3. Reviewer #2 (Public review):

      Summary

      This manuscript focuses on the role of social responsibility and guilt in social decision-making by integrating neuroimaging and computational modeling methods. Across two studies, participants completed a lottery task in which they made decisions for themselves or for a social partner. By measuring momentary happiness throughout the task, the authors show that being responsible for a partner's bad lottery outcome leads to decreased happiness compared to trials in which the participant was not responsible for their partner's bad outcome. At the neural level, this guilt effect was reflected in increased neural activity in the anterior insula, and altered functional connectivity between the insula and the inferior frontal gyrus. Using computational modeling, the authors show that trial-by-trial fluctuations in happiness were successfully captured by a model including participant and partner rewards and prediction errors (a 'responsibility' model), and model-based neuroimaging analyses suggested that prediction errors for the partner were tracked by the superior temporal sulcus. Taken together, these findings suggest that responsibility and interpersonal guilt influence social decision-making.

      Strengths

      This manuscript investigates the concept of guilt in social decision-making through both statistical and computational modeling. It integrates behavioral and neural data, providing a more comprehensive understanding of the psychological mechanisms. For the behavioral results, data from two different studies is included, and although minor differences are found between the two studies, the main findings remain consistent. The authors share all their code and materials, leading to transparency and reproducibility of their methods.

      The manuscript is well-grounded in prior work. The task design is inspired by a large body of previous work on social decision-making and includes the necessary conditions to support their claims (i.e., Solo, Social, and Partner conditions). The computational models used in this study are inspired by previous work and build on well-established economic theories of decision-making. The research question and hypotheses clearly extend previous findings, and the more traditional univariate results align with prior work.

      The authors conducted extensive analyses, as supported by the inclusion of different linear models and computational models described in the supplemental materials. Psychological concepts like risk preferences are defined and tested in different ways, and different types of analyses (e.g., univariate and multivariate neuroimaging analyses) are used to try to answer the research questions. The inclusion and comparison of different computational models provide compelling support for the claim that partner prediction errors indeed influence task behavior, as illustrated by the multiple model comparison metrics and the good model recovery.

      Weaknesses

      As the authors already note, they did not directly ask participants to report their feelings of guilt. The decrease in happiness reported after a bad choice for a partner might thus be something else than guilt, for example, empathy or feelings of failure (not necessarily related to guilt towards the other person). Although the patterns of neural activity evoked during the task match with previously found patterns of guilt, there is no direct measure of guilt included in the task. This warrants caution in the interpretation of these findings as guilt per se.

      As most comparisons contrast the social condition (making the decision for your partner) against either the partner condition (watching your partner make their decision) or the solo condition (making your own decision), an open question remains of how agency influences momentary happiness, independent of potential guilt. Other open questions relate to individual differences in interpersonal guilt, and how those might influence behavior.

      This manuscript is an impressive combination of multiple approaches, but how these different approaches relate to each other and how they can aid in answering slightly different questions is not very clearly described. The authors could improve this by more clearly describing the different methods and their added value in the introduction, and/or by including a paragraph on implications, open questions, and future work in the discussion.

      However, taken together, this study provides useful insights into the neural and behavioral mechanisms of responsibility and guilt in social decision-making, and how they influence behavior.

    4. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors aimed to characterize neurocomputational signals underlying interpersonal guilt and responsibility. Across two studies, one behavioral and one fMRI, participants made risky economic decisions for themselves or for themselves and a partner; they also experienced a condition in which the partners made decisions for themselves and the participant. The authors also assessed momentary happiness intermittently between choices in the task. Briefly, results demonstrated that participants' self-reported happiness decreased after disadvantageous outcomes for themselves and when both they and their partner were affected; this effect was exacerbated when participants were responsible for their partner's low outcome, rather than the opposite, reflecting experienced guilt. Consistent with previous work, BOLD signals in the insula correlated with experienced guilt, and insula-right IFG connectivity was enhanced when participants made risky choices for themselves and safe choices for themselves and a partner.

      Strengths:

      This study implements an interesting approach to investigating guilt and responsibility; the paradigm in particular is well-suited to approach this question, offering participants the chance to make risky v. safe choices that affect both themselves and others. I appreciate the assessment of happiness as a metric for assessing guilt across the different task/outcome conditions, as well as the implementation of both computational models and fMRI.

      We thank Reviewer 1 for their positive assessment of our manuscript.

      Weaknesses:

      In spite of the overall strengths of the study, I think there are a few areas in which the paper fell a bit short and could be improved.

      We are looking forward to improving our manuscript based on the Reviewers’ comments. According to eLife’s policy, here are our provisional replies as well as plans for changes.

      (1) While the framing and goal of this study were to investigate guilt and felt responsibility, the task implemented - a risky choice task with social conditions - has been conducted in similar ways in past research that was not addressed here. The novelty of this study would appear to be the additional happiness assessments, but it would be helpful to consider the changes noted in risk-taking behavior in the context of additional studies that have investigated changes in risky economic choice in social contexts (e.g., Arioli et al., 2023 Cerebral Cortex; Fareri et al., 2022 Scientific Reports).

      We certainly agree that several previously published studies have relied on risky choice tasks with social conditions. We will happily refer to the studies mentioned when discussing changes in risk-taking behaviour in our revised manuscript.

      (2) The authors note they assessed changes in risk preferences between social and solo conditions in two ways - by calculating a 'risk premium' and then by estimating rho from an expected utility model. I am curious why the authors took both approaches (this did not seem clearly justified, though I apologize if I missed it). Relatedly, in the expected utility approach, the authors report that since 'the number of these types of trials varied across participants', they 'only obtained reliable estimates for [gain and loss] trials in some participants' - in study 1, 22 participants had unreliable estimates and in study 2, 28 participants had unreliable estimates. Because of this, and because the task itself only had 20 gains, 20 losses, and 20 mixed gambles per condition, I wonder if the authors can comment on how interpretable these findings are in the Discussion. Other work investigating loss aversion has implemented larger numbers of trials to mitigate the potential for unreliable estimates (e.g., Sokol-Hessner et al., 2009).

      We agree that we have not clearly justified why we have taken two approaches to assess risk preferences. In short, both approaches have advantages and drawbacks when applied to our experiment. We will happily detail our reasons in the revised manuscript. Regarding the second point of this comment: the small number of reliable estimates is one of the reasons that we have used another approach to assess risk preferences. We would certainly have obtained more reliable estimates if we had implemented more trials. We will discuss the interpretability of all the risk preference estimates we used in the revised Discussion.

      (3) One thing seemingly not addressed in the Discussion is the fact that the behavioral effect did not replicate significantly in study 2.

      We agree that we could have discussed in more depth the fact that there were (slight but significant) differences in risk preferences between the Solo and Social conditions in Study 1 but not in Study 2. While the absence of a significant difference in Study 2 is helpful to compare the neural mechanisms involved in making decisions for oneself vs. for oneself and another person (because any differences could not be explained by differences in risk preferences), we certainly should expand our discussion of the differences in findings between the two studies, which we will do in the revised manuscript.

      (4) Regarding the computational models, the authors suggest that the Responsibility and Responsibility Redux models provided the best fit, but they are claiming this based on separate metrics (e.g., in study 1, the redux model had the lowest AIC, but the responsibility only model had the highest R^2; additionally, the basic model had the lowest BIC). I am wondering if the authors considered conducting a direct model comparison to statistically compare model fits.

      We agree that we should run formal, direct model comparison tests using for example chi-square or log-likelihood-ratio tests. We will do so in the revised manuscript.

      (5) In the reporting of imaging results, the authors report in a univariate analysis that a small cluster in the left anterior insula showed a stronger response to low outcomes for the partner as a result of participant choice rather than from partner choice. It then seems as though the authors performed small volume correction on this cluster to see whether it survived. If that is accurate, then I would suggest that this result be removed because it is not recommended to perform SVC where the volume is defined based on a result from the same whole-brain analysis (i.e., it should be done a priori).

      As indicated in the manuscript, the small insula cluster centered at [-28 24 -4] and shown in Figure 4F survived corrections for multiple tests within the anatomically-defined anterior insula (based on the anatomical maximum probability map described in Faillenot et al., 2017), which is independent of the result of our analysis. We agree that one should not (and we did not) perform multiple corrections based on the results one is correcting – that would indeed be circular and misleading “double-dipping”. The anterior insula is one of the regions most frequently associated with guilt (see the explanations in our Introduction, which refers for example to Bastin et al., 2016; Lamm & Singer, 2010; Piretti et al., 2023). Thus we feel that performing small-volume correction within the anatomically-defined anterior insula is an acceptable approach to correct for multiple tests in this case. We fully acknowledge that, independently of any correction, the effect and the cluster are small. We will clarify these explanations in the revised manuscript.
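
      The reply to point (4) above mentions formal model comparison with chi-square or log-likelihood-ratio tests. As a rough illustration only, the sketch below computes AIC, BIC, and a likelihood-ratio test from log-likelihoods using the Python standard library. It assumes the simpler model is nested within the richer one, which a likelihood-ratio test requires and which the review does not confirm for every model pair. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not values from the manuscript.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(ll_nested, ll_full, extra_params):
    """Return (statistic, p_value) for a nested-model comparison."""
    stat = 2.0 * (ll_full - ll_nested)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical fits: a basic model (3 parameters) nested in a richer one (5).
ll_basic, ll_rich, n_obs = -412.0, -405.5, 300
print(aic(ll_basic, 3), aic(ll_rich, 5))
print(bic(ll_basic, 3, n_obs), bic(ll_rich, 5, n_obs))
print(likelihood_ratio_test(ll_basic, ll_rich, extra_params=2))
```

      Because BIC penalises extra parameters more heavily than AIC once n exceeds about 8 observations, the two criteria can disagree, as the reviewer observed. For models that are not nested, information criteria or cross-validation would be the more appropriate comparison.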

      Reviewer #2 (Public review):

      Summary

      This manuscript focuses on the role of social responsibility and guilt in social decision-making by integrating neuroimaging and computational modeling methods. Across two studies, participants completed a lottery task in which they made decisions for themselves or for a social partner. By measuring momentary happiness throughout the task, the authors show that being responsible for a partner's bad lottery outcome leads to decreased happiness compared to trials in which the participant was not responsible for their partner's bad outcome. At the neural level, this guilt effect was reflected in increased neural activity in the anterior insula, and altered functional connectivity between the insula and the inferior frontal gyrus. Using computational modeling, the authors show that trial-by-trial fluctuations in happiness were successfully captured by a model including participant and partner rewards and prediction errors (a 'responsibility' model), and model-based neuroimaging analyses suggested that prediction errors for the partner were tracked by the superior temporal sulcus. Taken together, these findings suggest that responsibility and interpersonal guilt influence social decision-making.

      Strengths

      This manuscript investigates the concept of guilt in social decision-making through both statistical and computational modeling. It integrates behavioral and neural data, providing a more comprehensive understanding of the psychological mechanisms. For the behavioral results, data from two different studies is included, and although minor differences are found between the two studies, the main findings remain consistent. The authors share all their code and materials, leading to transparency and reproducibility of their methods.

      The manuscript is well-grounded in prior work. The task design is inspired by a large body of previous work on social decision-making and includes the necessary conditions to support their claims (i.e., Solo, Social, and Partner conditions). The computational models used in this study are inspired by previous work and build on well-established economic theories of decision-making. The research question and hypotheses clearly extend previous findings, and the more traditional univariate results align with prior work.

      The authors conducted extensive analyses, as supported by the inclusion of different linear models and computational models described in the supplemental materials. Psychological concepts like risk preferences are defined and tested in different ways, and different types of analyses (e.g., univariate and multivariate neuroimaging analyses) are used to try to answer the research questions. The inclusion and comparison of different computational models provide compelling support for the claim that partner prediction errors indeed influence task behavior, as illustrated by the multiple model comparison metrics and the good model recovery.

      We thank Reviewer 2 very much for their comprehensive description of our study and the positive assessment of our study and approach.

      Weaknesses

      As the authors already note, they did not directly ask participants to report their feelings of guilt. The decrease in happiness reported after a bad choice for a partner might thus be something else than guilt, for example, empathy or feelings of failure (not necessarily related to guilt towards the other person). Although the patterns of neural activity evoked during the task match with previously found patterns of guilt, there is no direct measure of guilt included in the task. This warrants caution in the interpretation of these findings as guilt per se.

      We fully agree that not directly asking participants about feelings of guilt is a clear limitation of our study. While we already mention this in our Discussion, we will happily expand our discussion of the consequences on interpretation of our results along the lines described by the reviewer in the revised manuscript. We would like to thank Reviewer 2 for proposing these lines of thought.

      As most comparisons contrast the social condition (making the decision for your partner) against either the partner condition (watching your partner make their decision) or the solo condition (making your own decision), an open question remains of how agency influences momentary happiness, independent of potential guilt. Other open questions relate to individual differences in interpersonal guilt, and how those might influence behavior.

      We fully agree that the way agency influences happiness has not been much discussed in our manuscript so far, and we would happily do so in the revised manuscript. The same goes for individual differences in interpersonal guilt which we have not investigated due to our relatively small sample sizes but would certainly be worth investigation in subsequent work.

      This manuscript is an impressive combination of multiple approaches, but how these different approaches relate to each other and how they can aid in answering slightly different questions is not very clearly described. The authors could improve this by more clearly describing the different methods and their added value in the introduction, and/or by including a paragraph on implications, open questions, and future work in the discussion.

      We again thank the reviewer for their praise of our approach and fully agree that we can improve the description of the benefit of combining methods in the Introduction, which we will do in the revised manuscript. We will also include a paragraph on implications, open questions, and future work in the Discussion of the revised manuscript.

      However, taken together, this study provides useful insights into the neural and behavioral mechanisms of responsibility and guilt in social decision-making, and how they influence behavior.

      We again thank Reviewer 2 for their attentive reading and thoughtful comments and look forward to submitting our revised and improved manuscript.

    1. Reviewer #2 (Public review):

      Summary:

      This study investigates the effect of a fed vs hungry state on food decision-making.

      70 participants performed a computerized food choice task with eye tracking. Food images came from a validated set with variability in food attributes. Foods ranged from low caloric density unprocessed (fruits) to high caloric density processed foods (chips and cookies).

      Prior to the choice task participants rated images for taste, health, wanting, and calories. In the choice task participants simply selected one of two foods. They were told to pick the one they preferred. Screens consisted of two food pictures along with their "Nutri-Score". They were told that one preferred food would be available for consumption at the end.

      A drift-diffusion model (DDM) was fit to the reaction time values. Eye tracking was used to measure dwell time on each part of the monitor.

      Findings:

      Participants tended to select the item they had rated as "tastier", however, health also contributed to decisions.

      Strengths:

      The most interesting and innovative aspect of the paper is the use of the DDM models to infer from reaction time and choice the relative weight of the attributes.

      Were the ratings redone at each session? E.g. were all tastiness ratings for the sated session made while sated? This is relevant as one would expect the ratings of tastiness and wanting to be affected by the current fed state.

      Weaknesses:

      My main criticism, which doesn't affect the underlying results, is that the labeling of food choices as being taste- or health-driven is misleading. Participants were not cued to select health vs taste. Studies in which people were cued to select for taste vs health exist (and are cited here). Also, the label "healthy" is misleading, as here it seems to be strongly related to caloric density. A high-calorie food is not intrinsically unhealthy (even if people rate it as such). The suggestion that hunger impairs making healthy decisions is not quite the correct interpretation of the results here (even though everyone knows it to be true). Another interpretation is that hungry people in negative calorie balance simply prefer more calories.

    2. Author response:

      Reviewer 1:

      (1) We appreciate the reviewer’s suggestion to test a multi-attribute attentional drift-diffusion model (maaDDM) that does not constrain the taste and health weights to the range of 0 and 1 and will test such a model.

      (2) Similarly, we will follow the reviewer’s suggestion to address potential demand effects. First, we will add “order” (binary: hungry-sated or sated-hungry) as a predictor to our GLMM, to test for potential systematic effects of order on choices and response times. Second, we will split the participants by “order” and examine whether we see group differences in tasty and healthy decisions within the first testing session. Note that we already anticipate that looking at only 50% of the data and testing for a between-subject rather than a within-subject effect is likely to reduce effect size and statistical sensitivity.

      (3) We thank the reviewer for their observant remark about faster tasty choices and potential markers in the drift rate. While our starting point models show that there might be a small starting point bias towards the taste boundary, which results in faster decisions, we will take a closer look at the simulated value differences obtained in our posterior predictive checks to see if the drift rate is systematically more extreme for tasty choices.

      (4) Regarding the mtDDM, we will verify that the relative starting time (rst) effects are minuscule. While we will follow the recommendation of correlating first fixations with rst, we would like to point out that a majority of fixations (see Figure 3b) and first fixations (see Figure S6b) are on food images. We will also provide a parameter recovery of the mtDDM.

      Reviewer 2:

      (1) We would like to verify the reviewer’s interpretation that hungry people in negative calorie balance simply prefer more calories and would like to point to our supplementary analyses, in which we show that hunger state also increases the probability of higher-wanted and higher-caloric decisions (see SOM4, SOM5, Figure S4). Moreover, we agree that high-caloric items might not be unhealthy and are happy to report the correlations between health ratings and objective caloric content, to demonstrate the strong negative correlation in our dataset, which our principal component analyses hint at, too.

      Reviewer 3:

      (1) We agree that choosing tasty over healthy options under hunger may be evolutionarily adaptive. We will address the adaptiveness of this hunger-driven mechanism in our discussion, reiterating the differentiation made in the introduction that this system may no longer be adaptive in our obesogenic environment, leading to suboptimal decisions.

      (2) We will address alternative explanations of the observed effects in our discussion with respect to the macro-nutritional content of the shake and potential placebo effects arising from the shake vs no-shake manipulation.

    3. Reviewer #3 (Public review):

      Summary:

      This well-powered study tested the effects of hunger on value-based dietary decision-making. The main hypothesis was that attentional mechanisms guide choices toward unhealthier and tastier options when participants are hungry and are in the fasted state compared to satiated states. Participants were tested twice - in a fasted state and in a satiated state after consuming a protein shake. Attentional mechanisms were measured during dietary decision-making by linking food choices and reaction times to eye-tracking data and mathematical drift-diffusion models. The results showed that hunger makes high-conflict food choices more taste-driven and less health-driven. This effect was formally mediated by relative dwell time, which approximates attention drawn to chosen relative to unchosen options. Computational modeling showed that a drift-diffusion model, which assumed that food choices result from a noisy accumulation of evidence from multiple attributes (i.e., taste and health) and discounted non-looked attributes and options, best explained observed choices and reaction times.
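      To make the model class described above concrete, the following is a minimal sketch of a multi-attribute attentional drift-diffusion process, assuming a simplified form in which the non-fixated option is discounted by `theta` and the non-fixated attribute by `phi`. The function names, parameter names, default values, and fixation format are illustrative assumptions, not the authors' fitted model or code.

```python
import random


def momentary_drift(left, right, look_option, look_attr,
                    w_taste=0.7, theta=0.5, phi=0.5, d=1.0):
    """Drift toward the left option given the current fixation.

    left/right: dicts with 'taste' and 'health' values (hypothetical scale).
    look_option: 'left' or 'right'; look_attr: 'taste' or 'health'.
    """
    # Discount the option that is not currently fixated.
    o_left = 1.0 if look_option == "left" else theta
    o_right = 1.0 if look_option == "right" else theta
    # Discount the attribute that is not currently fixated.
    a_taste = 1.0 if look_attr == "taste" else phi
    a_health = 1.0 if look_attr == "health" else phi
    taste_term = a_taste * (o_left * left["taste"] - o_right * right["taste"])
    health_term = a_health * (o_left * left["health"] - o_right * right["health"])
    return d * (w_taste * taste_term + (1.0 - w_taste) * health_term)


def simulate_trial(left, right, fixations, boundary=1.0, noise_sd=1.0,
                   dt=0.01, max_time=10.0, seed=None, **params):
    """Accumulate noisy evidence until a boundary is hit.

    fixations: list of (look_option, look_attr, duration_seconds); the last
    fixation is held until a decision is made or max_time elapses.
    Returns (choice, reaction_time); choice is None if no boundary is reached.
    """
    rng = random.Random(seed)
    evidence, t = 0.0, 0.0
    idx, time_in_fix = 0, 0.0
    while t < max_time:
        option, attr, duration = fixations[idx]
        v = momentary_drift(left, right, option, attr, **params)
        # Euler step: deterministic drift plus Gaussian diffusion noise.
        evidence += v * dt + noise_sd * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
        time_in_fix += dt
        if evidence >= boundary:
            return "left", t
        if evidence <= -boundary:
            return "right", t
        # Move to the next fixation once the current one has run its course.
        if time_in_fix >= duration and idx < len(fixations) - 1:
            idx, time_in_fix = idx + 1, 0.0
    return None, t
```

      In this sketch, a larger discount on health (a smaller `phi` while taste is fixated) pushes the drift toward the tastier option, which is one way the reported hunger effect could arise under these assumptions.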

      Strengths:

      This study's findings are valuable for understanding how energy states affect decision-making and provide an answer to how hunger can lead to unhealthy choices. These insights are relevant to psychology, behavioral economics, and behavioral change intervention designs.

      The study has a well-powered sample size and hypotheses were pre-registered. The analyses comprised classical linear models and non-linear computational modeling to offer insight into putative cognitive mechanisms.

      In summary, the study advances the understanding of the links between energy states and value-based decision-making by showing that energy depletion powerfully shapes the formation of food preferences. Moreover, the computational analysis offers a plausible mechanistic explanation of the observed effects at the algorithmic level.

      Weaknesses:

      Some parts of the positioning of the hunger state manipulation and the interpretation of its effects could be improved.

      On the positioning side, it does not seem like a 'bad' decision to replenish energy states when hungry by preferring tastier, more often caloric options. In this sense, it is unclear whether the observed behavior in the fasted state is a fallacy or a response to signals from the body. The introduction does mention these two aspects of preferring more caloric food when hungry. However, some ambiguity remains about whether the study results indeed reflect suboptimal choice behavior or a healthy adaptive behavior to restore energy stores.

      On the interpretation side, previous work has shown that beliefs about the nourishing and hunger-killing effectiveness of drinks or substances influence subjective and objective markers of hunger, including value-based dietary decision-making, attentional mechanisms approximated by computational models, and the activation of cognitive control regions in the brain. The present study shows differences between the protein shake and a natural history condition (fasted state). This experimental design, however, cannot discriminate between alternative interpretations of the observed effects. Notably, effects could be due to (a) the drink's active, nourishing ingredients, (b) consuming a drink versus nothing, or (c) both.

    4. eLife Assessment

      This is an important study showing that people who are hungry (vs. sated) put more weight on taste (vs. health) in their food choices. The experiment is well-designed and includes choice behavior, eye-tracking, and state-of-the-art computational modeling, resulting in compelling evidence supporting the conclusions. The manuscript could be further improved through appropriate revisions to data analysis and interpretation.

    5. Reviewer #1 (Public review):

      Summary:

      In this article, the authors set out to understand how people's food decisions change when they are hungry vs. sated. To do so, they used an eye-tracking experiment where participants chose between two food options, each presented as a picture of the food plus its "Nutri-Score". In both conditions, participants fasted overnight, but in the sated condition, participants received a protein shake before making their decisions. The authors find that participants in the hungry condition were more likely to choose the tastier option. Using variants of the attentional drift-diffusion model, they further find that the best-fitting model has different attentional discounts on the taste and health attributes and that the attentional discount on the health information was larger for the hungry participants.

      Strengths:

      The article has many strengths. It uses a food-choice paradigm that is established in neuroeconomics. The experiment uses real foods, with accurate nutrition information, and incentivized choices. The experimental manipulation is elegant in its simplicity - administering a high-calorie protein shake. It is also commendable that the study was within-participant. The experiment also includes hunger and mood ratings to confirm the effectiveness of the manipulation. The modeling work is impressive in its rigor - the authors test 9 different variants of the DDM, including recent models like the mtDDM and maaDDM, as well as some completely new variants (maaDDM2phi and 2phisp). The model fits decisively favor the maaDDM2phi.

      Weaknesses:

      First, in examining some of the model fits in the supplements, e.g. Figures S9, S10, S12, S13, it looks like the "taste weight" parameter is being constrained below 1. Theoretically, I understand why the authors imposed this constraint, but it might be unfairly penalizing these models. In theory, the taste weight could go above 1 if participants had a negative weight on health. This might occur if there is a negative correlation between attractiveness and health and the taste ratings do not completely account for attractiveness. I would recommend eliminating this constraint on the taste weight.
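      The arithmetic behind this point can be sketched as follows, assuming the health weight is defined as one minus the taste weight (an assumption about the parameterization; the function names and numbers are hypothetical).

```python
def weighted_value_difference(taste_diff, health_diff, w_taste):
    """Combine attribute differences with weights that sum to 1."""
    w_health = 1.0 - w_taste  # becomes negative once w_taste exceeds 1
    return w_taste * taste_diff + w_health * health_diff


def constrain(w_taste, lower=0.0, upper=1.0):
    """Clip the taste weight to the bounded range used in a constrained fit."""
    return min(max(w_taste, lower), upper)


# Hypothetical participant who is drawn to tasty items and repelled by healthy ones.
unconstrained = weighted_value_difference(1.0, -1.0, w_taste=1.2)
constrained = weighted_value_difference(1.0, -1.0, w_taste=constrain(1.2))
```

      Here the unconstrained weight yields a larger value difference than the clipped one, so a model restricted to the range of 0 to 1 would underestimate how strongly such a participant favors the tasty option.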

      Second, I'm not sure about the mediation model. Why should hunger change the dwell time on the chosen item? Shouldn't this model instead focus on the dwell time on the tasty option?

      Third, while I do appreciate the within-participant design, it does raise a small concern about potential demand effects. I think the authors' results would be more compelling if they replicated when only analyzing the first session from each participant. Along similar lines, it would be useful to know whether there was any effect of order.

      Fourth, the authors report that tasty choices are faster. Is this a systematic effect, or simply due to the fact that tasty options were generally more attractive? To put this in the context of the DDM, was there a constant in the drift rate, and did this constant favor the tasty option?

      Fifth, I wonder about the mtDDM. What are the units on the "starting time" parameters? Seconds? These seem like minuscule effects. Do they align with the eye-tracking data? In other words, which attributes did participants look at first? Was there a correlation between the first fixations and the relative starting times? If not, does that cast doubt on the mtDDM fits? Did the authors do any parameter recovery exercises on the mtDDM?

    1. eLife Assessment

      The manuscript by Mancl et al. provides valuable mechanistic insights into the conformational dynamics of Insulin Degrading Enzyme (IDE), a zinc metalloprotease involved in the clearance of various bioactive peptides. Supported by a convincing combination of cryo-EM, SEC-SAXS, enzymatic assays, and molecular dynamics simulations, the study characterizes the dynamic transitions between IDE's open and closed states in the presence of a sub-saturating concentration of insulin. This work contributes to a refined model of IDE's functional cycle, enhancing our understanding of its role in proteolysis.

    2. Reviewer #1 (Public review):

      Summary:

      Mancl et al. present cryo-EM structures of the Insulin Degrading Enzyme (IDE) dimer and characterize its conformational dynamics by integrating structures with SEC-SAXS, enzymatic activity assays, and all-atom molecular dynamics (MD) simulations. They present five cryo-EM structures of the IDE dimer at 3.0-4.1 Å resolution, obtained with one of its substrates, insulin, added to IDE in a 1:2 ratio. The study identified R668 as a key residue mediating the open-close transition of IDE, a finding supported by simulations and experimental data. The work offers a refined model for how IDE recognizes and degrades amyloid peptides, incorporating the roles of IDE-N rotation and charge-swapping events at the IDE-N/C interface.

      Strengths:

      The study by Mancl et al. uses a combination of experimental (cryoEM, SEC-SAXS, enzymatic assays) and computational (MD simulations, multibody analysis, 3DVA) techniques to provide a comprehensive characterization of IDE dynamics. The identification of R668 as a key residue mediating the open-to-close transition of IDE is a novel finding, supported by both simulations and experimental data presented in the manuscript. The work offers a refined model for how IDE recognizes and degrades amyloid peptides, incorporating the roles of IDE-N rotation and charge-swapping events at the IDE-N/C interface. The study identifies the structural basis and key residues for IDE dynamics that were not revealed by static structures.

      Weaknesses:

      Based on MD simulations and enzymatic assays of IDE, the authors claim that the R668A mutation in IDE affects the conformational dynamics governing the open-closed transition, which leads to altered substrate binding and catalysis. The functional importance of R668 would be better substantiated by enzymatic assays that included known IDE substrates other than insulin, such as amylin and glucagon.

      It is unclear to what extent the force field (FF) employed in the MD simulations favors secondary structures and if the lack of any observed structural changes within the IDE domains in the simulations - which is taken to suggest that the domains behave as rigid bodies - stems from bias by the FF.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript describes various conformational states and structural dynamics of the Insulin degrading enzyme (IDE), a zinc metalloprotease by nature. Both open and closed-state structures of IDE have been previously solved using crystallography and cryo-EM which reveal a dimeric organization of IDE where each monomer is organized into N and C domains. C-domains form the interacting interface in the dimeric protein while the two N-domains are positioned on the outer sides of the core formed by C-domains. It remains elusive how the open state is converted into the closed state but it is generally accepted that it involves large-scale movement of N-domains relative to the C-domains. The authors here have used various complementary experimental techniques such as cryo-EM, SAXS, size-exclusion chromatography, and enzymatic assays to characterize the structure and dynamics of IDE protein in the presence of substrate protein insulin whose density is captured in all the structures solved. The experimental structural data from cryo-EM suffered from a high degree of intrinsic motion among the different domains and consequently, the resultant structures were moderately resolved at 3-4.1 Å resolution. A total of five structures were generated by cryo-EM. The authors have extensively used Molecular dynamics simulation to fish out important inter-subunit contacts which involve R668, E381, D309, etc residues. In summary, authors have explored the conformational dynamics of IDE protein using experimental approaches which are complemented and analyzed in atomic details by using MD simulation studies. The studies are meticulously conducted and lay the ground for future exploration of the protease structure-function relationship.

    1. Reviewer #1 (Public review):

      This work employs both in vitro and in vivo/transplant methods to investigate the contribution of BDNF/TrkB signaling to enhancing differentiation and dentin-repair capabilities of dental pulp stem cells in the context of exposure to a variety of inflammatory cytokines. A particular emphasis of the approach is the employment of dental pulp stem cells in which BDNF expression has been enhanced using CRISPR technology. Transplantation of such cells is said to improve dentin regeneration in a mouse model of tooth decay.

      The study provides several interesting findings, including demonstrating that exposure to several cytokines/inflammatory agents increases the quantity of (activated) phospho-Trk B in dental pulp stem cells.

      However, a variety of technical issues weaken support for the major conclusions offered by the authors. These technical issues include the following:

      (1) It remains unclear exactly how the cytokines tested affect BDNF/TrkB signaling. For example, in Figure 1C, TNF-alpha increases TrkB and phospho-TrkB immunoreactivity to the same degree, suggesting that the cytokine promotes TrkB abundance without stimulating pathways that activate TrkB, whereas in Figure 2D, TNF-alpha has little effect on the abundance of TrkB, while increasing phospho-TrkB, suggesting that it affects TrkB activation and not TrkB abundance.

      (2) I find the histological images in Figure 3 to be difficult to interpret. I would have imagined that DAPI nuclear stains would reveal the odontoblast layer, but this is not apparent. An adjacent section labeled with conventional histological stains would be helpful here. Others have described Stro-1 as a stem cell marker that is expressed on a minority of cells associated with vasculature in the dental pulp, but in the images in Figure 3, Stro-1 label is essentially co-distributed with DAPI, in both control and injured teeth, indicating that it is expressed in nearly all cells. Although the authors state that the Stro-1-positive cells are associated with vasculature, I see no evidence that this is true.

      (3) The data presented convincingly demonstrate that they have elevated BDNF expression in their dental pulp stem cells using a CRISPR-based approach. I have a number of questions about these findings. Firstly, nowhere in the paper do they describe the nature of the CRISPR plasmid they are transiently transfecting. Some published methods delete segments of the BDNF 3'-UTR while others use an inactivated Cas9 to position an active transactivator to sequences in the BDNF promoter. If it is the latter approach, transient transfection will yield transient increases in BDNF expression. Also, as BDNF employs multiple promoters, it would be helpful to know which promoter sequence is targeted, and finally, knowing the identity of the guide RNAs would allow assessment of the potential for off-target effects. I am guessing that the investigators employ a commercially obtained system from Santa Cruz, but nowhere is this mentioned. Please provide this information.

      (4) Another question left unresolved is whether their approach elevated BDNF, proBDNF, or both. Their 28 kDa western blot band apparently represents proBDNF exclusively, with no mature BDNF apparent, yet only mature BDNF effectively activates TrkB receptors. On the other hand, proBDNF preferentially activates p75NTR receptors. The present paper never mentions p75NTR, which is a significant omission, since other investigators have demonstrated that p75NTR controls odontoblast differentiation.

      (5) In any case, no evidence is presented to support the conclusion that the artificially elevated BDNF expression has any effect on the capability of the dental pulp stem cells to promote dentin regeneration. The results shown in Figures 4 and 5 compare dentin regeneration with BDNF-over-expressing stem cells with results lacking any stem cell transplantation. A suitable control is required to allow any conclusion about the benefit of over-expressing BDNF.

      (6) Whether increased BDNF expression is beneficial or not, the evidence that the BDNF-overexpressing dental pulp stem cells promote dentin regeneration is somewhat weak. The data presented indicate that the cells increase dentin density by only 6%. The text and figure legend disagree on whether the p-value for this effect is 0.05 or 0.01. In either case, nowhere is the value of N for this statistic mentioned, leaving uncertainty about whether the effect is real.

      (7) The final set of experiments applies transcriptomic analysis to address the mechanisms mediating functional differences in dental pulp stem cell behavior. Unfortunately, while the Abstract indicates "we conducted transcriptomic profiling of TNFα-treated DPSCs, both with and without TrkB antagonist CTX-B", that does not describe the experiment reported, which compared the transcriptome of control cells with cells simultaneously exposed to TNF-alpha and CTX-B. Since CTX-B blocks the functional response of cells to TNF-alpha, I don't understand how any useful interpretation can be attached to the data without controls for the effect of TNF alone and CTX-B alone.

    2. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigate the potential for overexpressing BDNF in dental pulp stem cells to enhance dentin regeneration. They suggest that in the inflammatory environment of injured teeth, there is increased signaling of TrkB in response to elevated levels of inflammatory molecules.

      Strengths:

      The potential application to dentin regeneration is interesting.

      Weaknesses:

      There are a number of concerns with this manuscript to be addressed.

      (1) Insufficient citation of the literature. There is a vast literature on BDNF-TrkB regulating survival, development, and function of neurons, yet there is only one citation (Zhang et al 2012) which is on Alzheimer's disease.

      (2) There are several incorrect statements. For example, in the introduction (line 80) TrkA is not a BDNF receptor.

      (3) Most important - Specific antibodies must be identified by their RRID numbers. To state that "Various antibodies were procured:... from BioLegend" is unacceptable, and calls into question the entire analysis. Specifically, their Western blot in Figure 4B indicates a band at 28 kDa that they say is BDNF, however the size of BDNF is 14 kDa, and the size of proBDNF is 32 and 37 kDa, therefore it is not clear what they are indicating at 28 kDa. The validation is critical to their analysis of BDNF-expressing cells.

      (4) Figure 2 indicates increased expression of TrkB and TrkA, as well as their phosphorylated forms in response to inflammatory stimuli. Do these treatments elicit increased secretion of the ligands for these receptors, BDNF and NGF, respectively, to activate their phosphorylation? Or are they suggesting that the inflammatory molecules directly activate the Trk receptors? If so, further validation is necessary to demonstrate that.

      (5) Figure 7 - RNA-Seq data, what is the rationale for treatment with TNF+ CTX-B? How does this identify any role for TrkB signaling? They never define their abbreviations, but if CTX-B refers to cholera toxin subunit B, which is what it usually refers to, then it is certainly not a TrkB antagonist.

    3. Reviewer #3 (Public review):

      In general, although the authors interpret their results as pointing towards a possible role of BDNF in dentin regeneration, the results are over-interpreted due to the lack of proper controls and focus on TrkB expression, but not its isoforms in inflammatory processes. Surprisingly, the authors do not study the possible role of p75 in this process, which could be one of the mechanisms intervening under inflammatory conditions.

      (1) The authors claim that there are two Trk receptors for BDNF, TrkA and TrkB. To date, I am unaware of any evidence that BDNF binds to TrkA to activate it. It is true that two receptors have been described in the literature, TrkB and p75 or NGFR, but the latter is not TrkA despite its name and capacity to bind NGF along with other neurotrophins. It is crucial for the authors to provide a reference stating that TrkA is a receptor for BDNF or, alternatively, to correct this paragraph.

      (2) The authors discuss BDNF/TrkB in inflammation. Is there any possibility of p75 involvement in this process?

      (3) The authors present immunofluorescence (IF) images against TrkB and pTrkB in the first figure. While they mention in the materials and methods section that these antibodies were generated for this study, there is no proof of their specificity. It should be noted that most commercial antibodies labeled as anti-TrkB recognize the extracellular domain of all TrkB isoforms. There are indications in the literature that pathological and excitotoxic conditions change the expression levels of TrkB-Fl and TrkB-T1. Therefore, it is necessary to demonstrate which isoform of TrkB the authors are showing as increased under their conditions. Similarly, it is essential to prove that the new anti-p-TrkB antibody is specific to this Trk receptor and, unlike other commercial antibodies, does not act as an anti-phospho-pan-Trk antibody.

      (4) I believe this initial conclusion could be significantly strengthened, without opening up other interpretations of the results, by demonstrating the specificity of the antibodies via Western blot (WB), both in the presence and absence of BDNF and other neurotrophins, NGF, and NT-3. Additionally, using WB could help reinforce the quantification of fluorescence intensity presented by the authors in Figure 1. It's worth noting that the authors fixed the cells with 4% PFA for 2 hours, which can significantly increase cellular autofluorescence due to the extended fixation time, favoring PFA autofluorescence. They have not performed negative controls without primary antibodies to determine the level of autofluorescence and nonspecific background. Nor have they indicated optimizing the concentration of primary antibodies to find the optimal point where the signal is strong without a significant increase in background. The authors also do not mention using reference markers to normalize specific fluorescence or indicating that they normalized fluorescence intensity against a standard control, which can indeed be done using specific signal quantification techniques in immunocytochemistry with a slide graded in black-and-white intensity controls. From my experience, I recommend caution with interpretations from fluorescence quantification assays without considering the aforementioned controls.

      (5) In Figure 2, the authors determine the expression levels of TrkA and TrkB using qPCR. Although they specify the primers used for GAPDH as a control in materials and methods, they do not indicate which primers they used to detect TrkA and TrkB transcripts, which is essential for determining which isoform of these receptors they are detecting under different stimulations. Similarly, I recommend following the MIQE guidelines (Minimum Information for Publication of Quantitative Real-Time PCR Experiments): the authors should report the amplification efficiency of their primers, the negative and positive controls used to validate both the primer concentration and the reaction, and the use of several stable reference genes, not just one.

      (6) Moreover, the authors claim they are using the same amounts of cDNA for qPCRs since they have quantified the amounts using a Nanodrop. Given that dNTPs are used during cDNA synthesis, and high levels remain after cDNA synthesis from mRNA, it is not possible to accurately measure cDNA levels without first cleaning it from the residual dNTPs. Therefore, I recommend that the authors clarify this point to determine how they actually performed the qPCRs. I also recommend using two other reference genes like 18S and TATA Binding Protein alongside GAPDH, calculating the geometric mean of the three to correctly apply the 2^-ΔΔCt formula.
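      The recommended normalization can be sketched as follows, assuming an amplification efficiency of 2 (perfect doubling per cycle) for every primer pair. The function names and all Ct values are hypothetical and serve only to illustrate the calculation.

```python
from statistics import geometric_mean


def relative_quantity(ct, efficiency=2.0):
    """Convert a Ct value into a relative template quantity."""
    return efficiency ** (-ct)


def normalized_expression(ct_target, ct_refs, efficiency=2.0):
    """Target quantity divided by the geometric mean of the reference quantities."""
    norm_factor = geometric_mean(
        [relative_quantity(ct, efficiency) for ct in ct_refs]
    )
    return relative_quantity(ct_target, efficiency) / norm_factor


def fold_change(treated, control, efficiency=2.0):
    """Equivalent to 2^-ddCt when efficiency is 2 for all primer pairs.

    treated/control: dicts with 'target' (Ct) and 'refs' (list of Ct values).
    """
    return (normalized_expression(treated["target"], treated["refs"], efficiency)
            / normalized_expression(control["target"], control["refs"], efficiency))


# Hypothetical Ct values; reference genes in the order GAPDH, 18S, TBP.
control = {"target": 28.0, "refs": [18.0, 20.0, 22.0]}
treated = {"target": 26.0, "refs": [18.0, 20.0, 22.0]}
fc = fold_change(treated, control)
```

      With an efficiency of 2, taking the geometric mean of the reference quantities is the same as averaging the reference Ct values, so in this example the normalized Ct difference falls from 8 to 6 cycles and the fold change is 4.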

      (7) Similarly, given that the newly generated antibodies have not been validated, I recommend introducing appropriate controls for the validation of in-cell Western assays.

      (8) The authors' conclusion that TrkB levels are minimal (Figure 2E) raises the question of whether what they are detecting in the previous experiments is actually the TrkB-Fl form. Therefore, it is essential to demonstrate beyond any doubt that both the antibodies used to detect TrkB and the primers used for qPCR are correct, and in the latter case, specify at which cycle (Ct) the basal detection of TrkB transcripts occurs. Treatment with TNF-alpha for 14 days could lead to increased cell proliferation or differentiation, potentially increasing overall TrkB transcript levels due to the number of cells in culture, not necessarily an increase in TrkB transcripts per cell.

      (9) Overall, there are reasonable doubts about whether the authors are actually detecting TrkB in the first three images, as well as the phosphorylation levels and localization of this receptor in the cells. For example, in Figure 3 A to J, it is not clear where TrkB is expressed, necessitating better resolution images and a magnified image to show in which cellular structure TrkB is expressed.

      (10) In Figure 4, the authors indicate they have generated cells overexpressing BDNF after recombination using CRISPR technology. However, the WB they show in Figure 4B, performed under denaturing conditions, displays a band at approximately 28 kDa. This WB is inconsistent with all published data on BDNF detection via this technique. I believe the authors should demonstrate BDNF presence by showing a WB with appropriate controls and BDNF appearing at 14 kDa to assume they are indeed detecting BDNF and that the cells are producing and secreting it. What antibodies have been used by the authors to detect BDNF? Have the authors validated them? There are some studies reporting the lack of specificity of certain commercial BDNF antibodies; therefore, it is necessary to show that the authors are convincingly detecting BDNF.

      (11) While the RNA sequencing data indicate changes in gene expression in cells treated with TNFalpha+CTX-B compared to control, the authors do not show a direct relationship between these genetic modifications with the rest of their manuscript's argument. I believe the results from these RNA sequencing assays should be put into the context of BDNF and TrkB, indicating which genes in this signaling pathway are or are not regulated, and their importance in this context.

    1. eLife Assessment

      The macromolecular organization of photosynthetic complexes within the thylakoids of higher plant chloroplasts has been a topic of significant debate. Using in situ cryo-electron tomography, this study reveals the native thylakoid architecture of spinach thylakoid membranes with single-molecule precision. The experimental methods are unique and compelling, providing important information for understanding the structural features that impact photosynthetic regulation in vascular plants and addressing several long-standing questions about the organization and regulation of photosynthesis.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors utilized in situ cryo-electron tomography (cryo-ET) to uncover the native thylakoid architecture of spinach chloroplasts and mapped the molecular organization of these thylakoids with single-molecule resolution. The obtained images show the detailed ultrastructural features of grana membranes and highlight interactions between thylakoids and plastoglobules. Interestingly, despite the distinct three-dimensional architecture of vascular plant thylakoids, their molecular organization closely resembles that of green algae. The pronounced lateral segregation of PSII and PSI was observed at the interface between appressed and non-appressed thylakoid regions, without evidence of a specialized grana margin zone where these complexes might intermix. Furthermore, unlike in isolated thylakoid membranes, photosystem II (PSII) in intact chloroplasts did not form a semi-crystalline array and was distributed uniformly within the membrane plane and across stacked grana membranes. Based on the above observations, the authors propose a simplified two-domain model for the molecular organization of thylakoid membranes that can apply to both green algae and vascular plants. This study suggests that the general understanding of the functional separation of thylakoid membranes in vascular plants should be reconsidered.

      Strengths:

      By employing and refining AI-driven computational tools for the automated segmentation of membranes and identification of membrane proteins, this study successfully quantifies the spatial organization of photosynthetic complexes both within individual thylakoid membranes and across neighboring stacked membranes.

      Weaknesses:

      This study's weakness is that it requires the use of chloroplasts isolated from leaves and the need to freeze them on a grid for observation, so it is unclear to what extent the observations reflect physiological conditions. In particular, the mode of existence of the thylakoid membrane complexes seems to be strongly influenced by the physicochemical environment surrounding the membranes, as indicated by the different distribution of PSII between intact chloroplasts and those with ruptured envelope membranes.

    3. Reviewer #2 (Public review):

      Summary:

      For decades, the macromolecular organization of photosynthetic complexes within the thylakoids of higher plant chloroplasts has been a topic of significant debate. Using focused ion beam milling, cryo-electron tomography, and advanced AI-based image analysis, the authors compellingly demonstrate that the macromolecular organization in spinach thylakoids closely mirrors the patterns observed in their earlier research on Chlamydomonas reinhardtii. Their findings provide strong evidence challenging long-standing assumptions about the existence of a 'grana margin', a region at the interface between grana and stroma lamellae domains that was thought to contain intermixed particles from both areas. Instead, the study establishes that this mixed zone is absent and reveals a distinct, well-defined boundary between the grana and stroma lamellae.

      Strengths:

      By situating high-resolution structural data within the broader cellular context, this work contributes valuable insights into the molecular mechanisms governing the spatial organization of photosynthetic complexes within thylakoid membranes.

    1. eLife Assessment

      This paper provides an important proposal for why learning can be much faster and more accurate if synapses have a fast component that immediately corrects errors, as well as a slower component that corrects behavior averaged over a longer timescale. It is convincingly shown that integrating these two learning timescales improves performance compared to classical strategies, particularly in terms of robustness and generalization when learning new target signals. However, the biological plausibility and justification for the proposed rapid learning mechanism require further elaboration and supporting mechanistic examples.

    2. Reviewer #1 (Public review):

      Summary:

      This paper proposes a new set of local synaptic plasticity rules that differs from classic rules in two regards: First, working under the assumption that signals coming into synapses change smoothly over time and thus have temporal correlations such that immediate activity is positively correlated with subsequent activity, it proposes both fast plasticity that immediately corrects errors as well as slower plasticity. Second, it derives these rules from optimal, Bayesian control theory principles that, even without the fast component of plasticity, are shown to provide more accurate performance than classic, non-Bayesian plasticity rules. As a proof of principle, it applies these to a simple cerebellar learning example that demonstrates how the proposed rules lead to learning performance that exceeds that achieved with classic cerebellar learning rules. The work also provides a potential normative explanation for post-climbing fiber spike pauses in Purkinje cell firing and proposes testable predictions for cerebellar experiments. Overall, I found the idea to be compelling and potentially broadly applicable across many systems. Further, I thought the work was a rare, very beautiful display of the application of optimal control theory to fundamental problems in neuroscience. My comments are all relatively minor and more expressions of interest than criticism.

      Comments:

      (1) The algorithm assumes, reasonably, that inputs are relatively smooth. However, I was wondering if this could make additional experimental predictions for the system being exceptionally noisy or otherwise behaving in signature ways if one were able to train a real biological network to match a rapidly changing or non-smooth function that does not align with the underlying assumptions of the model.

      (2) The algorithm assumes that one can, to a good approximation, replace individual input rates by their across-synapse average. How sensitive is the learning to this assumption, as one might imagine scenarios where a neuron is sensitive to different inputs for different tasks or contexts so that a grand average might not be correct? Or, the functional number of inputs driving the output might be relatively low or otherwise highly fluctuating and less easily averaged over.

      (3) On the cerebellar example, it is nice that the Bayesian example provides a narrower PF-CF interval for plasticity than the classical rules, but the window is not nearly as narrow as the Suvrathan et al. 2016 paper cited by the authors. Maybe this is something special about that system having well-defined, delayed feedback, but (optional) further comments or insights would be welcome if available.

      (4) In the discussion, I appreciated the comparison with the Deneve work which has fast and slow feedback components. I was curious whether, although non-local, there were also conceptual similarities with FORCE learning in which there is also an immediate correction of activity through fast changing of synaptic weights, which then aids the slow long-term learning of synaptic weights.

    3. Reviewer #2 (Public review):

      Summary:

      Bricknell and Latham investigate the computational benefits of a dual-learning algorithm that combines a rapid, millisecond-scale weight adjustment mechanism with a conventional, slower gradient descent approach. A feedback error signal drives both mechanisms at the synaptic level.

      Strengths:

      Integrating these two learning timescales is intriguing and demonstrates improved performance compared to classical strategies, particularly in terms of robustness and generalization when learning new target signals.

      Weaknesses:

      The biological plausibility and justification for the proposed rapid learning mechanism require further elaboration and supporting mechanistic examples.

    1. eLife Assessment

      This work presents an important genetic toolkit for Drosophila neurobiologists to access and manipulate neuronal lineages during development and adulthood. The evidence supporting the fidelity of this toolkit is convincing. This work will interest Drosophila neurobiologists in general, and some of the genetic tools may be used outside the nervous system. The conceptual approaches used in this paper are likely transferable to other fields as comparable data and genomic methods are obtained.

    2. Reviewer #1 (Public review):

      The ventral nerve cord (VNC) of organisms like Drosophila is an invaluable model for studying neural development and organisation in more complex organisms. Its well-defined structure allows researchers to investigate how neurons develop, differentiate, and organise into functional circuits. As a critical central nervous system component, the VNC plays a key role in controlling motor functions, reflexes, and sensory integration.

      Particularly relevant to this work, the VNC provides a unique opportunity to explore neuronal hemilineages - groups of neurons that share molecular, genetic, and functional identities. Understanding these hemilineages is crucial for elucidating how neurons cooperate to form specialized circuits, essential for comprehending normal brain function and dysfunction.

      A significant challenge in the field has been the lack of developmentally stable, hemilineage-specific driver lines that enable precise tracking and measurement of individual VNC hemilineages. The authors address this need by generating and validating a comprehensive, lineage-specific split-GAL4 driver library.

      Strengths and weaknesses

      The authors select new marker genes for hemilineages from previously published single-cell data of the VNC. They generate and validate specific and temporally stable lines for almost all the hemilineages in the VNC. They successfully achieved their aims, and their results support their conclusions. This will be a valuable resource for investigating neural circuit formation and function.

    3. Reviewer #2 (Public review):

      It is my pleasure to review this manuscript from Soffers, Lacin, and colleagues, in which they identify pairs of transcription factors unique to (almost) every ventral nerve cord hemilineage in Drosophila and use these pairs to create reagents to label and manipulate these cells. The advance is sold as largely technical: a pipeline for identifying durably expressed transcription factor codes in postmitotic neurons from single cell RNAseq data, generating knock-in alleles in the relevant genes, using these to match transcriptional cell types to anatomic cell types, and then using the alleles as a genetic handle on the cells for downstream explication of their function. Yet I think the work is gorgeous in linking the expression of genes that are causal for neuron-type-specific characteristics to the anatomic instantiations of those neurons. It is astounding that the authors are able to use their deep collective knowledge of hemilineage anatomy and gene expression to match 33 of 34 transcriptional profiles. Together with other recent studies, this work drives a major course correction in developmental biology, away from empirically identified cell type "markers" (in Drosophila neuroscience, often genomic DNA fragments that contain enhancers found to be expressed in specific neurons at specific times), and towards methods in which the genes that generate neuronal type identity are actually used to study those neurons. Because the relationship between fate and form/function is built into the tools, I believe that this approach will be a trojan horse to integrate the fields of neural development and systems neuroscience.

    4. Reviewer #3 (Public review):

      Summary:

      Soffers et al. developed a comprehensive genetic toolkit that enables researchers to access neuronal hemilineages during developmental and adult time points using scRNA-seq analysis to guide gene cassette exchange-based or CRISPR-based tool building. Currently, research groups studying neural circuit development are challenged with tying together findings in the development and mature circuit function of hemilineage-related neurons. Here, the authors leverage publicly available scRNA-seq datasets to inform the development of a split-Gal4 library that targets 32 of 34 hemilineages in development and adult stages. The authors demonstrated that the split-Gal4 library, or genetic toolkit, can be used to assess the functional roles, neurotransmitter identity, and morphological changes in targeted cells. The tools presented in this study should prove to be incredibly useful to Drosophila neurobiologists seeking to link neural developmental changes to circuit assembly and mature circuit function. Additionally, some hemilineages have more than one split-Gal4 combination, which will be advantageous for studies seeking to disrupt associated upstream genes.

      Strengths:

      Informing genetic tool development with publicly available scRNA-seq datasets is a powerful approach to creating specific driver lines. Additionally, this approach can be easily replicated by other researchers looking to generate similar driver lines for more specific subpopulations of cells, as mentioned in the Discussion.

      The unification of optogenetic stimulation data of 8B neurons and connectomic analysis of the Giant-Fiber-induced take-off circuit was an excellent example of the utility of this study. The link between hemilineage-specific functional assays and circuit assembly has been limited by insufficient genetic tools. The tools and data present in this study will help better understand how collections of hemilineages develop in a genetically constrained manner to form circuits amongst each other selectively.

      Weaknesses:

      Although cell position, morphology (to some extent), and gene expression are good markers to track cell identity across developmental time, there are genetic tools available that could have been used to permanently label cells that expressed genes of interest from birth, ensuring that the same cells are being tracked in fixed tissue images.

      Although gene activation is a good proxy for assaying neurochemical features, relying on whether neurochemical pathway genes are activated in a cell to determine its phenotype can be misleading given that the Trojan-Gal4 system commandeers the endogenous transcriptional regulation of a gene but not its post-transcriptional regulation. Therefore, neurochemical identity is best identified via protein detection. (strong language used in this section of the paper).

      The authors mainly rely on the intersectional expression of transcription factors to generate split-Gal4 lines and target hemilineages specifically. However, the Introduction (Lines 97-99) makes a notable point about how driver lines in the past, which have also predominantly relied on the regulatory sequences of transcription factors, lack the temporal stability to investigate hemilineages across time. This point seems to directly conflict with the argument made in the Results (Lines 126-127) that states that most transcription factors are stably expressed in hemilineage neurons that express them. It is generally known that transcription factors can be expressed stably or transiently depending on the context. It is unclear how using the genes of transcription factors in this study circumvents the issue of creating temporally stable driver lines.

    1. eLife Assessment

      This study introduces a useful toolkit for zebrafish transgenesis, significantly enhancing the flexibility and efficiency of transgene generation for immunological applications. The authors provide supporting evidence through well-designed experiments, demonstrating the toolkit's utility in generating diverse and functional transgenic lines. While the findings are solid, additional functional validation and broader comparisons to existing systems would strengthen the overall evidence base and ensure broader relevance to the zebrafish field, thereby increasing the significance of the study.

    2. Reviewer #1 (Public review):

      Summary:

      The authors introduce ImPaqT, a modular toolkit for zebrafish transgenesis, utilizing the Golden Gate cloning approach with the rare-cutting enzyme PaqCI. The toolkit is designed to streamline the construction of transgenes with broad applications, particularly for immunological studies. By providing a versatile platform, the study aims to address limitations in generating plasmids for zebrafish transgenesis.

      Strengths:

      The ImPaqT toolkit offers a modular method for constructing transgenes tailored to specific research needs. By employing Golden Gate cloning, the system simplifies the assembly process, allowing seamless integration of multiple genetic elements while maintaining scalability for complex designs. The toolkit's utility is evident from its inclusion of a diverse range of promoters, genetic tools, and fluorescent markers, which cater to both immunological and general zebrafish research needs. Furthermore, the modular design ensures expandability, enabling researchers to customize constructs for diverse experimental designs. The validation provided in the manuscript is solid, demonstrating the successful generation of several functional transgenic lines. These examples highlight the toolkit's efficacy, particularly for immune-focused applications.

      Weaknesses:

      While the toolkit's technical capabilities are well-demonstrated, there are several areas where additional validation and examples could enhance its impact. One limitation is the lack of data showing whether the toolkit can be directly used for rapid cloning and testing of enhancers or promoters, particularly cloning them directly from PCR using PaqCI overhangs without needing an entry vector. Similarly, the feasibility of cloning genes directly from PCR products into the system is not demonstrated, which would significantly increase the utility for researchers working with genomic elements.

      The authors discuss potential applications such as using the toolkit for tissue-specific knockout applications by assembling CRISPR/Cas9 gRNA constructs. However, they do not demonstrate the cloning of short fragments, such as gRNA sequences downstream of a U6 promoter, which would be an important proof-of-concept to validate these applications. Furthermore, while the manuscript focuses on macrophage-specific promoters, the widely used mpeg1.1 promoter is not included or tested, which limits the toolkit's appeal for researchers studying macrophages and microglia.

      Another potential limitation is the handling of sequences containing PaqCI recognition sites. Although the authors discuss domestication to remove these sites, a demonstration of cloning strategies for such cases or alternative methods to address these challenges would provide practical guidance for users.

    3. Reviewer #2 (Public review):

      Summary:

      Hurst et al. developed a new Tol2-based transgenesis system ImPaqT, an Immunological toolkit for PaqCI-based Golden Gate Assembly of Tol2 Transgenes, to facilitate the production of transgenic zebrafish lines. This Golden Gate assembly-based approach relies on only a short 4-base pair overhang sequence in their final construct, and the insertion construct and backbone vector can be assembled in a single-tube reaction using PaqCI and ligase. This approach is also expandable by introducing new overhang sequences while maintaining compatibility with existing ImPaqT constructs, allowing users to add fragments as needed.

      Strengths:

      The generation of several lines of transgenic zebrafish for the immunologic study demonstrates the feasibility of the ImPaqT in vivo. The lineage tracing of macrophages by LPS injection shows this approach's functionality, validating its usage in vivo.

      Weaknesses:

      (1) There is no quantitative data analysis showing the percentage of off-target based on these 4-bp overhang sequences.

      (2) There is no statement for the upper limitation of the expandability.

      (3) There is no data about any potential side effect on their endogenous function of promoter/protein of interest with the ImPaqT method.

    1. eLife Assessment

      This manuscript introduces Opto-PKCε, an optogenetic tool that enabled important findings derived from interactome and phosphoproteome studies. Light-dependent recruitment of Opto-PKCε to the plasma membrane revealed the specific phosphorylation of the insulin receptor at Thr 1160. In turn, recruitment to mitochondria led to phosphorylation of the complex I subunit NDUFS4, correlating with reduced spare respiratory capacity. The evidence supporting these conclusions is solid, although additional clarification on data analysis would further enhance readability.

    2. Reviewer #1 (Public review):

      Summary:

      Optogenetic tools enable very precise spatiotemporal control of signaling pathways. The authors developed an optimized light-regulated PKC epsilon, Opto-PKCepsilon, using AlphaFold for rational design. Interactome and phosphoproteome studies of light-activated Opto-PKCepsilon confirmed a high similarity of interaction partners to PMA-stimulated wild-type PKCepsilon and high specificity for PKCepsilon substrates. Light-dependent recruitment of Opto-PKCepsilon to the plasma membrane revealed the specific phosphorylation of the insulin receptor at Thr 1160, whereas recruitment to mitochondria revealed phosphorylation of the complex I subunit NDUFS4, correlating with reduced spare respiratory capacity. The interactome and phosphoproteome studies confirm the functionality of Opto-PKCepsilon.

      Strengths:

      AlphaFold simulations enable the design of an optimized Opto-PKCepsilon with respect to dark-light activity. Opto-PKCepsilon is a versatile tool to study the function of PKCepsilon in a precisely controlled manner.

      Weaknesses:

      Light-controlled PKCepsilon was recently reported by Gada et al. (2022). Ong et al. developed an optimized Opto-PKCepsilon and presented in their manuscript the potential of this tool for controlling signaling pathways. However, some data have to be improved and appropriate controls are still missing for some experiments.

      Major comments:

      (1) The group of proteins detected as phosphorylated PKC substrates (phospho-Ser PKC substrate antibody) induced by Opto-PKCepsilon varies significantly between Figure 1 C and Figure 2 C. Have the authors any explanation for this? Do both figures show similar areas of the membrane? The size marker indicates that this is not the case.

      (2) The ratio of endogenous to exogenous PKCepsilon is quite different in the experiments shown in Figure 1 C and Figure 2 C. What is the reason for this effect?

      (3) In addition to the overall phosphorylation of PKC substrates, the PKCepsilon mutants should be tested for phosphorylation of a known PKCepsilon substrate. The phosphorylation of the insulin receptor at Thr 1160 by Opto-PKCepsilon (see Figure 6) is very convincing and would provide clearer results for comparing the mutants.

      (4) The quality of the fluorescence images shown in Figure 5 is poor and should be improved. In addition, a MitoTracker dye for mitochondria labeling should be included to confirm the mitochondrial localization of Opto-PKCepsilon.

      (5) Figure S6 shows a light experiment in the absence of insulin, as stated in the headline of the figure legend and in the main text. Does this mean that Figure 6B shows an experiment in which the cells were exposed to light in the presence of insulin? If so, this should be mentioned in the legend of the figure and in the main text. What influence does insulin have on IR phosphorylation at Thr 1160?

      (6) The signal of NDUFS4 phosphorylation induced by Opto-PKCepsilon is weak in the experiment shown in Figure 7E. What about the effect of shorter and longer exposure times? How many times was this experiment repeated?

    3. Reviewer #2 (Public review):

      Summary:

      The authors developed an optogenetic tool (Opto-PKCε) and demonstrated spatiotemporal control of optoPKCε at different subcellular compartments such as the plasma membrane or mitochondria. Signaling outcomes of optoPKCε were characterized by phosphoproteomics and biochemical analysis of downstream signaling effectors.

      Strengths:

      (1) Conventional strategy to activate PKC often involves activation of multiple downstream signaling pathways. This work showcases an alternative strategy that could help dissect the effect of specific PKC-elicited signaling outcomes.

      (2) The differential phosphoproteomic analysis of PKC substrates between PMA stimulation and optoPKCε activation is insightful. A follow-up question is whether co-transfection of CIBN-GFP-CaaX and optoPKCε increases the pool of substrate compared to optoPKCε only, or optoPKCε activation at the plasma membrane is more effective in phosphorylating its substrates?

      (3) The finding that PKC activation at the plasma membrane is required for insulin receptor activation is interesting. Why does Thr1160 phosphorylation lead to a reduction of Thr1158/1162/1163? Does "insulin-stimulated" imply that insulin was administered in the culture during optogenetic stimulation? Also, did the authors observe any insulin receptor endocytosis upon optoPKCε activation?

      Weaknesses:

      (1) When citing the previous work on optogenetics, the reviewer believes a broader scope of papers (reviews) and recent research articles should be cited, especially those that used similar strategies, i.e., membrane translocation followed by oligomerization (of cryptochrome), as reported in this work.

      (2) In terms of molecular modeling, how would the authors enable AlphaFold3 structure prediction of activated optoPKCε (or the blue-light stimulated state of cryptochrome)? Current methods only describe that "To generate models of the monomer, an amino acid sequence corresponding to Opto-PKCɛ, 2 ATPs and 1 FAD were used as input whereas for the tetramer, copies of Opto-PKCɛ, 8 ATPs and 4 FADs were used as input" (likely missing "four" between "tetramer" and "copies"). However, simply putting four monomers together would not ensure that each monomer is in the "activated" state, which involves excitation of the FAD cofactor and likely conformational changes in cryptochrome.

      (3) It would be helpful if the authors could help interpret some results. For example, Figure S1: Were the puncta of mCherry-PKCε on the plasma membrane or within the cytosol? Also, why does optoPKCε only work when PKCε is fused at the C-terminus? When screening for the optoPKCε system with the largest light-to-dark contrast, the AGC domain was truncated. What is the physiological function of AGC? Does AGC removal limit PKC's access to its endogenous substrates?

    1. eLife Assessment

      The study reports valuable findings on the nature of genotype-by-climate interaction, parameterised in a framework that allows integrating genetics and ecophysiological variation in switchgrass. The evidence provided is solid overall but the analysis could be improved to better support some of the claims.

    2. Reviewer #1 (Public review):

      Summary:

      The authors present results and analysis of an experiment studying the genetic architecture of phenology in two geographically and genetically distinct populations of switchgrass when grown in 8 common gardens spanning a wide range of latitudes. They focused primarily on two measures of phenology - the green-up date in the spring, and the date of flowering. They observed generally positive correlations of flowering date across the latitudinal gradient, but negative correlations between northern and southern (i.e. Texas) green-up dates. They use GWAS and multivariate meta-analysis methods to identify and study candidate genetic loci controlling these traits and how their effect sizes vary across these gardens. They conclude that much of the genetic architecture is garden-specific, but find some evidence for photoperiod and rainfall effects on the locus effect sizes.

      Strengths:

      The strengths of the study are in the large scale and quality of the field trials, the observation of negative correlations among genotypes across the latitudinal gradient, and the importance of the central questions: Can we predict how genetic architecture will change when populations are moved to new environments? Can we breed for more/less sensitivity to environmental cues?

      Weaknesses:

      I have tried hard to understand the concept of the GxWeather analysis presented here, but still do not see how it tests for interactions between weather and genetic effects on phenology. I may just not understand it correctly, but if so, then I think more clarity in the logical model would help - maybe a figure explaining how this approach can detect genotype-weather interactions. Also, since this is a proposal for a new approach to detecting gene-environment effects, simulations would be useful to show power and false positive rates, or other ways of validating the results. The QTL validation provided is not very convincing because the same trials and the same ways of calculating weather values are used again, so it's not really independent validation, plus the QTL intervals are so large that overlap between QTL and GWAS is not very strong evidence.

      The term "GxWeather" is never directly defined, but based on its pairing with "GxE" on page 5, I assumed it means an interaction between genotypes (either plant lines or genotypes at SNPs) and weather variables, such that different genotypes alter phenology differently as a response to a specific change in weather. For example, some genotypes might initiate green-up once daylengths reach 12 hours, but others require 14 hours. Alternatively (equivalently), an SNP might have an effect on green-up at 12 hours (among plants that are otherwise physiologically ready to trigger green-up on March 21, only those with a genotype trigger), while no effect on green-up with daylengths of 14 hours (e.g., if plants aren't physiologically ready to green up until June when daylengths are beyond 14 hours, both aa and AA genotypes will green up at the same time, assuming this locus doesn't affect physiological maturity).

      Either way, GxE and (I assume) GxWeather are typically tested in one of two ways. Either genotype effects are compared among environments (which differ in their mean value for weather variables) and GxWeather would be inferred if environments with similar weather have similar genotype effects. Or a model is fit with an environmental (maybe weather?) variable as a covariate and the genotype:environment interaction is measured as a change of slope between genotypes. Basically, the former uses effect size estimates across environments that differ in mean for weather, while the latter uses variation in weather within an experiment to find GxWeather effects.
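      The second approach can be illustrated with a minimal sketch. All values below are hypothetical and are not taken from the manuscript: green-up day is regressed on day length separately for two genotypes at one SNP, and the difference between the two fitted slopes is the genotype:environment interaction.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical covariate: day length (hours) experienced by the plants.
day_length = [12.0, 13.0, 14.0, 15.0]

# Hypothetical green-up day of year for two genotypes at one SNP.
greenup_aa = [100.0, 98.0, 96.0, 94.0]
greenup_AA = [100.0, 95.0, 90.0, 85.0]

slope_aa = ols_slope(day_length, greenup_aa)
slope_AA = ols_slope(day_length, greenup_AA)

# A non-zero difference in slopes is the genotype-by-weather interaction.
interaction = slope_AA - slope_aa
print(slope_aa, slope_AA, interaction)
```

      In a real analysis the interaction term would be estimated with its standard error in a single model; the two separate fits here only make the "change of slope between genotypes" concrete.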

      However, the analytical approach here seems to combine these in a non-intuitive way and I don't think it can discover the desired patterns. As I understand from the methods, weather-related variables are first extracted for each genotype in each trial based on their green-up or flowering date, so within each trial each genotype "sees" a different value for this weather variable. For example, "daylength 14 days before green-up" is used as a weather variable. The correlation between these extracted genotype-specific weather variables across the 8 trials is then measured and used as a candidate mixture component for the among-trial covariance in mash. The weight assigned to these weather-related covariance matrices is then interpreted as evidence of genotype-by-weather interactions. However, the correlation among genotypes between these weather variables does not measure the similarity in the weather itself across trials. Daylengths at green-up are very different in MO than SD, but the correlation in this variable among genotypes is high. Basically, the correlation/covariance statistic is mean-centered in each trial, so it loses information about the mean differences among trials. Instead, the covariance statistic focuses on the within-trial variation in weather. But the SNP effects are not estimated using this within-trial variation, they're main effects of the SNP averaged over the within-trial weather variation. Thus it is not clear to me that the interpretation of these mash weights is valid. I could see mash used to compare GxWeather effects modeled in each trial (using the 2nd GxE approach above), but that would be a different analysis. As is, mash is used to compare SNP main effects across trials, so it seems to me this comparison should be based on the average weather differences among trials.
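      The mean-centering point can be checked numerically. The sketch below uses invented day lengths at green-up for five genotypes in two gardens (the numbers are assumptions for illustration only): shifting one garden's values by two hours leaves the across-genotype correlation unchanged, so the correlation carries no information about the mean weather difference between gardens.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical day length (hours) at green-up for the same five genotypes.
south = [11.8, 12.0, 12.1, 12.3, 12.5]
north = [13.9, 14.0, 14.2, 14.3, 14.6]

# Remove the two-hour offset between gardens.
north_shifted = [v - 2.0 for v in north]

r_original = pearson(south, north)
r_shifted = pearson(south, north_shifted)

# Mean difference in day length between the two gardens.
mean_gap = sum(north) / len(north) - sum(south) / len(south)
print(r_original, r_shifted, mean_gap)
```

      The two correlations agree to floating-point precision even though the gardens differ by about two hours of day length on average.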

      A further issue with this analysis is that the weather variables don't take into account the sequence of weather events. If one genotype flowers after the 1st rain event and the second flowers after the 2nd rain event, they can get the same value for the cumulative rainfall 7d variable, but the lack of response after the 1st rain event is the key diagnostic for GxWeather. There's also the issue of circularity. Since weather values are defined based on observed phenology dates, they're effectively caused by the phenology dates. So then asking if they are associated with phenology is a bit circular. Also, it takes a couple of weeks after flowering is triggered developmentally before flowers open, so the < 2-week lags don't really make developmental sense.

      Thus, I don't think this sentence in the abstract is a valid interpretation of the analysis: "in the Gulf subpopulation, 65% of genetic effects on the timing of vegetative growth covary with day length 14 days prior to green-up date, and 33% of genetic effects on the timing of flowering covary with cumulative rainfall in the week prior to flowering". There's nothing in this analysis that compares the genetic effects under 12h days to genetic effects under 14h days (as an example), or genetic effects with no rainfall prior to flowering to genetic effects with high rainfall prior to flowering. I think the only valid conclusion is: "65% of SNPs for green-up have a GxE pattern that mirrors the similarity in relationships between green-up and day length among trials." However, I don't know how to interpret that statement in terms of the overall goals of the paper.

      Next, I am confused about the framing in the abstract and the introduction of the GxE within and between subpopulations. The statement: "the key expectation that different genetic subpopulations, and even different genomic regions, have likely evolved distinct patterns of GxE" needs justification or clarification. The response to an environmental factor (ie plasticity) is a trait that can evolve between populations. This happens through the changing frequencies of alleles that cause different responses. But this doesn't necessarily mean that patterns of GxE are changing. GxE is the variance in plasticity. When traits are polygenic, population means can change a lot with little change in variance within each population. Most local adaptation literature is focused on changes in mean trait values or mean plasticities between populations, not changes in the variance of trait values or plasticities within populations. Focusing on the goal of this paper, differences in environmental or weather responses between the populations are interesting (Figure 1). However the comparisons of GxE between populations and with the combined population are hard to interpret. GxE within a population means that that population is not fixed for this component of plasticity, meaning that it likely hasn't been strongly locally selected. Doesn't this mean that in the context of comparing the two populations, loci with GxE within populations are less interesting than loci fixed for different values between populations? Also, if there is GxE in the Gulf population, by definition it is also present in the "Both" population. Not finding it there is just a power issue. If individuals in the two subpopulations never cross, the variance across the "Both" population isn't relevant in nature, it's an artificial construct of this experimental design. 
I wonder if there is confusion about the term "genetic" in GxE and as used in the first paragraph of the intro ("Genetic responses" and "Genetic sensitivity"). These sentences would be most clear if the "genetic" term referred to the mechanistic actions of gene products. But the rest of the paper is about genetic variation, ie the different effects of different alleles at a locus. I don't think this latter definition is what these first uses intend, which is confusing.
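      The earlier statement that polygenic trait means can change a lot with little change in within-population variance can be checked with a minimal sketch. It assumes a purely additive trait, Hardy-Weinberg proportions, independent loci, and equal effect sizes; the number of loci, effect size, and allele frequencies are hypothetical and chosen only for illustration.

```python
def additive_mean_and_variance(freqs, effect):
    """Mean and additive genetic variance of a trait summed over independent
    biallelic loci, each with the same per-allele effect."""
    mean = sum(2.0 * p * effect for p in freqs)
    variance = sum(2.0 * p * (1.0 - p) * effect ** 2 for p in freqs)
    return mean, variance

n_loci = 100      # hypothetical number of causal loci
effect = 1.0      # hypothetical per-allele effect (arbitrary units)

# Population 1: all loci at frequency 0.50.
# Population 2: every locus shifted slightly, to 0.55.
mean_1, var_1 = additive_mean_and_variance([0.50] * n_loci, effect)
mean_2, var_2 = additive_mean_and_variance([0.55] * n_loci, effect)

# Express the mean difference in units of the within-population SD.
shift_in_sd = (mean_2 - mean_1) / var_1 ** 0.5
variance_ratio = var_2 / var_1

print(f"mean shift: {shift_in_sd:.2f} within-population SDs")
print(f"variance ratio (pop 2 / pop 1): {variance_ratio:.3f}")
```

      In this toy setting a small frequency change at every locus moves the population mean by more than one within-population standard deviation, while the variance changes by about one percent.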

      Note that the cited paper (26) is not relevant to this discussion about GxE patterns. This paper discusses the precision of estimating sub-group-specific genetic effects. With respect to the current paper, reference 26 shows that you might get more accurate measures of the SNP effects in the Gulf population using the full "Both" population dataset because the sample size is larger, as long as the true effects are not that different between populations. That paper is not focused on whether effect size variation is caused by evolution but on the technical question of whether GxG or GxE impacts the precision of within-group effect size estimates. The implication of paper 26 is that comparing SNP effects estimated in the "Both" population among gardens might be more powerful for detecting GxE than using only Gulf samples, even if there is some difference in SNP effects among populations. But if the magnitudes (or directions) of SNP effects change a lot among populations (i.e. not just changes in allele frequency), then modeling the populations separately will be more accurate.

    3. Reviewer #2 (Public review):

      The provided evidence in the study by MacQueen and colleagues is convincing, albeit some methodological challenges still exist. The authors rightly state that different subpopulations are likely to have evolved distinct patterns of GxE. It has been recently shown that the genetic architecture for adaptive traits differs across subpopulations (Lopez-Arboleda et al. 2021), hence this effect should be even more pronounced for GxE. How to best account for this in a statistical framework is not entirely clear. Here the authors describe their efforts to assess these interactions and to estimate the magnitude of the respective effects. Building on the statistical framework described, it could be possible to translate their findings from switchgrass to other species. A plus of the study is the effort to use an independent pseudo-F2 population to confirm the found associations.

      The manuscript is written coherently, and all data and code used are freely available and explained in detail in the supplementary information.

      Nevertheless, I feel that there are some points in the data analysis that could be clarified some more.

      (1) Dividing GxE interactions into discrete, measurable GxWeather terms is a nice idea to gain a reliable measurement of E. I also appreciate the effort to create date-related values as a summary function of a weather variable across a specified date range. Using cumulative data the week prior to flowering seems like a good choice to associate weather patterns to this phenotype, but there are many - including non-linear ways - to accumulate these data. Additionally, weather parameters like temperature and precipitation can show interaction effects. I wonder if there is a way to consider these.

      (2) As pointed out in Section S1, a trait measured in eight common gardens could be modeled as eight genetically correlated traits. To assess the genetic correlation one would need to estimate the genetic variance within each trait and 28 genetic covariance structures. Here model convergence would be painful given the sample sizes. There are different statistical solutions for this, including the mash algorithm the authors chose. I highly appreciate the effort in how the rationale is described in the supplementary information, but to me, it is still not completely clear how 'strong' and random effects have been selected from GWAS. How sensitive is the model to a selection of different effects? Could one run permutations to assess this? Why is the number of total markers different for different phenotypes and subsets and does this affect statistical power?
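      The count of 28 covariances in point (2) follows from the number of unordered pairs among eight traits. A short sketch of that arithmetic is given below; the function name is illustrative and not taken from the manuscript or from mash.

```python
from math import comb

def n_covariance_parameters(n_traits):
    """Count the free parameters of an unstructured genetic covariance matrix:
    one variance per trait plus one covariance per unordered pair of traits."""
    variances = n_traits
    covariances = comb(n_traits, 2)
    return variances, covariances, variances + covariances

# Eight common gardens treated as eight genetically correlated traits.
variances, covariances, total = n_covariance_parameters(8)
print(variances, covariances, total)
```

      For eight gardens this gives 8 variances and 28 covariances, i.e. 36 parameters in total, which gives a sense of why fitting the fully unstructured model is demanding at these sample sizes.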

      (3) The mash model chooses different covariance matrices for the different analyses. Although I do understand the rationale for this, I am not sure how this will impact the respective analysis and how comparable the results are. Would one not like to have the same covariance matrices selected for all analyses?

      (4) Although the observed pattern of different GxE in different subpopulations is intriguing, it remains a little unclear what we actually learn apart from the fact that GxE in adaptive traits is complex. Figure 3 divides GxE into sign and magnitude effects. Interestingly the partition differs significantly between Greenup date and Flowering Date. Still, the respective QTLs in Figure 4 do - at least partially - overlap (e.g. on CHR05N). What is the interpretation of these? Here, I would appreciate a more detailed discussion and hearing the thoughts of the authors.

      (5) Figure 4 states that Stars indicate QTLs with significant enrichment for SNPs in the 1% mash tail. The shown Rug plots indicate this, but unfortunately, I am missing the respective stars. Is there a way to identify what is underlying these QTLs?

      To summarize, the manuscript nicely shows the complex nature of GxE in different switchgrass subpopulations. The goal now would be to identify the causative alleles for these phenomena and understand how these have evolved. Here the provided study paves the way for further analyses in this perspective.

    1. eLife Assessment

      This paper reports the fundamental finding of how Raman spectral patterns correlate with proteome profiles. The authors then go further to show that this can be used to infer global stoichiometric regulation of the proteomes. These findings are likely general, and the authors provide compelling evidence by analyzing bacterial and human cells, but there are some suggestions provided below to make the work clearer and more accessible for it to reach a broader audience.

    2. Reviewer #1 (Public review):

      Summary

      This work performed Raman spectral microscopy at the single-cell level for 15 different culture conditions in E. coli. The Raman signature is systematically analyzed and compared with the proteome dataset of the same culture conditions. With a linear model, the authors revealed correspondence between Raman pattern and proteome expression stoichiometry indicating that spectrometry could be used for inferring proteome composition in the future. With both Raman spectra and proteome datasets, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase-dependent genes. Overall, the authors demonstrate a strong and solid data analysis scheme for the joint analysis of Raman and proteome datasets.

      Strengths and major contributions

      (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions.

      (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.

      Weaknesses

      The experimental measurements of Raman microscopy were conducted at the single-cell level; however, the analysis was performed by averaging across the cells. The authors did not discuss whether Raman microscopy can be used to detect cell-to-cell variability under the same condition.

      Discussion and impact on the field

      The Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. It has the advantage that single-cell level data could be acquired and both in vivo and in vitro data can be compared. This work is a strong initiative for introducing the powerful technique to systems biology and providing a rigorous pipeline for future data analysis.

    3. Reviewer #2 (Public review):

      Summary and strengths:

      Kamei et al. observe the Raman spectra of a population of single E. coli cells in diverse growth conditions. Using LDA, Raman spectra for the different growth conditions are separated. Using previously available protein abundance data for these conditions, a linear mapping from Raman spectra in LDA space to protein abundance is derived. Notably, this linear map is condition-independent and is consequently shown to be predictive for held-out growth conditions. This is a significant result and, in my understanding, extends the Raman-to-RNA connection that has been reported earlier.

      They further show that this linear map reveals something akin to bacterial growth laws (à la Scott/Hwa), in that a certain collection of proteins shows stoichiometric conservation, i.e. the group (called SCG - stoichiometrically conserved group) maintains its stoichiometry across conditions while the overall scale depends on the conditions. Analyzing the changes in protein mass and Raman spectra under these conditions, they find that the abundance ratios of information processing proteins (one of the large groups where many proteins belong to "information and storage" - ISP that is also identified as a cluster of orthologous proteins) remain constant. The mass of these proteins, deemed the homeostatic core, increases linearly with growth rate. Other SCGs and other proteins are condition-specific.

      Notably, beyond the ISP COG, the other SCGs were identified directly using the proteome data. Taking the analysis further, they then show how the centrality of a protein - roughly measured as how many proteins it is stoichiometric with - relates to function and evolutionary conservation. These are again significant results, but I am not sure if these ideas have been reported earlier, for example by the community that built protein-protein interaction maps.

      Finally, the paper built a lot of "machinery" to connect \Omega_LE, built directly from proteome, and \Omega_B, built from Raman, spaces. I am unsure how that helps and have not been able to digest the 50 or so pages devoted to this.

      Strengths:

      The rigorous analysis of the data is the real strength of the paper. Alongside this, the discovery of SCGs that are condition-independent and that are condition-dependent provides a great framework.

      Weaknesses:

      Overall, I think it is an exciting advance but some work is needed to present the work in a more accessible way.

    1. eLife Assessment

      This valuable study identifies a protein called adenosine deaminase-related growth factor (ADGF) as a key regulator of tip formation in the slime mold Dictyostelium discoideum. The authors convincingly show that ADGF catalyses the formation of ammonia from adenosine, allowing ammonia to initiate tip formation, and then elucidate pathways upstream and downstream from ADGF. The authors discuss the intriguing possibility that mammalian ADGF may also similarly regulate development.

    2. Reviewer #1 (Public review):

      Summary:

      This work shows that a specific adenosine deaminase protein in Dictyostelium generates the ammonia that is required for tip formation during Dictyostelium development. Cells with an insertion in the ADGF gene aggregate but do not form tips. A remarkable result, shown in several different ways, is that the ADGF mutant can be rescued by exposing the mutant to ammonia gas. The authors also describe other phenotypes of the ADGF mutant such as increased mound size, altered cAMP signaling, and abnormal cell type differentiation. It appears that the ADGF mutant has defects in the expression of a large number of genes, resulting in not only the tip defect but also the mound size, cAMP signaling, and differentiation phenotypes.

      Strengths:

      The data and statistics are excellent.

      Weaknesses:

      The key weakness is understanding why the cells bother to use a diffusible gas like ammonia as a signal to form a tip and continue development. The rescue of the mutant by adding ammonia gas to the entire culture indicates that ammonia conveys no positional information within the mound. By the time the cells have formed a mound, the cells have been starving for several hours, and desperately need to form a fruiting body to disperse some of themselves as spores, and thus need to form a tip no matter what. One can envision that the local ammonia concentration is possibly informing the mound that some minimal number of cells are present (assuming that the ammonia concentration is proportional to the number of cells), but probably even a minuscule fruiting body would be preferable to the cells compared to a mound. This latter idea could be easily explored by examining the fate of the ADGF cells in the mound - do they all form spores? Do some form spores? Or perhaps the ADGF is secreted by only one cell type, and the resulting ammonia tells the mound that for some reason that cell type is not present in the mound, allowing some of the cells to transdifferentiate into the needed cell type. Thus elucidating if all or some cells produce ADGF would greatly strengthen this puzzling story.

    3. Reviewer #2 (Public review):

      Summary:

      The paper describes new insights into the role of adenosine deaminase-related growth factor (ADGF), an enzyme that catalyses the breakdown of adenosine into ammonia and inosine, in tip formation during Dictyostelium development. The ADGF null mutant has a pre-tip mound arrest phenotype, which can be rescued by the external addition of ammonia. Analysis suggests that the phenotype involves changes in cAMP signaling possibly involving a histidine kinase dhkD, but details remain to be resolved.

      Strengths:

      The generation of an ADGF mutant showed a strong mound arrest phenotype and successful rescue by external ammonia. Characterisation of significant changes in cAMP signaling components, suggesting low cAMP signaling in the mutant and identification of the histidine kinase dhkD as a possible component of the transduction pathway. Identification of a change in cell type differentiation towards prestalk fate.

      Weaknesses:

      Lack of details on the developmental time course of ADGF activity and cell type-specific differences in ADGF expression. The absence of measurements to show that ammonia addition to the null mutant can rescue the proposed defects in cAMP signaling. No direct measurements in the dhkD mutant to show that it acts upstream of adgf in the control of changes in cAMP signaling and tip formation.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work shows that a specific adenosine deaminase protein in Dictyostelium generates the ammonia that is required for tip formation during Dictyostelium development. Cells with an insertion in the ADGF gene aggregate but do not form tips. A remarkable result, shown in several different ways, is that the ADGF mutant can be rescued by exposing the mutant to ammonia gas. The authors also describe other phenotypes of the ADGF mutant such as increased mound size, altered cAMP signalling, and abnormal cell type differentiation. It appears that the ADGF mutant has defects in the expression of a large number of genes, resulting in not only the tip defect but also the mound size, cAMP signalling, and differentiation phenotypes.

      Strengths:

      The data and statistics are excellent.

      Weaknesses:

      (1) The key weakness is understanding why the cells bother to use a diffusible gas like ammonia as a signal to form a tip and continue development.

      Diffusion of a gas can affect the signalling process of the entire colony of cells and will be quicker than other signalling mechanisms. A number of findings suggest that ammonia acts as both a local and long-range regulatory signal, integrating environmental and cellular cues to coordinate multicellular development. Ammonia serves as a crucial signalling molecule, influencing both multicellular organization and differentiation in Dictyostelium (Francis, 1964; Bonner et al., 1989; Bradbury and Gross, 1989). By raising the pH of the intracellular acidic vesicles of prestalk cells (Poole and Ohkuma, 1981; Gross et al., 1983) and of the cytoplasm, ammonia is known to increase the speed of chemotaxing amoebae (Siegert and Weijer, 1989; Van Duijn and Inouye, 1991), triggering multicellular movement (Bonner et al., 1988, 1989) to favor tipped mound development. The slug tip is known to release ammonia, while the slime sheath at the back of the slug prevents diffusion, thus maintaining high ammonia levels (Bonner et al., 1989) that promote pre-spore differentiation (Newell et al., 1969). Ammonia has been found to favor slug migration rather than fruiting (Schindler and Sussman, 1977) and thus, tip-derived ammonia may stimulate synchronized development of the entire colony. The tip exhibits negative chemotaxis towards ammonia, potentially directing the slugs away from each other to ensure equal spacing of fruiting bodies (Feit and Sollitto, 1987).

      Ammonia released in pulses acts as a long-distance signalling molecule between colonies of yeast cells, indicating depletion of nutrient resources and promoting synchronous development (Palkova et al., 1997; Palkova and Forstova, 2000). A similar mechanism may be at play to influence neighbouring Dictyostelium colonies. Furthermore, ammonia produced in millimolar concentrations (Schindler and Sussman, 1977) may also ward off predators in soil, as observed in Streptomyces symbionts of leaf-cutting ants, where it inhibits fungal pathogens (Dhodary and Spiteller, 2021). Additionally, ammonia may be recycled into amino acids within starving Dictyostelium cells to support survival and differentiation, as observed in breast cancer cells (Spinelli et al., 2017). Therefore, using a diffusible gas like ammonia as a signalling molecule is likely to have bioenergetic advantages. Ammonia is a natural metabolic byproduct of amino acid catabolism and other cellular processes, making it readily available without requiring additional energy for synthesis. Instead of producing a dedicated signalling molecule, cells can exploit an existing by-product for developmental regulation.

      (2) The rescue of the mutant by adding ammonia gas to the entire culture indicates that ammonia conveys no positional information within the mound.

      Ammonia is known to influence rapid patterning of Dictyostelium cells confined in a restricted environment (Sawai et al., 2002). Both neutral red staining (a marker for prestalk and ALCs) (Fig. S2) and the prestalk marker ecmA/ecmB expression (Fig. 8C) in the adgf mutants suggest that the mounds have differentiated prestalk cells but are blocked in development. The mound arrest phenotype can be reversed by exposing the adgf mutant mounds to ammonia.

      Based on cell cycle phases, there exists a dichotomy of cell types, that biases cell fate to prestalk or prespore (Weeks and Weijer, 1994; Jang and Gomer, 2011). Prestalk cells are enriched in acidic vesicles, and ammonia, by raising the pH of these vesicles and the cytoplasm (Davies et al 1993; Van Duijn and Inouye 1991), plays an active role in collective cell movement (Bonner et al., 1989). Thus, ammonia reinforces or maintains the positional information by elevating cAMP levels, favouring prespore differentiation (Bradbury and Gross, 1989; Riley and Barclay, 1990; Hopper et al., 1993). 

      (3) By the time the cells have formed a mound, the cells have been starving for several hours, and desperately need to form a fruiting body to disperse some of themselves as spores, and thus need to form a tip no matter what.

      When the adgf mutants were exposed to ammonia just after tight mound formation, tips developed within 4 h (Fig. 6). In contrast, adgf mounds not exposed to ammonia remained at the mound stage for at least 30 h. This demonstrates that starvation alone is not sufficient to drive tip development and ammonia serves as a cue that promotes the transition from mound to tipped mound formation. 

      Many mound arrest mutants are blocked in development and do not proceed to form fruiting bodies (Carrin et al., 1994). Furthermore, not all the mound arrest mutants tested in this study were rescued by ADA enzyme (Fig. S3 A), and they continue to stay as mounds without dispersing as spores, suggesting that mound arrest in Dictyostelium can result from multiple underlying defects, whereas ammonia is an important factor controlling transition from mound to tip formation.

      (4) One can envision that the local ammonia concentration is possibly informing the mound that some minimal number of cells are present (assuming that the ammonia concentration is proportional to the number of cells), but probably even a minuscule fruiting body would be preferable to the cells compared to a mound. This latter idea could be easily explored by examining the fate of the ADGF cells in the mound - do they all form spores? Do some form spores?

      Or perhaps the ADGF is secreted by only one cell type, and the resulting ammonia tells the mound that for some reason that cell type is not present in the mound, allowing some of the cells to transdifferentiate into the needed cell type. Thus, elucidating if all or some cells produce ADGF would greatly strengthen this puzzling story.

      A fraction of adgf mounds form bulkier spore heads by the end of 36 h as shown in Fig. 3. This late recovery may be due to the expression of other ADA isoforms. Mixing WT and adgf mutant cell lines results in a slug with the mutants occupying the prestalk region (Fig. 9) suggesting that WT ADGF favours prespore differentiation. However, it is not clear if ADGF is secreted by a particular cell type, as adenosine can be produced by both cell types, and the activity of three other intracellular ADAs may vary between the cell types. To address whether adgf expression is cell type-specific, we will isolate prestalk and prespore cells, and thereafter examine adgf expression in each population.

      ADGF activity is likely to be higher in the tip to remove excess adenosine, the tip-inhibiting molecule (Wang and Schaap, 1985). Moreover, our results show that adgf<sup>-</sup> cells with high adenosine preferentially migrate to the prestalk rather than the prespore region when mixed with WT cells. Ammonia generated from adenosine deamination could thus drive tip development and prespore differentiation.

      Reviewer #2 (Public review):

      Summary:

      The paper describes new insights into the role of adenosine deaminase-related growth factor (ADGF), an enzyme that catalyses the breakdown of adenosine into ammonia and inosine, in tip formation during Dictyostelium development. The ADGF null mutant has a pre-tip mound arrest phenotype, which can be rescued by the external addition of ammonia. Analysis suggests that the phenotype involves changes in cAMP signalling possibly involving a histidine kinase dhkD, but details remain to be resolved.

      Strengths:

      The generation of an ADGF mutant showed a strong mound arrest phenotype and successful rescue by external ammonia. Characterization of significant changes in cAMP signalling components, suggesting low cAMP signalling in the mutant and identification of the histidine kinase dhkD as a possible component of the transduction pathway. Identification of a change in cell type differentiation towards prestalk fate

      Weaknesses:

      (1) Lack of details on the developmental time course of ADGF activity and cell type-specific differences in ADGF expression.

      ADGF expression was examined at 0, 8, 12, and 16 h (Fig. 1), and the total ADA activity was assayed at 12 and 16 h (Fig. 4). As per the reviewer’s suggestion, we have now included the 12 h data (Fig. 4A) to provide additional insights into the kinetics of ADGF activity. The adgf expression was found to be highest at 16 h and hence, the ADA assay was carried out at that time point. However, the ADA assay will not exclusively reflect ADGF activity since it reports the activity of the three other isoforms as well.

      A fraction of adgf<sup>-</sup> mounds form bulkier spore heads by the end of 36 h as shown in Fig. 3. This late recovery may be due to the expression of the other ADA isoforms. Mixing WT and adgf mutant cell lines results in a slug with the mutants occupying the prestalk region (Fig. 9), suggesting that WT adgf favours prespore differentiation.

      However, it’s not clear if ADGF is secreted by a particular cell type, as adenosine can be produced by both cell types, and the activity of the other three intracellular ADAs may vary between the cell types. To address whether adgf expression is cell type-specific, we will isolate prestalk and prespore cells, and thereafter examine adgf expression in each population.

      ADGF activity is likely to be higher in the tip to remove excess adenosine, the tip-inhibiting molecule (Wang and Schaap, 1985). Moreover, our results show that adgf<sup>-</sup> cells with high adenosine preferentially migrate to the prestalk rather than the prespore region when mixed with WT cells.

      (2) The absence of measurements to show that ammonia addition to the null mutant can rescue the proposed defects in cAMP signalling.

      The cAMP levels were measured at two time points 8 h and 12 h in the mutant. The adgf mutant has lower ammonia levels (Fig. 6), diminished acaA expression (Fig. 7) and reduced cAMP levels (Fig. 7) in comparison to WT at both 12 and 16 h of development. Since ammonia is known to increase cAMP levels (Riley and Barclay, 1990; Feit et al., 2001), addition of ammonia to the mutant is likely to increase acaA expression, thereby rescuing the defects in cAMP signalling.

      (3) No direct measurements in the dhkD mutant to show that it acts upstream of adgf in the control of changes in cAMP signalling and tip formation.

      The histidine kinases dhkD and dhkC are reported to modulate phosphodiesterase RegA activity, thereby maintaining cAMP levels (Singleton et al., 1998; Singleton and Xiong, 2013). By activating RegA, dhkD ensures proper cAMP distribution within the mound, which is essential for the patterning of prestalk and prespore cells, as well as for tip formation (Singleton and Xiong, 2013). Therefore, ammonia exposure to dhkD mutants is likely to regulate cAMP signalling and thereby tip formation. We will address this issue by measuring cAMP levels in the dhkD mutant.

    1. eLife Assessment

      This important study by Liu et al. presents a comprehensive structure-function analysis of the presynaptic protein UNC-13, leading to new insights into how its distinct domains control neurotransmitter release. The methods, data, and analyses are convincing, and the genetic and electrophysiological approaches support many of their conclusions. The work will be of interest to neuroscientists studying synaptic transmission, as it provides a foundation for future mechanistic studies of Munc13/UNC-13 family proteins.

    2. Joint Public Review:

      Summary:

      In this manuscript, the authors investigate how different domains of the presynaptic protein UNC-13 regulate synaptic vesicle release in the nematode C. elegans. By generating numerous point mutations and domain deletions, they propose that two membrane-binding domains (C1 and C2B) can exhibit "mutual inhibition," enabling either domain to enhance or restrain transmission depending on its conformation. The authors also explore additional N-terminal regions, suggesting that these domains may modulate both miniature and evoked synaptic responses. From their electrophysiological data, they present a "functional switch" model in which UNC-13 potentially toggles between a basal state and a gain-of-function state, though the physiological basis for this switch remains partly speculative.

      Strengths:

      (1) The authors conduct a thorough exploration of how mutations in the C1, C2B, and other regulatory domains affect synaptic transmission. This includes single, double, and triple mutations, as well as domain truncations, yielding a large, informative dataset.

      (2) The study systematically measures both spontaneous and evoked synaptic currents at neuromuscular junctions, under various experimental conditions (e.g., different Ca²⁺ levels), which strengthens the reliability of their functional conclusions.

      (3) Findings that different domain disruptions produce distinct effects on mEPSCs, mIPSCs, and evoked EPSCs suggest UNC-13 may adopt an elevated functional state to regulate synaptic transmission.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Overview:

      We appreciate all the constructive comments from the reviewer and the reviewing editor, as their suggestions have significantly improved our manuscript. In response to their comments, we have made several key revisions: First, we have performed new colocalization analyses between the active zone marker UNC-10::GFP and all UNC-13L variants (UNC-13L, UNC-13L<sup>HK</sup>, UNC-13L<sup>D1-5N</sup>, and UNC-13L<sup>HK+D1-5N</sup>, all tagged with mApple). These results confirm that the mutations do not affect synaptic localization. Second, we have provided a clearer explanation of the “gain-of-function” term used in this study, emphasizing that it reflects increased SV release due to C1-C2B module dysfunction rather than a single mechanistic state. Third, we have expanded the discussion on the physiological implications of the C1-C2B model, particularly its role in regulating synaptic transmission under varying neuronal activity conditions. Finally, to improve clarity and focus, we have removed unnecessary speculative discussions, ensuring that the revised manuscript centers on the most relevant findings.

      We have reorganized the manuscript to incorporate these new results into the figures and text. Full responses to all reviewer comments are provided below. We hope that the reviewer and the editor find these revisions satisfactory and that our manuscript is now suitable for publication in eLife.

      Joint Public Review:

      Summary:

      In this manuscript, the authors investigate how different domains of the presynaptic protein UNC-13 regulate synaptic vesicle release in the nematode C. elegans. By generating numerous point mutations and domain deletions, they propose that two membrane-binding domains (C1 and C2B) can exhibit "mutual inhibition," enabling either domain to enhance or restrain transmission depending on its conformation. The authors also explore additional N-terminal regions, suggesting that these domains may modulate both miniature and evoked synaptic responses. From their electrophysiological data, they present a "functional switch" model in which UNC-13 potentially toggles between a basal state and a gain-of-function state, though the physiological basis for this switch remains partly speculative.

      Strengths:

      (1) The authors conduct a thorough exploration of how mutations in the C1, C2B, and other regulatory domains affect synaptic transmission. This includes single, double, and triple mutations, as well as domain truncations, yielding a large, informative dataset.

      (2) The study systematically measures both spontaneous and evoked synaptic currents at neuromuscular junctions, under various experimental conditions (e.g., different Ca²⁺ levels), which strengthens the reliability of their functional conclusions.

      (3) Findings that different domain disruptions produce distinct effects on mEPSCs, mIPSCs, and evoked EPSCs suggest UNC-13 may adopt an elevated functional state to regulate synaptic transmission.

      Weaknesses:

      It remains unclear whether the various domain alterations truly converge on a single "gain-of-function" state or instead represent multiple pathways for enhancing UNC-13 activity. Different mutations selectively affect spontaneous or evoked release, suggesting that each variant may not share the same underlying mechanism. Moreover, many conclusions rely on combining domain deletions or point mutations, yet the electrophysiological data show distinct outcomes across EPSCs, IPSCs, mini, and evoked responses. This raises questions about whether these manipulations all act on the same pathway and whether their observed additivity or suppression genuinely reflects a single mechanistic process. A unifying model, or at least a clearer explanation of why the authors infer one mechanistic state across different domain manipulations, would strengthen the paper's conclusions.

      We appreciate the comment and understand the potential confusion regarding the use of the term "gain-of-function" in the manuscript. To clarify, the gain-of-function state described in this study does not refer to a single specific mechanistic change in UNC-13 but rather to a high synaptic vesicle (SV) release state achieved by disrupting the C1-C2B module - either through dysfunction of the C1 domain or the C2B domain (as seen with the HK and DN mutations).

      Our findings support a "seesaw" model in which the C1 and C2B domains maintain a dynamic balance in their interaction with the plasma membrane, binding to DAG and PIP2. This balance may increase the energy barrier for SV release, preventing excessive neurotransmitter release under basal conditions. However, the C1-C2B toggle may be disrupted by high neuronal activity and act in an unbalanced state, thereby enhancing synaptic transmission (i.e., the gain-of-function state). To address these concerns, we have provided a clearer explanation of this functional switch in the revised version of the manuscript (page 27).

      Regarding the differences between spontaneous and evoked neurotransmitter release, our previous studies have revealed that these two forms of release do not always respond similarly to various unc-13 mutations. This is a common phenomenon observed in other synaptic protein mutants, including synaptotagmin, tomosyn, and complexin, which indicates distinct yet partially overlapping regulatory mechanisms. Our model is well supported by most of the electrophysiological results from HK, DN, and HK+DN mutations across different unc-13 isoforms (UNC-13L, UNC-13S, UNC-13R, UNC-13ΔC2A, UNC-13ΔX). The main exception is that in UNC-13ΔX<sup>HK+DN</sup> mutants, the changes in mEPSCs and mIPSCs differ from those observed in evoked EPSCs. This suggests that the mechanisms regulating the functional switch of unc-13 may differ slightly between spontaneous and evoked release. Since the X region of unc-13 and Munc13 remains largely uncharacterized, our findings provide intriguing insights into its potential functional role.

      The manuscript proposes that UNC-13 toggles from a basal to a "gain-of-function" state under normal synaptic activity. However, it does not address when or how this switch might occur in vivo, since it is demonstrated principally via artificial mutations. Providing direct evidence or additional discussion of such switching under physiological conditions would be particularly informative.

      What is the physiological significance of the proposed gain-of-function state? The data suggest that certain mutants (e.g., HK+D1-5N) lacking the gain-of-function state can still support synaptic transmission at wild-type levels. How do the authors reconcile this with the idea that the gain-of-function state plays a critical role at the synapse?

      We appreciate these comments. While our model is mainly based on the dysfunction of the C1-C2B module (through HK and DN mutations), it provides a potential physiological framework for understanding how the structural balance of C1-C2B relates to the variability of synaptic transmission in the nervous system. In the CNS, synaptic transmission is highly variable, and the temporal pattern of the presynaptic activity may require dynamic switching of the fusion machinery, including UNC-13, between different functional modes, thereby triggering synaptic transmission at various levels. Our model suggests that under conditions of high neuronal activity, the C1-C2B module may transition from a balanced to an unbalanced state (gain-of-function state), thereby enhancing synaptic transmission.

      Regarding the physiological significance of the gain-of-function state, we acknowledge that certain mutants (e.g., HK+D1-5N) lacking this state can still support wild-type levels of synaptic transmission. This observation suggests that the gain-of-function state may not be strictly required for baseline synaptic function but rather plays a modulatory role under specific conditions, such as heightened neuronal activity or synaptic plasticity. Further investigations will be needed to determine the precise in vivo triggers and functional consequences of this switch under physiological conditions. Moreover, we will focus on several linker regions (between C1 and C2B, C2B and MUN) to investigate their potential roles in regulating synaptic transmission and their broader functional significance in UNC-13 dynamics.

      The authors determined the fluorescence intensity of mApple-tagged UNC-13 variants (Figure 1J-K and Figure 7J-K), finding no significant changes compared to the wild-type. However, a more detailed analysis of the density or distribution of fluorescent puncta in axons could clarify whether certain mutations alter the localization of UNC-13 at synapses. Demonstrating colocalization with wild-type UNC-13 (or another presynaptic marker) would help rule out mislocalization effects.

      We appreciate the comment. In response, we have included a more detailed analysis of the synaptic localization of both wild-type and mutated UNC-13L in the revised manuscript. Our data show that in all scenarios, UNC-13 proteins exhibit strong colocalization with the active zone marker UNC-10::GFP (Figure 1L). Along with the fluorescence intensity data in Figure 1J, our findings indicate that the C1 and C2B mutations do not affect the expression level or the localization of UNC-13 at synapses. These results have been incorporated into the revised manuscript (page 8) and in Figure 1L.

      The study mainly relies on extrachromosomal transgenes, which can show variable copy numbers and expression levels among individual worm strains. This variability might complicate interpretation, as differences in expression could mask or exaggerate certain phenotypes.

      We agree that the expression levels of synaptic proteins can influence synaptic transmission levels. However, given the large number of mutations and truncations employed in this study, generating single-copy rescue lines for all transgenic strains would be a significant undertaking. On average, we need to microinject 50-100 worms to obtain one single-copy line, whereas injecting only 5-10 worms allows us to generate at least three independent extrachromosomal arrays. Based on our previous work, we found that the synaptic transmission levels are comparable between various extrachromosomal rescue arrays of unc-13 and their single-copy rescue lines (e.g., UNC-13L, UNC-13S, UNC-13R, UNC-13ΔC2A, UNC-13ΔC2B, etc.). In future studies, we aim to use single-copy expression or CRISPR-based methods to induce deletions or mutations in various synaptic proteins.

      Finally, the discussion is somewhat diffuse. Streamlining the text to focus on the most direct connections would help readers pinpoint the key conclusions and open questions.

      We appreciate the comment. As suggested, we have refined the discussion section. Specifically, we have removed the last part of the discussion (Functional roles of the linkers in UNC-13).  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Clarify the "Gain-of-Function" State. Provide stronger justification or explicit discussion of whether all manipulations that enhance SV release truly correspond to the same mechanistic state or if multiple conformational states might be at play.

      The “gain-of-function” state in this manuscript refers to a specific conformational status of UNC-13 that enhances synaptic vesicle (SV) release probability (both spontaneous and evoked) as a result of mutations (HK and DN) in the C1 and C2B domains. This effect is observed across multiple UNC-13 isoforms, including UNC-13L, UNC-13S, and UNC-13R. Prior studies from our group and others have demonstrated that C1 and C2B exhibit conserved functions in regulating synaptic transmission (Li et al., 2019, Cell Reports; Liu et al., 2021, Cell Reports; Michelassi et al., 2017, Neuron), supporting the idea that these domains share a common mechanism for modulating SV release. Given that C1 and C2B act as a functional unit (Michelassi et al., 2017, Neuron; and this study), we define all synaptic states induced by the dysfunction of these two domains as the "gain-of-function" mode.

      However, it is important to note that this classification does not apply to high-release probability states induced by mutations in other domains.

      The concept of a gain-of-function state due to C1 and C2B dysfunction has been previously proposed in studies of Munc13. Basu et al. (2007, Journal of Neuroscience) demonstrated that the H567K mutation in Munc13-1 C1 increases both spontaneous and evoked release probability, leading to a gain-of-function mode. Similarly, work from the Südhof group showed that KW and DN mutations in Munc13-1 C2B also enhance release probability, thereby inducing a gain-of-function state (Shin et al., 2010, Nature Structural & Molecular Biology). Our recent findings further support this idea, showing that UNC-13 C2B D3,4N (Li et al., 2019, Cell Reports; Liu et al., 2021, Cell Reports; Michelassi et al., 2017, Neuron) and the newly identified D1-5N mutation (this study) significantly elevate SV release, consistent with the D1,2N mutations reported by Shin et al.

      Overall, our study integrates and extends previous findings, providing strong evidence that the C1 and C2B domains function as a regulatory switch between a basal physiological mode, a gain-of-function mode (enhanced release), and a loss-of-function mode (impaired release). This framework advances our understanding of how C1 and C2B dysfunction affects synaptic transmission and plasticity.

      (2) Add comparisons to wild-type UNC-13L: When presenting data for deletions/mutants as "controls," include a visual reference (e.g., dashed line in figures) showing wild-type UNC-13L levels. This will help readers see whether each construct is above or below the normal activity baseline.

      As suggested, a dashed line showing the level of UNC-13L has been added to the bar graphs of all evoked EPSCs. The functional switch model is well supported by the results of the evoked EPSCs.

      (3) Mutant and wild-type UNC-13 colocalization analysis: Demonstrating whether each mutant localizes robustly to synapses, in comparison to wild-type UNC-13, would bolster the interpretation of electrophysiological changes. If the authors have these data, adding them would address the possibility of mislocalization.

      We agree with the reviewer that there would be value in addressing the possibility of mislocalization. However, in our experience working with UNC-13 mutant colocalization, we have found that neither deleting the X, C1 and C2B domains in UNC-13L nor deleting the C1 and C2B domains in UNC-13MR or UNC-13R altered the synaptic colocalization with the active zone protein UNC-10/RIM (Li 2019, Liu 2021), suggesting that the C1 and C2B domains in UNC-13 are not involved in the regulation of protein localization. Thus, the mutations in the C1 and C2B domains are unlikely to lead to protein mislocalization in the synaptic region.

      (4) If possible, adding analysis using single-copy transgenes to confirm that extrachromosomal array expression variability does not qualitatively change the conclusions.

      We strongly agree with the reviewer that single-copy transgenes would provide more stable protein expression levels and further consolidate our conclusions. However, several factors give us confidence that the extrachromosomal array rescue approach does not introduce significant variability in our results: First, our prior research has shown that SV release levels are generally comparable between extrachromosomal arrays carrying various unc-13 transgenes and their corresponding single-copy rescue lines (e.g., UNC-13L, UNC-13S, UNC-13R, UNC-13ΔC2A, and UNC-13ΔC2B). Second, the major conclusions in this study are drawn from highly consistent and robust changes in SV release between different rescue lines (e.g., UNC-13L<sup>HK+DN</sup> vs UNC-13L<sup>DN</sup>; UNC-13S<sup>HK+DN</sup> vs UNC-13S<sup>HK</sup> or UNC-13S<sup>DN</sup>). Third, our imaging data indicate that the protein levels are indistinguishable between different unc-13 rescue arrays carrying C1 and C2B mutations, further supporting the validity of our findings.

      Additionally, due to our recent relocation to a new institute, we are still in the process of setting up our microinjection system. Generating single-copy transgenes for all the extrachromosomal arrays used in this study would require significant time. We appreciate the reviewer’s understanding of our current situation. For our future studies regarding unc-13 and other synaptic proteins, we will prefer to use single-copy expression rather than extrachromosomal arrays.

      (5) Reduce the length and speculation in the Discussion. A concise discussion that focuses on the most direct implications of the present findings will help improve the readability of this paper.

      We appreciate the comment. As suggested, we have refined the discussion section. Specifically, the last part of the discussion (Functional roles of the linkers in UNC-13) was removed.

      (6) Minor formatting detail: In Figure 5C (left panel), adjust the y-axis label to ensure it aligns properly and improves clarity.

      We appreciate the reviewer’s suggestion and have adjusted the y-axis label accordingly in the revised version (see revised Figure 5).

    1. eLife Assessment

      This valuable work investigates the social interactions of mice living together in a system of multiple connected cages. It provides solid evidence for a statistical approach capturing changes in social interactions after manipulating prefrontal cortical plasticity. This research will be of broad interest to researchers studying animal social behaviour.

    2. Reviewer #2 (Public review):

      The authors have constructively responded to previous referee comments and I believe that the manuscript is a useful addition to the literature. I particularly appreciate the quantitative approach to social behavior, but have two cautionary comments.

      (1) Conceptually it is important to further justify why this particular maximum entropy model is appropriate. Maximum entropy models have been applied across a dizzying array of biological systems, including genes, neurons, the immune system, as well as animal behavior, so it would seem quite beneficial to explain the particular benefits here, for mouse social behavior as coarse-grained through the Eco-HAB chamber occupancy. This would be an excellent chance to amplify what the models can offer for biological understanding, particularly in the realm of social behavior.

      (2) Maximum entropy models of even intermediate size systems involve a large number of parameters. The authors are transparent about that limitation here, but I still worry that the conclusion of the sufficiency of pairwise interactions is simply not general, and this may also relate to the differences from previous work. If, as the authors suggest in the discussion, this difference is one of a choice of variables, then that point could be emphasized. The suggestion of a follow up study with a smaller number of mice is excellent.

    3. Reviewer #3 (Public review):

      Summary:

      Chen et al. present a thorough statistical analysis of social interactions, more precisely, co-occupying the same chamber in the Eco-HAB measurement system. They also test the effect of manipulating the prelimbic cortex by using TIMP-1 that inhibits the MMP-9 matrix metalloproteinase. They conclude that altering neural plasticity in the prelimbic cortex does not eliminate social interactions, but it strongly impacts social information transmission.

      Strengths:

      The quantitative approach to analyzing social interactions is laudable and the study is interesting. It demonstrates that the Eco-HAB can be used for high throughput, standardized and automated tests of the effects of brain manipulations on social structure in large groups of mice.

      Weaknesses:

      A demonstration of TIMP-1 impairing neural plasticity specifically in the prelimbic cortex of the treated animals would greatly strengthen the biological conclusions. The Eco-HAB provides coarser spatial information compared to some other approaches, which may influence the conclusions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewer and the editor for carefully reading our manuscript, and acknowledging the strength of combining quantitative analysis with semi-naturalistic experiments on mouse social behavior. Please find below our response to both the public review and the recommendations to the authors. In summary, we have included additional figures and text, including:

      - a new Results subsection “Choosing timescales for analysis ” (page 6)

      - a new Materials and Methods subsection “Maximum entropy model with triplet interactions” (page 17)

      - new supplementary figures, which have current labels of:

      - Figure 2 - figure supplement 5

      - Figure 2 - figure supplement 6

      - Figure 2 - figure supplement 7

      - Figure 4 - figure supplement 1

      - Figure 4 - figure supplement 2    

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      In this manuscript, Chen et al. investigate the statistical structure of social interactions among mice living together in the ECO-Hab. They use maximum entropy models (MEM) from statistical physics that include individual preferences and pair-wise interactions among mice to describe their collective behavior. They also use this model to track the evolution of these preferences and interactions across time and in one group of mice injected with TIMP-1, an enzyme regulating synaptic plasticity. The main result is that they can explain group behavior (the probability of being together in one compartment) by a MEM that only includes pair-wise interactions. Moreover, the impact of TIMP-1 is to increase the variance of the couplings J_ij, the preference for the compartment containing food, as well as the dissatisfaction triplet index (DTI). 

      Strengths: 

      The ECO-Hab is a really nice system to ask questions about the sociability of mice and to tease apart sociability from individual preference. Moreover, combining the ECO-Hab with the use of MEM is a powerful and elegant approach that can help statistically characterize complex interactions between groups of mice -- an important question that requires fine quantitative analysis. 

      Weaknesses: 

      However, there is a risk in interpreting these models. In my view, several of the comparisons established in the current study would require finer and more in-depth analysis to be able to establish firmer conclusions (see below). Also, the current study, which closely resembles previous work by Shemesh et al., finds a different result but does not provide the same quantitative model comparison included there, nor a conclusive explanation of why their results are different. In total, I felt that some of the results required more solid statistical testing and that some of the conclusions of the paper were not entirely justified. In particular, the results from TIMP-1 require proper interaction tests (group x drug) which I couldn't find. This is particularly important when the control group has a smaller N than the drug groups.  

      We would like to thank the reviewer and the editor for carefully reading our manuscript, and acknowledging the strength of combining quantitative analysis with semi-naturalistic experiments on mouse social behavior. Thanks to the reviewer’s suggestions, we have improved our manuscript by adding:

      (1) A proper comparison with Shemesh et al., in particular by including maximum entropy models with triplet interactions. We show that triplet models overfit even when given the entire 10-day dataset, which limits our study to pairwise interactions.

      (2) Results on cross-validation for both triplet interaction models and pairwise interaction models, completed on aggregates of various numbers of days. This analysis showed that pairwise models overfit for single-day data, and led us to learn pairwise models only on 5-day aggregates of data. We have updated the manuscript (both the text and the figures) to present these results.

      (3) New results that subsample the drug groups to the same size as the control group. The conclusions about TIMP-1 treated mice hordes hold when we compare groups of the same size. 
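
      The size-matching step in point (3) can be illustrated with a minimal sketch. It assumes a hypothetical data layout (a time-by-mouse array of chamber indices) and synthetic data; the group sizes of 14 and 7 are taken from the reviewer's comment only for illustration, and the co-occupancy statistic stands in for whatever quantity is re-estimated after subsampling. It is not the authors' pipeline.

```python
import numpy as np

def subsample_mice(occupancy, n_keep, rng):
    """Restrict occupancy data to a random subset of n_keep mice.

    occupancy: integer array of shape (T, N); entry [t, i] is the chamber
               index of mouse i in time bin t (hypothetical data layout).
    Returns the reduced array and the indices of the mice that were kept.
    """
    n_mice = occupancy.shape[1]
    if n_keep > n_mice:
        raise ValueError("cannot keep more mice than the cohort contains")
    kept = np.sort(rng.choice(n_mice, size=n_keep, replace=False))
    return occupancy[:, kept], kept

def cooccupancy_rate(occupancy):
    """Fraction of time bins in which each pair of mice shares a chamber."""
    same = occupancy[:, :, None] == occupancy[:, None, :]  # shape (T, N, N)
    return same.mean(axis=0)

rng = np.random.default_rng(0)
# Synthetic stand-in for a drug-treated cohort: 14 mice, 4 chambers.
drug_group = rng.integers(0, 4, size=(1000, 14))
sub, kept = subsample_mice(drug_group, n_keep=7, rng=rng)
rates = cooccupancy_rate(sub)
```

      Repeating the draw many times and re-fitting the model on each subsample would give a distribution of the statistic at the control-group size, which is the comparison the response describes.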

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors): 

      (1) COMPARISON WITH PREVIOUS WORK. The comparison with the cited previous work of Shemesh et al. 2013 detracts from the novelty of using ME models to characterize social interactions between groups of mice, and casts doubt on the main claim of the manuscript, namely that second-order correlations are sufficient to describe the joint distribution of occupancies of all mice (in particular triplets; there is no quantification of the variance explained by the model in Fig. 2D). In my view, to make the claim "These results show that pairwise interaction among mice are sufficient to assess the observed collective behavior", the authors should compare models with 2nd and 3rd order interactions and quantify how much of the total correlation can be explained by pair-wise interactions, triplet interactions, and so on. Without a proper model comparison, it is unclear how the authors can make such a claim. One thing observed by Shemesh et al. is that, on average, J_ij are negative. This does not seem to be the case in the current study and the authors should discuss why. 

      Finally, the explanations provided in the Discussion about this discrepancy (spatial resolution and different group size) are not completely satisfactory. With more animals, one would imagine that the impact of higher order correlations would increase (and not decrease) as the number of terms of 3rd, 4th, ... order will be very big. I would also think that the same could be true for the spatial scale: assessing interactions with a coarser spatial grid (whole cages in the case of the ECO-Hab) would allow for simultaneous interactions among more mice to happen compared with a situation in which the spatial grid is so small that only a few animals can fit in each subdivision. 

      We thank the reviewer for the recommendation. In the updated version of the manuscript, we explicitly learn the triplet interaction model. We show that because the number of mice in our experiment is much larger than Shemesh et al., a triplet model runs into the problem of overfitting.

      In particular, we found that the test set likelihood increases monotonically when the L2 regularization strength increases, which corresponds to a suppression of the triplet interaction strength (see additional supplementary figure, now Figure 2 - figure supplement 5). More specifically, for the range of regularization strength (β<sub>G</sub>) we tested (10<sup>-1</sup> < β<sub>G</sub> < 10<sup>1</sup>), the maximum test set likelihood is achieved at β<sub>G</sub> = 10<sup>1</sup>, which corresponds to . Notice that those learned triplet interactions are very close to zero. This means we should select a model with pairwise interactions over a model with triplet interactions.
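
      The selection logic described above (scan the regularization strength, keep the value with the highest test-set likelihood, and note whether that value sits at the edge of the scanned range) can be sketched as follows. The function name and the likelihood values are hypothetical and synthetic; they are not results from the manuscript.

```python
import numpy as np

def select_regularization(betas, test_loglik):
    """Pick the regularization strength with the highest held-out likelihood.

    Also reports whether the optimum lies at the strongest regularization
    tested, which indicates that the data favour suppressing the
    regularized terms rather than fitting them.
    """
    betas = np.asarray(betas, dtype=float)
    test_loglik = np.asarray(test_loglik, dtype=float)
    order = np.argsort(betas)
    betas, test_loglik = betas[order], test_loglik[order]
    best = int(np.argmax(test_loglik))
    return {
        "best_beta": float(betas[best]),
        "at_upper_edge": best == len(betas) - 1,
        "monotone_increasing": bool(np.all(np.diff(test_loglik) > 0)),
    }

# Synthetic example: held-out log-likelihood rising with regularization.
betas = np.logspace(-1, 1, 5)                     # 0.1 ... 10
test_loglik = [-10.40, -10.10, -9.90, -9.80, -9.75]
choice = select_regularization(betas, test_loglik)
```

      When the optimum lies at the upper edge and the curve is monotone, as in this synthetic case, the cross-validation offers no support for the extra interaction terms, which is the argument the response makes for preferring the pairwise model.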

      We have added the above reasoning on page 5, lines 166-169 of the Results section with the sentence “Moreover, models with triplet interactions show signs of overfitting under cross-validation, which is mitigated when the triplet interactions are suppressed close to zero using L2 regularization”, and a new subsection “Maximum entropy model with triplet interactions” in Materials and Methods (pages 16-17, lines 548-563) to describe the protocols of learning and cross-validation for these triplet interaction models. 

      Furthermore, we extended the discussion about the difference between Shemesh et al. and our results in the Discussion section. In addition to the difference of spatial scales (chamber vs. location in the chamber), and the difference of group size and its impact on data analysis (N = 15 in our largest cohort and N = 4 in theirs), we added a discussion about the difference of experimental arena, which in Eco-HAB contains connected chambers that mimic the naturalistic environment, and in Shemesh et al. contains a single chamber. The change in the text is on page 12, between line 390 and line 394.

      We thank the reviewers for pointing out that the mean 2nd order interaction in Shemesh et al. is negative. One possibility is that the labeled areas in Shemesh et al. are much smaller than in our Eco-HAB setup, which could suggest that mice do not have the space to stay in the same area, which will lead to a negative mean 2nd order interaction.

      (2) ASSESSMENT OF THE TEMPORAL EVOLUTION OF THE INTERACTIONS. The analysis of the stability of the social structure is not conclusive. First, I don't think the authors can conclude that "These results suggest that the structure of social interactions in a cohort as a whole is consistent across all days." If anything is preserved, they would be the statistics of that structure but not the structure itself (i.e., there is no evidence for that). The comparison of the stability of the mean <h_i> and the mean <J_ik> would also require a statistical test to be able to state that "Delta h_i changed more strongly from day to day (Fig. 3D, top panel) relative to the interaction measured as the Jij's." The same is true for the assessment of the TIMP: the differences found in the variability in J_ij and in the mean and variance of the h_i's, look noisy and would require a proper statistical test. The traces look quite variable across days in the control condition, so assessing differences may be difficult. Finally, it would be good to know if the variability in individual J_ij is because they truly vary from day to day or because estimating them within one day is difficult (statistical error). If the reason is the latter, one could decrease the temporal resolution to 2-3 days and see whether the estimated J_ijs are more stable. Perhaps, also for that reason, the summed interaction strength J_i is also more stable, simply because it aggregates more data and has a smaller statistical error. 

      We thank the reviewer for pointing out the necessity of assessing the temporal evolution of the interactions. The problem that shorter data durations lead to noisier estimates, together with the reviewer’s Comment 4 about the risk of overfitting, led us to add a new Results subsection “Choosing timescales for analysis” (page 6, line 171 to line 189). Specifically, we assess whether the pairwise maximum entropy model overfits using data from K-day aggregates, by computing the log-likelihood of both the training sets and the test sets, which are chosen to be 1 hour from the 6-hour data window of each day. We found that for single-day data, the pairwise maximum entropy model overfits. In contrast, for aggregates of 4 or more days of data, the pairwise model does not overfit. This new result is supported by an additional supplementary figure, now Figure 2 - figure supplement 6.
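
      To make the train-versus-test comparison concrete, the following sketch evaluates the mean log-likelihood of a pairwise model by exact enumeration, which is only feasible for a very small number of mice. The parameterization (chamber preferences `h[i, c]` and couplings `J[i, j]` that apply when two mice share a chamber) is an assumption made for illustration, the data are synthetic, and the fitting step is omitted.

```python
import itertools
import numpy as np

def log_weights(states, h, J):
    """Unnormalized log-probability of each state under a pairwise model.

    states: (S, N) integer array of chamber indices.
    h:      (N, C) chamber preferences (assumed parameterization).
    J:      (N, N) symmetric couplings with zero diagonal; J[i, j]
            contributes when mice i and j occupy the same chamber.
    """
    n = states.shape[1]
    field = h[np.arange(n), states].sum(axis=1)
    same = states[:, :, None] == states[:, None, :]
    pair = 0.5 * (same * J).sum(axis=(1, 2))  # each pair appears twice
    return field + pair

def mean_log_likelihood(data, h, J, n_chambers):
    """Mean log-likelihood per sample, normalizing over all C**N states."""
    n = data.shape[1]
    all_states = np.array(list(itertools.product(range(n_chambers), repeat=n)))
    log_z = np.logaddexp.reduce(log_weights(all_states, h, J))
    return float(log_weights(data, h, J).mean() - log_z)

rng = np.random.default_rng(1)
n_mice, n_chambers = 3, 4
data = rng.integers(0, n_chambers, size=(600, n_mice))  # synthetic occupancy
train, test = data[:500], data[500:]                    # hold out a block

h = np.zeros((n_mice, n_chambers))   # placeholder parameters; a real
J = np.zeros((n_mice, n_mice))       # analysis would fit these on `train`
ll_train = mean_log_likelihood(train, h, J, n_chambers)
ll_test = mean_log_likelihood(test, h, J, n_chambers)
```

      A training likelihood that keeps rising while the test likelihood falls is the signature of overfitting that the response reports for single-day data.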

      To be consistent with later approaches in the manuscript where we consider the effects of TIMP1, we choose the analysis windows to be data aggregates from 5 days. This means that for the experiment that collects a total of 10 days of data, there are only two time points, so the study of the temporal evolution is limited to a comparison between the first 5 days and the last 5 days of the experiment. We describe these results in the Results subsection “Stability of sociability over time” (page 6, line 190 - 220). An additional supplementary figure, now Figure 2 - figure supplement 7, shows in detail the comparison of the inferred interaction strength J and the chamber preference between the first 5 days and the last 5 days for the 4 cohorts of male C57BL6/J mice. It shows that the inferred interactions have a consistent variability across the first and last 5 days, and across all cohorts. The small value of Pearson’s correlation coefficient shows that the exact structure (pair-specific J<sub>ij</sub>) is not stable. At the end of the Results subsection “Stability of sociability over time”, we explicitly say that “This implies that the maximum entropy model does not infer a social structure that is stable over time.”

      (3) EFFECT OF TIMP-1. The reported effects of TIMP-1 on the variance of the J_ij seem very small and possibly caused by a few outlier J_ijs (perhaps from one or two animals) which are not present in the control group which seems to have fewer animals (N = 9 minus two mice that died after the surgery vs. N = 14 in the drug group), so the lack of a significant difference in the sigma[J_ij] could simply be due to a smaller N (a test for the interaction group x drug was not done). 

      The clearest effect of TIMP-1 seems to be a change in place preference (h_i) and not the interaction terms (J_ij) (Fig. 3F bottom). But this could be explained by a number of factors that have nothing to do with sociability such as that recovery from surgery makes them eat more/less. The fact that it seems to be present, as recognized by the authors, in the control group with no TIMP-1 and that this effect was not observed in the female group F1, puts into question the specificity and reproducibility of the result. 

      Finally, the effect of TIMP-1 in the DTI would require more statistics (testing the interaction group x drug). The fact that the control group has fewer animals (N = 9 vs. 15 and 13 in the drug groups), and that there is a weaker trend in the DTI of the control group to start high and then decrease, makes this test necessary.  

      Having selected an appropriate timescale for learning the pairwise maximum entropy model, we have updated the manuscript to present results only on 5-day aggregates of data (see updated Figure 3, updated supplementary figures, Figure 3 - figure supplement 1 and 2). For the variance of the J<sub>ij</sub>, the F-test between different 5-day aggregates before and after TIMP for the male drug group now gives a nonsignificant p-value after applying the Bonferroni correction. For the female drug group, the difference in the J<sub>ij</sub> variance is still significant. 

      To test the effect of different group sizes on DTI, we subsampled the drug groups by 1) subsampling the inferred interactions learned from the original N = 15 or N = 13 data, or 2) subsampling the mouse colocalization data and then inferring the pairwise interactions. In both cases, the resulting DTI for the subsampled drug group still exhibits the same global pattern as before, i.e. after TIMP-1 injection, DTI significantly increases and then, after 5 days, falls back to the baseline level. The results are supported by two additional supplementary figures, Figure 4 - figure supplement 1 and 2. This result is referred to in the text in the Results subsection “Impaired neuronal plasticity in the PL affects the structure of social interactions” (page 10, line 333 - 336): “Notably, the difference of the DTI is not due to the control group M4 has less mice, as subsampling both on the level of the inferred interactions (Figure 4 - figure supplement 1) and on the level of the mice locations (Figure 4 - figure supplement 2) give the same DTI for cohorts M1 and F1.”
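
      The first subsampling strategy (restricting the already inferred interactions to a random subset of mice) can be sketched as follows. This is only an illustration under stated assumptions: the coupling matrix is filled with random placeholder values rather than fitted ones, the target size of 9 is a hypothetical choice, and the function name `subsample_interactions` is invented here. The DTI itself is not computed.

```python
import random


def subsample_interactions(J, n_keep, rng):
    """Keep the couplings among a random subset of n_keep mice.

    J is a symmetric N x N list of lists of inferred couplings J_ij.
    Returns the sorted indices of the kept mice and the reduced matrix.
    """
    kept = sorted(rng.sample(range(len(J)), n_keep))
    return kept, [[J[a][b] for b in kept] for a in kept]


# Placeholder symmetric coupling matrix for a cohort of 15 mice
# (random numbers for illustration, not inferred values).
rng = random.Random(0)
N = 15
J = [[0.0] * N for _ in range(N)]
for a in range(N):
    for b in range(a + 1, N):
        J[a][b] = J[b][a] = rng.gauss(0.0, 1.0)

# Hypothetical target size chosen to mimic a smaller cohort.
kept, J_sub = subsample_interactions(J, n_keep=9, rng=rng)
```

      Repeating the draw many times and recomputing the statistic of interest on each reduced matrix would give its distribution at the smaller group size.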

      (4) MODEL COMPARISON. Any quantitative measure of "goodness" of the model (i.e., comparison of the predictions of the model with triplet frequency as well as the distribution of p(K)) should be cross-validated. In particular, Fig. S2 needs to be cross-validated for the goodness of fit to be properly quantified. Is the analysis shown in Fig. 3F cross-validated? Because otherwise, there is an expected increase in the likelihood simply explained by an increase in the number of parameters of the model (i.e., adding the J_ij's). 

      As discussed in our responses to Comments 1 and 2, we have added results about cross-validation in the new supplementary figures, Figure 2 - figure supplement 5 and 6, for which we computed the test-set and training-set likelihood for maximum entropy models with pairwise interactions and also for models with triplet interactions. Figure 2 - figure supplement 6 shows that the pairwise model does not overfit when we consider aggregated data from 4 or more days. 

      (5) EFFECT OF SLEEP. The comparison of p(K) between the data and the model requires a bit more investigation: the model underestimates instances in which almost all mice were in the same compartment (i.e., for K >= 13. p(K)_data >> p(K)_MEM; btw where is the pairwise point p(15) in Fig. 2E and Fig. S4?). Could this be because there were still short periods during the dark cycle in which all mice were asleep in one of the cages? As explained by the authors, sleep introduces very strong higher order correlations between animals as they like sleeping altogether. Knowing whether removing light periods was enough to remove this "sleep contamination" or not, would be important in order to interpret discrepancies between the pairwise model and the data. 

      Figure 2E shows that the pairwise maximum entropy model (in black) overestimates, rather than underestimates, the data (in blue circles) for P(K) at large K. In the data, we never observe all 15 mice being in the same box; hence P<sub>data</sub>(15) = 0, and it does not show up in the log-scaled figure (the same holds for Figure 2 - figure supplement 3). A possible explanation for the pairwise model overestimating P(K) at large K is that the finite-sized box limits the total number of mice that can comfortably stay in the same box. It may also be that the number of time points at which K >= 13 is small, so that the empirical P(K) is underestimated due to finite data. We have added this interpretation of the discrepancy of P(K) to Section “Pairwise interaction model explains the statistics of social behavior” on page 6, line 160. 

      We thank the Reviewer for raising the point of “sleep contamination”. Indeed, Eco-HAB data, as do data from other 24h-testing behavioral systems, demonstrate distinct differences in activity levels during the light and dark phases of the light-dark cycle (Rydzanicz et al., EMBO Mol. Med., 2024). During the light phases, mice primarily sleep and, as noted, they huddle, so many individuals within the cohort tend to remain in close proximity for extended periods. We acknowledge that including such periods in the analysis could potentially introduce confounding effects to the model due to limited movement and interactions, and this is why we decided not to use this data. However, during the dark phases, mice are highly active, with individuals rarely staying in the same compartment for long periods. Specifically, in the dark phases, while there are occasional instances where a few mice may remain in the same compartment for over 1 hour, the majority exhibit considerable mobility, actively exploring and transitioning between compartments. We see no compelling reason to exclude these periods from our analysis, as such activity aligns with the natural behavioral repertoire of the mice and provides robust data for our model. Furthermore, it is well-established that mammals, including nocturnal species such as mice, are most active shortly after waking, typically at the onset of their active phase (i.e., the beginning of the dark phase). To ensure a conservative approach, we specifically analyzed the first 6 hours of the dark phase when the cumulative number of box visits is at its peak, indicating heightened activity levels. In our view, this period offers an optimal window for studying natural behaviors, including social interactions.

      Additionally, prior studies using the Eco-HAB system have consistently demonstrated that mice engage in social interactions both within the compartments and in the connecting tubes during the dark phase (Puścian et al., eLife, 2016, Winiarski et al. in press). Given this evidence and the observed behavioral dynamics in our data, the likelihood of mice being asleep during the analyzed periods of the dark phase is very low.

      We hope this clarification addresses the reviewer’s concerns and highlights the rationale underpinning our analysis choices. Thank you for raising this important point, which allowed us to provide additional context for our approach.

      (6) COMPARTMENT PREFERENCES. The differences between p(K) across compartments also would require a bit more attention: if a MEM with non-spatially dependent pair-wise interactions shows differences across compartments, it must be because of the h_{i,r} terms, which contain a compartment index, right? Wouldn't this imply that the independence model, which always underrepresents data events with large K, already contains the difference in goodness of fit between compartments (1, 3) and (2, 4)? In the plots, it does not look like the goodness of the independent model depends on the compartment (the authors could compare directly the models' predictions between compartments). Moreover, when looking at Fig. 2C, it does not look like the value of h_{i,r} in compartments (1,3) is higher than in (2,4) (if anything, it would be the other way around). How can this be explained? It would be good to know if the difference across compartments comes from differences in the empirical p(K) or in the models' prediction? If the difference is in the data p(K), could it be that the compartments 2-4 showing higher p(K=15) (i.e., larger difference with the pairwise MEM prediction) are those chosen by mice to sleep during the light cycle? If not, what could explain these differences across compartments? Could the presence of food and water explain this difference? 

      The reviewer is correct: in the pairwise MEM, the differences across compartments enter through the box preference h<sub>ir</sub>. A greater h<sub>ir</sub> means compartment r is more attractive to mouse i. Because boxes 2 and 4 contain food and water, we expect that mice are more attracted to boxes 2 and 4, and this is what we see in Figure 2C, bottom subpanels. To reduce the number of parameters to look at, we introduce an index Δh<sub>i</sub> = h<sub>i2</sub> + h<sub>i4</sub> - h<sub>i1</sub> - h<sub>i3</sub>. This index Δh<sub>i</sub> is found to be mostly positive (see updated Figure 3C), which makes sense because mice are attracted to food and water. 
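
      As a minimal sketch of how the index Δh<sub>i</sub> defined above is evaluated, the snippet below applies the formula to two mice. The preference values are hypothetical numbers chosen for illustration, not fitted parameters, and the function name is invented here.

```python
def food_water_preference_index(h_i):
    """Return the index h_i2 + h_i4 - h_i1 - h_i3 for one mouse.

    h_i holds the four compartment preferences (h_i1, h_i2, h_i3, h_i4).
    A positive value means the mouse favors compartments 2 and 4.
    """
    h1, h2, h3, h4 = h_i
    return h2 + h4 - h1 - h3


# Hypothetical preferences for two mice (illustrative, not fitted values).
h_example = [
    [0.1, 0.5, -0.2, 0.4],
    [0.3, 0.2, 0.1, 0.0],
]
delta_h = [food_water_preference_index(h_i) for h_i in h_example]
```

      With these made-up numbers the first mouse has a positive index (it favors the food and water compartments) and the second a negative one.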

      Next we analyze the difference of P(K) across compartments (Figure 2 - figure supplement 3). There is already a difference in the P(K) calculated from empirical data. For example, P(K) in compartment 2 has a maximum at K = 5 while P(K) in compartment 1 has a maximum at K = 3.

      One interesting observation is that it seems from Figure 2 - figure supplement 3 that the pairwise model explains P(K) in compartment 1 and compartment 3 better than in compartment 2 and in compartment 4. In compartment 2 and 4, the pairwise MEM overestimates P(K) for large K. An alternative MEM could include compartment-specific interaction strength, but it will also introduce 315 new parameters for a mice cohort with size N = 15.
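
      One way to arrive at the figure of 315 is sketched below. It assumes that the alternative model replaces each shared coupling J<sub>ij</sub> with one coupling per compartment, with four compartments; this reading is an assumption made for illustration, and the function name is invented here.

```python
from math import comb


def coupling_parameter_counts(n_mice, n_compartments=4):
    """Count pairwise couplings with and without compartment dependence.

    Returns (shared, compartment_specific, additional), where `additional`
    is the number of new parameters introduced by making every coupling
    compartment-specific (an assumed model extension).
    """
    n_pairs = comb(n_mice, 2)                     # one J_ij per pair of mice
    compartment_specific = n_pairs * n_compartments
    return n_pairs, compartment_specific, compartment_specific - n_pairs


shared, specific, additional = coupling_parameter_counts(15)
```

      For N = 15 this gives 105 shared couplings, 420 compartment-specific couplings, and therefore 315 additional parameters.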

      MINOR

      (1) A more quantitative comparison between in-cohort sociability and couplings J_ij as well as mean rates and parameters h_i is required. The matrices in Fig. 2C do look similar. So it is not clear how the comparison between these values is contributing to characterizing the correlation structure of the data. 

      The comparison between in-cohort sociability and the couplings J<sub>ij</sub> is given in Figure 2 - figure supplement 2. The key point, that the model with the learned J<sub>ij</sub> reproduces the in-cohort sociability, is shown in Figure 2 - figure supplement 1.

      (2) Analysis of "in-state" probability is not explained. To me, it wasn't obvious what Fig. S5 is showing. I was assuming that this analysis was comparing the prediction of the MEM about the position of each animal at each time point, given its preference (h), pairwise interactions (J_ij), and the position of all other animals and the true position of the animal. But it seems like it is comparing the shape of the distribution of this prob across time between the data and the model (I guess the data had to be temporally binned in coarser temporal periods to yield prob values other than 0s and 1s). Also, not clear whether this analysis was done for each compartment separately and then averaged. This needs explanation. 

      The in-state probability analysis compares the MEM's prediction of the position of each animal at each time point, given its preference (h), pairwise interactions (J<sub>ij</sub>), and the positions of all other animals, with the true position of the animal. To obtain probability values between 0 and 1, we bin the data temporally according to the model-predicted in-state probability. 
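
      The comparison can be sketched as follows. The sketch assumes a Potts-like form in which the probability that mouse i is in compartment r, given the positions of the other mice, is proportional to exp(h<sub>ir</sub> + Σ<sub>j</sub> J<sub>ij</sub> δ(σ<sub>j</sub>, r)). This functional form, the function names, and the binning scheme are illustrative assumptions rather than the exact procedure used in the manuscript.

```python
import math


def in_state_probability(h, J, positions, i):
    """Conditional probability of each compartment for mouse i.

    h[i][r]      : preference of mouse i for compartment r
    J[i][j]      : symmetric pairwise coupling between mice i and j
    positions[j] : 0-based compartment currently occupied by mouse j
    """
    fields = []
    for r in range(len(h[i])):
        # Sum the couplings to all other mice currently in compartment r.
        social = sum(J[i][j] for j, pos in enumerate(positions)
                     if j != i and pos == r)
        fields.append(h[i][r] + social)
    shift = max(fields)  # subtract the maximum for numerical stability
    weights = [math.exp(f - shift) for f in fields]
    total = sum(weights)
    return [w / total for w in weights]


def bin_by_prediction(predicted, observed, n_bins=10):
    """Group time points by predicted probability.

    predicted : model probabilities that a mouse is in a given compartment
    observed  : 1 if the mouse was actually there at that time point, else 0
    Returns (mean predicted probability, observed frequency) per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(predicted, observed):
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((p, o))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(o for _, o in members) / len(members)
            curve.append((mean_p, freq))
    return curve
```

      If the model describes the data well, the observed frequency in each bin should lie close to the mean predicted probability of that bin.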

      We have added the explanation of in-state probability on page 6, line 163-166. We have also improved the description of in-state probability in Materials and Methods (subsection “Comparing in-state probability between model prediction and data”, line 493 - 503, page 15), and added a pointer from the main text to it. 

      (3) Looks like Fig. S3 is not cited in the text. 

      We added a pointer to Fig. S3 (now Figure 2 - figure supplement 2) in line 154. 

      (4) The authors say that "TIMP-1 release from the TIMP-1-loaded nanoparticles diminishes after 5 days." Does that mean from the day of the injection (4-5 days before the "After Day 1") or five days after reintroduced in the ECO-Hab? 

      It means five days after the mice were re-introduced in the ECO-Hab. We have updated the text in Results/Effects of impairing neuronal plasticity in the PL on subterritory preferences and sociability (the end of the first paragraph of this subsection) to 

      “The choice of five-day aggregated data for analysis is in line both with the proper timescales needed for the pairwise maximum entropy model to not overfit, and with the literature that TIMP-1 release from the TIMP-1-loaded nanoparticles is stable for 7-10 days after injection (Chaturvedi et al., 2014)  (i.e. 2-5 days after the mice are reintroduced to Eco-HAB).” (line 272 - 276, page 9)

      (5) In Methods, the authors should report the final N of each of the three groups. 

      The final N of each group is reported in Table 1 (page 13). In the updated version, we have added a pointer to Table 1 in Materials and Methods/Animals, and in Materials and Methods/Exclude inactive and dead mice from analysis. We have also expanded the caption of Table 1 to clarify the difference between final N and initial N, and added a pointer to Materials and Methods/Exclude inactive and dead mice from analysis.

    1. eLife Assessment

      The study presents valuable insights into the role of periosteal stem cells in bone marrow regeneration. The evidence is convincing. The data broadly support the authors' claims and are in line with state-of-the-art methodology. Future studies on their model will help to strengthen their discovery further.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript under review investigates the role of periosteal stem cells (P-SSC) in bone marrow regeneration using a whole bone subcutaneous transplantation model. While the model is somewhat artificial, the findings were interesting, suggesting the migration of periosteal stem cells into the bone marrow and their potential to become bone marrow stromal cells. This indicates a significant plasticity of P-SSC consistent with previous reports using fracture models (Cell Stem Cell 29:1547, Dev Cell 59:1192).

      Major comments from previous round of review:

      (1) The authors assert that the periosteal layer was completely removed in their model, which is crucial for their conclusions. To substantiate this claim, it is recommended that the authors provide evidence of the successful removal of the entire periosteal stem cell (P-SSC) population. A colony-forming assay, with and without periosteal removal, could serve as a suitable method to demonstrate this.

      (2) The observation that P-SSCs do not express Kitl or Cxcl12, while their bone marrow stromal cell (BM-MSC) derivatives do, is a key finding. To strengthen this conclusion, the authors are encouraged to repeat the experiment using Cxcl12 or Scf reporter alleles. Immunofluorescence staining that confirms the migration of periosteal cells and their transformation into Cxcl12- or Scf-reporter-positive cells would significantly enhance the paper's key conclusion.

      (3) On page 8, line 20, the authors' statement regarding the detection of Periostin+ cells outside the periosteum layer could be misinterpreted due to the use of the periostin antibody. Given that periostin is an extracellular matrix protein, the staining may not accurately represent Periostin-expressing cells but rather the presence of periostin in the extracellular matrix. The authors should revise this section for greater precision.

      Comments on revised version:

      My comments from the previous round of review have mostly been addressed.

    3. Reviewer #2 (Public review):

      Summary:

      The authors have established a femur graft model that allows the study of hematopoietic regeneration following transplantation. They have extensively characterized this model, demonstrating the loss of hematopoietic cells from the donor femur following transplantation, with recovery of hematopoiesis from recipient cells. They also show evidence that BM MSCs present in the graft following transplantation are graft-derived. They have utilized this model to show that following transplantation, periosteal cells respond by first expanding, then giving rise to more periosteal SSCs, then migrating into the marrow to give rise to BM MSCs.

      Strengths:

      These studies are notable in several ways: 1) establishment of a novel femur graft model for the study of hematopoiesis; 2) Use of lineage tracing and surgery models to demonstrate that periosteal cells can give rise to BM MSCs.

      Weaknesses:

      There are a few weaknesses. First, the authors do not definitively demonstrate the requirement of periosteal SSC movement into the BM cavity for hematopoietic recovery. Hematopoiesis recovers significantly before 5 months, even before significant P-SSC movement has been shown, and hematopoiesis recovers significantly even when periosteum has been stripped. Second, it is not clear how the periosteum is changing in the grafts. Which cells are expanding is unclear, and it is not clear if these cells have already adopted a more MSC-like phenotype prior to entering the marrow space. Indeed, given the presence of host-derived endothelial cells in the BM, these studies are reminiscent of prior studies from this group and others that re-endothelialization of the marrow may be much more important for determining hematopoietic regeneration, rather than the P-SSC migration. Third, the studies exploring the preferential depletion of BM MSCs vs P-SSCs are difficult to interpret. The single metabolic stress condition chosen was not well-justified, and the use of purified cell populations to study response to stress ex vivo may have introduced artifacts into the system.

      Comments on the current version: The authors have addressed my concerns adequately.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The manuscript under review investigates the role of periosteal stem cells (P-SSC) in bone marrow regeneration using a whole-bone subcutaneous transplantation model. While the model is somewhat artificial, the findings were interesting, suggesting the migration of periosteal stem cells into the bone marrow and their potential to become bone marrow stromal cells. This indicates a significant plasticity of P-SSC consistent with previous reports using fracture models (Cell Stem Cell 29:1547, Dev Cell 59:1192).

      Major Concerns

      (1) The authors assert that the periosteal layer was completely removed in their model, which is crucial for their conclusions. To substantiate this claim, it is recommended that the authors provide evidence of the successful removal of the entire periosteal stem cell (P-SSC) population. A colony-forming assay, with and without periosteal removal, could serve as a suitable method to demonstrate this.

      We are grateful to the reviewer for this valuable suggestion. The objective of this experiment was to demonstrate that periosteal ablation impairs bone marrow regeneration, a finding that is supported by our results. We expect that ablation of the periosteum would be associated with only a partial decrease in CFU-F activity, given the presence of MSCs in the bone and in the endosteal region of the bone marrow. Therefore, CFU-F assays would be difficult to interpret in this setting. In view of the phenotype obtained, providing proof of concept of the importance of the periosteum, we do not believe that further experiments would strengthen the level of proof of this experiment.

      (2) The observation that P-SSCs do not express Kitl or Cxcl12, while their bone marrow stromal cell (BM-MSC) derivatives do, is a key finding. To strengthen this conclusion, the authors are encouraged to repeat the experiment using Cxcl12 or Scf reporter alleles. Immunofluorescence staining that confirms the migration of periosteal cells and their transformation into Cxcl12- or Scf-reporter-positive cells would significantly enhance the paper's key conclusion.

      Transplantation of periosteum isolated from Cxcl12 or Scf reporter mice into WT bones is an excellent suggestion. Indeed, this experiment would confirm (1) the migration of periosteal SSCs and (2) the expression of Cxcl12 and Scf by BM-MSCs derived from the periosteum. However, it should be noted that the current limitations in terms of available resources preclude the execution of these experiments. Moreover, the use of PostnCre<sup>ER</sup>;Tmt mice represents the optimal approach for tracking and specifically isolating BM-MSCs derived from the periosteum. The expression of Cxcl12 and Scf by BM-MSCs derived from the periosteum has been demonstrated in 2 distinct experimental models (Figures 5 and 6).

      (3) On page 8, line 20, the authors' statement regarding the detection of Periostin+ cells outside the periosteum layer could be misinterpreted due to the use of the periostin antibody. Given that periostin is an extracellular matrix protein, the staining may not accurately represent Periostin-expressing cells but rather the presence of periostin in the extracellular matrix. The authors should revise this section for greater precision.

      We acknowledge and appreciate the reviewer's attention to detail. This is, in fact, an error. Nestin-GFP-positive periosteal SSCs are seen within the periosteum, which is marked by an anti-periostin antibody labeling the extracellular matrix of the periosteum. The manuscript has been revised to address this inaccuracy on page 9, lines 8-9.

      Reviewer #2 (Public review):

      Summary:

      The authors have established a femur graft model that allows the study of hematopoietic regeneration following transplantation. They have extensively characterized this model, demonstrating the loss of hematopoietic cells from the donor femur following transplantation, with recovery of hematopoiesis from recipient cells. They also show evidence that BM MSCs present in the graft following transplantation are graft-derived. They have utilized this model to show that following transplantation, periosteal cells respond by first expanding, then giving rise to more periosteal SSCs, and then migrating into the marrow to give rise to BM MSCs.

      Strengths:

      These studies are notable in several ways:

      (1) Establishment of a novel femur graft model for the study of hematopoiesis;

      (2) Use of lineage tracing and surgery models to demonstrate that periosteal cells can give rise to BM MSCs.

      We thank the reviewer for noting the novelty of our manuscript.

      Weaknesses:

      There are a few weaknesses. First, the authors do not definitively demonstrate the requirement of periosteal SSC movement into the BM cavity for hematopoietic recovery. Hematopoiesis recovers significantly before 5 months, even before significant P-SSC movement has been shown, and hematopoiesis recovers significantly even when periosteum has been stripped.

      This is an important point. Notably, we can see expansion of P-SSCs by day 8 after femur transplantation and evidence of periosteum-derived SSCs in the bone marrow by day 15, before we can detect any significant hematopoietic recovery (see Figure 3A-C).

      Second, it is not clear how the periosteum is changing in the grafts. Which cells are expanding is unclear, and it is not clear if these cells have already adopted a more MSC-like phenotype prior to entering the marrow space.

      This is an interesting question. To examine early changes in gene expression in periosteal SSCs in grafted femurs, we performed additional RNA sequencing on host periosteal SSCs, on periosteal SSCs from grafted femurs at an earlier time point (3 days after femur transplantation), and on host bone marrow MSCs (see new Supplementary Figure S5 A-C). At this time point the three cell populations are already distinct on the PCA plot (Figure S5A), and there is downregulation of some periosteal genes in the graft P-SSCs (Figure S5B). However, we do not yet see upregulation of Kitl or Cxcl12 or most other BM MSC genes in graft P-SSCs at this time point (Figure S5B). Furthermore, gene set enrichment analysis (GSEA) revealed upregulation of cell cycle, DNA replication and mismatch repair gene signatures, and downregulation of multiple gene signatures compared to host P-SSCs (Figure S5C). Therefore, we conclude that P-SSCs already adopt some gene expression changes early after femur transplantation, but have not yet fully differentiated into BM MSCs at this early time point. This experiment is now discussed on p.10 of the revised manuscript.

      Indeed, given the presence of host-derived endothelial cells in the BM, these studies are reminiscent of prior studies from this group and others that re-endothelialization of the marrow may be much more important for determining hematopoietic regeneration, rather than the P-SSC migration.

      Indeed, as previously shown by our group and others, we agree that endothelial regeneration and re-endothelialization may also play an important role in this bone marrow regeneration model. It is noteworthy that this model has the potential to serve as a valuable tool for analyzing the origin of BM endothelial cells during regeneration processes. To further illustrate the endothelial regeneration, additional images of bone sections from VE-cadherin-cre;TdTomato grafted femurs at 15 days, one month, and five months post-transplantation have been included in the new Figure S3. These images reveal extensive vascularization of the graft and proximity of UBC-GFP+ donor-derived vessels to VE-cadherin+ host-derived blood vessels in the bone marrow within one month (see Figure S2C). This observation is consistent with the timing of both BM MSC recovery and HSC recovery in the grafts, thereby suggesting the importance of endothelial recovery (see Fig. 1B). A new discussion of these findings has been included on page 6 of the revised manuscript and on page 16 in the discussion section.

      Third, the studies exploring the preferential depletion of BM MSCs vs P-SSCs are difficult to interpret. The single metabolic stress condition chosen was not well-justified, and the use of purified cell populations to study response to stress ex vivo may have introduced artifacts into the system.

      We chose to focus on hypoxia as the main condition in which to analyze the stress response of P-SSCs vs BM MSCs because we reasoned that due to the location of P-SSCs on the outside of the bone, these cells would be exposed to a higher oxygen tension than BM-MSCs, which are located within the bone marrow. Therefore, we wanted to determine whether this exposure to a different oxygen tension would be sufficient to explain the different properties of P-SSCs and BM MSCs. We modified the text on p.11 of the manuscript to explain the rationale for this experiment better.

      Reviewer #3 (Public review):

      Summary:

      Marchand, Akinnola, et al. describe the use of a novel model to study BM regeneration. Here, they harvest intact femurs and subcutaneously graft them into recipient mice. Similar to standard BM regeneration models, there is a rapid decrease in cellularity followed by a gradual recovery over 5 months within the grafts. At 5 months, these grafts have robust HSC activity, similar to HSCs isolated from the host femur. They find that periosteum skeletal stem cells (p-SSCs) are the primary source of BM-MSCs within the grafted femur and that these cells are more resistant to the acute stress of grafting the femur.

      Strengths:

      This is an interesting manuscript that describes a novel model to study BM regeneration. The model has tremendous promise.

      We thank the reviewer for highlighting the novelty and potential of our work.

      Weaknesses:

      The authors claim that grafting intact femurs subcutaneously is a model of BM regeneration and can be used as a replacement for gold standard BM regeneration assays such as sublethal chemo/irradiation. However, there isn't enough explanation as to how this model is equivalent or superior to the traditional models. For instance, the authors claim that this model allows for the study of "BM regeneration in vivo in response to acute injury using genetic tools." This can and has been done numerous times with established, physiologically relevant BM regeneration models. The onus is on the authors to discuss or perform the necessary experiments to justify the use of this model. For example, standard BM regeneration models involve systemic damage that is akin to therapies that require BM regeneration. How is studying the current model that provides only an acute injury more relevant and useful than other models? As it stands, it seems as if the authors could have done all the experiments demonstrating the importance of these p-SSCs in the traditional myelosuppressive BM regeneration models to be more physiologically relevant. Along these lines, the use of a standard BM regeneration model (e.g., sublethal chemo/irradiation) as a critical control is missing and should be included. Even if the control doesn't demonstrate that p-SSCs can contribute to the BM-MSC during regeneration, it will still be important because it could be the justification for using the described model to specifically study p-SSCs' regulation of BM regeneration.

      We appreciate the reviewer raising this important point. We never intended this femur transplantation model of bone marrow injury to replace more established models, such as chemotherapy or irradiation. In fact, we compared the effects of femur transplantation to localized bone irradiation on P-SSCs using our Periostin-Cre;Td-Tomato lineage tracing model. We found that irradiation does not induce the same migration of Tomato+ P-SSCs from the periosteum to the bone marrow cavity that femur transplantation does, and so cannot be used to demonstrate the plasticity of P-SSCs in the same way (see new Supplementary Figure S7D-E). Therefore, femur transplantation appears to be a more severe form of bone marrow injury and is not similar to other, more established assays of bone marrow injury. We also added this discussion to the revised manuscript on p.14 and in the discussion section on p.17.

      The authors perform some analysis that suggests that grafting a whole femur mimics BM regeneration, but there are many experiments missing from the manuscript that will be necessary to support the use of this model. To demonstrate that this new model mimics current BM regeneration models, the authors need to perform a careful examination of the early kinetics of hematopoietic recovery post-transplant. Complete blood counts should be performed on the grafts, focusing on white blood cells (particularly neutrophils), red blood cells, platelets, all critical indicators of BM regeneration. This analysis should be done at early time points that include weekly analysis for a minimum of 28 days following the graft. Additionally, understanding how and when the vasculature recovers is critical. This is particularly important because it is well-established that if there is a delay in vascular recovery, there is a delay in hematopoietic recovery. As mentioned above, a standard BM regeneration model should be used as a control.

      We concur with the reviewer that hematopoietic recovery is a pivotal aspect of this model. We conducted a time-course analysis of bone marrow and HSC cellularity from day 0 to month 5 post-transplantation (Figure 1B). Furthermore, we evaluated the HSC capacities through bone marrow transplantation from grafted or host femurs (Figures 1D and 1E) and quantified the various hematopoietic cells in the graft after five months (Supplemental Figure 1). In addition, hematopoiesis occurring in the transplanted bone was comprehensively evaluated in another article, currently in revision and available on bioRxiv (Takeishi, S., Marchand, T., Koba, W. R., Borger, D. K., Xu, C., Guha, C., Bergman, A., Frenette, P. S., Gritsman, K., & Steidl, U. (2023). Haematopoietic stem cell numbers are not solely determined by niche availability. bioRxiv: the preprint server for biology, 2023.10.28.564559. https://doi.org/10.1101/2023.10.28.564559). We did not use another assay of bone marrow regeneration as a “control”, since we do not expect to see similar plasticity of periosteal SSCs in these models, such as with the localized irradiation model described in the new Figure S7D-E.

      We agree with the reviewer that endothelial recovery is also likely to be very important for hematopoietic recovery in this model, but this was not the focus of this manuscript. The process of endothelial recovery is likely to be more complex than that of MSC recovery, as our findings indicate that the graft endothelium can arise from both the host and the graft femur (see Fig.2D). Consequently, further investigation into the mechanisms of endothelial recovery and its contribution to hematopoiesis in this experimental system will be an interesting focus of future work. We believe that this bone transplantation model represents a valuable tool for addressing questions regarding the origin and regeneration mechanisms of bone marrow endothelial cells.

      The contribution of donor and host cells to the BM regeneration of the graft is interesting. Particularly, the chimerism of the vasculature. One can assume that for the graft to undergo BM regeneration, there needs to be the delivery of nutrients into the graft via the vasculature. The chimerism of the vascular network suggests that host endothelial cells anastomose with the graft. Host mice should have their vascular system labeled with a dye such as dextran to determine if anastomosis has occurred. If not, the authors need to explain how this graft survives up to 5 months. If anastomosis does occur, then it is very surprising that the hematopoietic system of the graft is not a chimera because this would essentially be a parabiosis model. This needs to be explained.

      We have included additional images of bone sections from VE-cadherin-cre;tdTomato grafted femurs at 15 days, one month, and five months post transplantation in the new Figure S3. These images show extensive vascularization of the graft and proximity of UBC-GFP+ donor-derived vessels to VE-cadherin+ host-derived blood vessels in the bone marrow within one month, suggesting a potential anastomosis (Figure S2C). However, it is not surprising that hematopoiesis arises exclusively from the host, as we observed complete death of the hematopoietic cells and BM MSCs in the graft femur within the first 3 days of femur transplantation (see Figure S1A), and we do not see any significant hematopoietic recovery in the grafts until at least 2 months (see Fig.1B). Therefore, this is not similar to a parabiosis model, as confirmed by our chimerism studies shown in Figure 2D. In addition, these data are consistent with the results reported with the use of ossicles (doi:10.1038/nature09262; DOI 10.1016/j.cell.2007.08.025; doi:10.1038/nature07547).

      Most of the data presented for the resistance of p-SSCs to stress suggests DNA damage response. Do p-SSCs demonstrate a higher ability to resolve DNA damage? Do they accumulate less DNA damage? Staining for DNA damage foci or performing comet assays could be done to further define the mechanism of stress resistance properties of p-SSCs.

      This is an interesting question. In our RNA sequencing analysis of graft P-SSCs compared with host P-SSCs we did observe an upregulation of mismatch repair gene signatures by gene set enrichment analysis (GSEA) (new Figure S5C). Therefore, it is possible that P-SSCs do have an altered DNA damage response. However, we are unable to investigate this further at this time.

      Given the importance of BM-MSCs in hematopoiesis and that the majority of the emerging BM-MSCs appear to be derived from p-SSCs, the authors should perform experiments to determine if p-SSC-derived BM-MSCs are critical regulators of BM regeneration. For example, the authors could test this by crossing the Postn-creER mice with iDTR mice to ablate these cells and see if recovery is inhibited or delayed. This should be done with the described periosteum-wrapped femur graft model as well as a control BM regeneration model. Demonstrating that the deletion of these cells affects BM regeneration in both models would further justify the physiological relevance and utility of the femur graft model.

      We thank the reviewer for this excellent suggestion, and we agree that this is an important experiment. However, our attempts to ablate Postn+ cells using the iDTA system were limited by technical difficulties, which we are unable to address at this time.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure 2C, the vascular network staining appears to be duplicated, suggesting a possible error in image capture. The authors should replace this image with a different field or an alternative picture to avoid confusion.

      We thank the reviewer for noting this accidental duplication due to an image stitching problem. Figure 2C was replaced by a different image from the same experiment.

      (2) For consistency and clarity, a scale bar should be included in Figure S3E to indicate that the magnification factors of the respective visual fields are identical.

      We thank the reviewer for highlighting this point. The magnification used has been added in the revised Figure.

      (3) In Figure S5B, the difference in normalized Opn mRNA expression relative to Gapdh between steady-state BM-MSCs and P-SSCs seems substantial, which contradicts the "ns" (not significant) label. The authors should verify the accuracy of this labeling.

      We agree with the reviewer that this difference in what is now Figure S6B looks substantial. However, we confirmed that this difference is not statistically significant, likely due to the high variability between replicates in Opn expression in the steady state BM MSCs.

      Reviewer #2 (Recommendations for the authors):

      In order to strengthen the argument that P-SSCs are necessary for hematopoietic recovery, the authors should consider providing the following data:

      (1) In the periosteal stripping experiments, the authors should show if periosteum-derived MSCs are present in the BM throughout the process of hematopoietic recovery (not just at the end of the experiment). If none are present at the end, that would mean that periosteum is not required for hematopoietic recovery, but would still suggest that it is required for optimal hematopoietic recovery. At early time points, it would also be very helpful to demonstrate the composition and amount of endothelium present in the marrow to determine if P-SSC migration and differentiation into MSCs depends on endothelial reconstitution.

      To further examine the vascularization of the transplanted femur at an earlier time point, we have added additional images of grafted femur from VE-cadherin-cre;tdTomato at 15 days and one month post transplantation in the new Figure S3A and S3B. These images already show extensive vascularization of the graft periosteum stained with an anti-periostin antibody. In addition, we observed anastomoses of host VE-cadherin;Tmt+ blood vessels with graft ubc-GFP+ blood vessels in the grafted periosteum within one month (Figure S3C).

      (2) Studies of the surgical periosteum grafts could benefit from histologic analysis of the BM and its MSC components at earlier time points following grafting since the data provided are only at 5 months. Such studies would allow a better appreciation of the relationship between P-SSC migration into the marrow and hematopoietic recovery.

      We have performed histologic analysis of grafted femurs at multiple early time points, which shows expansion of P-SSCs and their migration into the bone marrow cavity (Figure 3C).

      (3) Studies of stress responses preferably should be performed using intact bone and should characterize P-SSC and BM MSC apoptosis, cell cycle status, differentiation, etc, immediately following shifts to the stress conditions. These studies would be more compelling if performed using additional "stress" conditions likely to represent the graft environment.

      This is an interesting suggestion. However, these types of studies would not be possible in intact bones ex vivo, as P-SSCs are known to migrate out of the bone in culture.

    1. eLife Assessment

      This useful study introduces a deep learning-based algorithm that tracks animal postures with reduced drift by incorporating transformers for more robust keypoint detection. The efficacy of this new algorithm for single-animal pose estimation was demonstrated through comparisons with two popular algorithms. The strength of evidence is solid but would benefit from consideration of issues in multi-animal tracking. This work will be of interest to those interested in animal behavior tracking.

    2. Reviewer #2 (Public review):

      Summary:

      The authors present a new model for animal pose estimation. The core feature they highlight is the model's stability compared to existing models in terms of keypoint drift. The authors test this model across a range of new and existing datasets. The authors also test the model with two mice in the same arena. For the single animal datasets the authors show a decrease in sudden jumps in keypoint detection and the number of undetected keypoints compared with DeepLabCut and SLEAP. Overall average accuracy, as measured by root mean squared error, generally shows similar but sometimes superior performance to DeepLabCut and better performance compared to SLEAP. The authors confusingly don't quantify the performance of pose estimation in the multi (two) animal case instead focusing on detecting individual identity. This multi-animal model is not compared with the model performance of the multi-animal mode of DeepLabCut or SLEAP.

      Strengths:

      The major strength of the paper is successfully demonstrating a model that is less likely to have incorrect large keypoint jumps compared to existing methods. As noted in the paper, this should lead to easier-to-interpret descriptions of pose and behavior to use in the context of a range of biological experimental workflows.

      Weaknesses:

      There are two main types of weaknesses in this paper. The first is a tendency to make unsubstantiated claims that suggest either model performance that is untested or misrepresents the presented data, or suggest excessively large gaps in current SOTA capabilities. One obvious example is in the abstract when the authors state ADPT "significantly outperforms the existing deep-learning methods, such as DeepLabCut, SLEAP, and DeepPoseKit." All tests in the rest of the paper, however, only discuss performance with DeepLabCut and SLEAP, not DeepPoseKit. At this point, there are many animal pose estimation models so it's fine they didn't compare against DeepPoseKit, but they shouldn't act like they did. Similar odd presentation of results are statements like "Our method exhibited an impressive prediction speed of 90 ± 4 frames per second (fps), faster than DeepLabCut (44 ± 2 fps) and equivalent to SLEAP (106 ± 4 fps)." Why is 90 ± 4 fps considered "equivalent to SLEAP (106 ± 4 fps)" and not slower? I agree they are similar but they are not the same. The paper's point of view of what is "equivalent" changes when describing how "On the single-fly dataset, ADPT excelled with an average mAP of 92.83%, surpassing both DeepLabCut and SLEAP (Figure 5B)". When one looks at Figure 5B, however, ADPT and DeepLabCut look identical. Beyond this, oddly only ADPT has uncertainty bars (no mention of what uncertainty is being quantified) and in fact, the bars overlap with the values corresponding to SLEAP and DeepPoseKit.

      In terms of making claims that seem to stretch the gaps in the current state of the field, the paper makes some seemingly odd and uncited statements like "Concerns about the safety of deep learning have largely limited the application of deep learning-based tools in behavioral analysis and slowed down the development of ethology" and "So far, deep learning pose estimation has not achieved the reliability of classical kinematic gait analysis" without specifying which classical gait analysis is being referred to. Certainly, existing tools like DeepLabCut and SLEAP are already widely cited and used for research.
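
      The speed figures quoted in the first weakness above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the "±" values are symmetric spreads around the mean (the review does not say whether they are standard deviations, standard errors, or something else), and the helper names are hypothetical.

```python
def interval(mean: float, spread: float) -> tuple:
    """Return the (low, high) range implied by a 'mean ± spread' report."""
    return (mean - spread, mean + spread)


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Values as quoted in the review (frames per second).
adpt = interval(90, 4)    # 86 to 94
dlc = interval(44, 2)     # 42 to 46
sleap = interval(106, 4)  # 102 to 110

# Relative speed deficit of ADPT with respect to SLEAP, in percent.
deficit_vs_sleap = (106 - 90) / 106 * 100

print("ADPT vs SLEAP ranges overlap:", intervals_overlap(adpt, sleap))
print("ADPT vs DeepLabCut ranges overlap:", intervals_overlap(adpt, dlc))
print(f"ADPT is about {deficit_vs_sleap:.0f}% slower than SLEAP")
```

      Under this reading the ADPT and SLEAP ranges do not overlap, which is consistent with the reviewer's point that the two speeds are similar but not "equivalent".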

      The other main weakness in the paper is the validation of the multi-animal pose estimation. The core point of the paper is pose estimation and anti-drift performance and yet there is no validation of either of these things relating to multi-animal video. All that is quantified is the ability to track individual identity with a relatively limited dataset of 10 mice IDs with only two in the same arena (and see note about train and validation splits below). While individual tracking is an important task, that literature is not engaged with (i.e. papers like Walter and Couzin, eLife, 2021: https://doi.org/10.7554/eLife.64000) and the results in this paper aren't novel compared to that field's state of the art. On the other hand, while multi-animal pose estimation is also an important problem the paper doesn't engage with those results either. The two methods already used for comparison in the paper, SLEAP and DeepPoseKit, already have multi-animal modes and multi-animal annotated datasets but none of that is tested or engaged with in the paper. The paper notes many existing approaches are two-step methods, but, for practitioners, the difference is not enough to warrant a lack of comparison. The authors state that "The evaluation of our social tracking capability was performed by visualizing the predicted video data (see supplement Videos 3 and 4)." While the authors report success maintaining mouse ID, when one actually watches the key points in the video of the two mice (only a single minute was used for validation) the pose estimation is relatively poor with tails rarely being detected and many pose issues when the mice get close to each other.

      Finally, particularly in the methods section, there were a number of places where what was actually done wasn't clear. For example, in describing the network architecture, the authors say "Subsequently, network separately process these features in three branches, compute features at scale of one-fourth, one-eight and one-sixteenth, and generate one-eight scale features using convolution layer or deconvolution layer." Does only the one-eight branch have deconvolution or do the other branches also? Similarly, for the speed test, the authors say "Here we evaluate the inference speed of ADPT. We compared it with DeepLabCut and SLEAP on mouse videos at 1288 x 964 resolution", but in the methods section they say "The image inputs of ADPT were resized to a size that can be trained on the computer. For mouse images, it was reduced to half of the original size." Were different image sizes used for training and validation? Or did ADPT not use 1288 x 964 resolution images as input, which would obviously have major implications for the speed comparison? Similarly, for the individual ID experiments, the authors say "In this experiment, we used videos featuring different identified mice, allocating 80% of the data for model training and the remaining 20% for accuracy validation." Were frames from each video randomly assigned to the training or validation sets? Frames from the same video are very correlated (two frames could be just 1/30th of a second different from each other), and so if training and validation frames are interspersed with each other validation performance doesn't indicate much about performance on more realistic use cases (i.e. using models trained during the first part of an experiment to maintain ids throughout the rest of it.)
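
      The train/validation concern can be made concrete with a small sketch. It is not the authors' procedure: the video names, frame counts, and function names are hypothetical, and it only contrasts a frame-level random 80/20 split with a split that trains on the early part of each video and validates on the late part.

```python
import random
from typing import Dict, List, Tuple

Frame = Tuple[str, int]  # (video name, frame index)


def random_split(frames_per_video: Dict[str, int], train_fraction: float = 0.8,
                 seed: int = 0) -> Tuple[List[Frame], List[Frame]]:
    """Shuffle all frames together, then cut: neighbouring frames get interspersed."""
    frames = [(video, i) for video, n in frames_per_video.items() for i in range(n)]
    random.Random(seed).shuffle(frames)
    cut = int(len(frames) * train_fraction)
    return frames[:cut], frames[cut:]


def blocked_split(frames_per_video: Dict[str, int],
                  train_fraction: float = 0.8) -> Tuple[List[Frame], List[Frame]]:
    """Train on the early part of each video and validate on the late part."""
    train: List[Frame] = []
    val: List[Frame] = []
    for video, n in frames_per_video.items():
        cut = int(n * train_fraction)
        train += [(video, i) for i in range(cut)]
        val += [(video, i) for i in range(cut, n)]
    return train, val


def leaky_validation_frames(train: List[Frame], val: List[Frame]) -> int:
    """Count validation frames whose immediate temporal neighbour is a training frame."""
    train_set = set(train)
    return sum(1 for video, i in val
               if (video, i - 1) in train_set or (video, i + 1) in train_set)


videos = {"mouse_a": 100, "mouse_b": 50}  # hypothetical frame counts

for name, (train, val) in {"random": random_split(videos),
                           "blocked": blocked_split(videos)}.items():
    print(name, "validation frames adjacent to a training frame:",
          leaky_validation_frames(train, val))
```

      With the blocked split, at most one validation frame per video sits next to a training frame, whereas a frame-level random split leaves most validation frames with a near-identical training neighbour.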

      Editors' note: None of the original reviewers responded to our request to re-review the manuscript. The attached assessment statement is the editor's best attempt at assessing the extent to which the authors addressed the outstanding concerns from the previous round of revisions.

    3. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study introduces a useful deep learning-based algorithm that tracks animal postures with reduced drift by incorporating transformers for more robust keypoint detection. The efficacy of this new algorithm for single-animal pose estimation was demonstrated through comparisons with two popular algorithms. However, the analysis is incomplete and would benefit from comparisons with other state-of-the-art methods and consideration of multi-animal tracking.

      First, we would like to express our gratitude to the eLife editors and reviewers for their thorough evaluation of our manuscript. ADPT aims to improve the accuracy of body point detection and tracking in animal behavior, facilitating more refined behavioral analyses. The insights provided by the reviewers have greatly enhanced the quality of our work, and we have addressed their comments point-by-point.

      In this revision, we have included additional quantitative comparisons of multi-animal tracking capabilities between ADPT and other state-of-the-art methods. Specifically, we have added evaluations involving homecage social mice and marmosets to comprehensively showcase ADPT’s advantages from various perspectives. This additional analysis will help readers better understand how ADPT effectively overcomes point drift and expands its applicability in the field.

      Reviewer #1:

      In this paper, the authors introduce a new deep learning-based algorithm for tracking animal poses, especially in minimizing drift effects. The algorithm's performance was validated by comparing it with two other popular algorithms, DeepLabCut and LEAP. The accessibility of this tool for biological research is not clearly addressed, despite its potential usefulness. Researchers in biology often have limited expertise in deep learning training, deployment, and prediction. A detailed, step-by-step user guide is crucial, especially for applications in biological studies.

      We appreciate the reviewers' acknowledgment of our work. While ADPT demonstrates superior performance compared to DeepLabCut and SLEAP, we recognize that the absence of a user-friendly interface may hinder its broader application, particularly for users with a background solely in biology. In this revision, we have enhanced the command-line version of the user tutorial to provide a clear, step-by-step guide. Additionally, we have developed a simple graphical user interface (GUI) to further support users who may not have expertise in deep learning, thereby making ADPT more accessible for biological research.

      The proposed algorithm focuses on tracking and is compared with DLC and LEAP, which are more adept at detection rather than tracking.

      In the field of animal pose estimation, the distinction between detection and tracking is often blurred. For instance, the title of the paper "SLEAP: A deep learning system for multi-animal pose tracking" refers to "tracking," while "detection" is characterized as "pose estimation" in the body text. Similarly, "Multi-animal pose estimation, identification, and tracking with DeepLabCut" uses "tracking" in the title, yet "detection" is also mentioned in the pose estimation section. We acknowledge that referencing these articles may have contributed to potential confusion.

      To address this, we have clarified the distinction between "tracking" and "detection" in the Results section under "Anti-drift pose tracker" (see lines 118-119). In this paper, we now explicitly use “track” to refer to the tracking of all body points or poses of an individual, and “detect” for specific keypoints.

      Reviewer #1 recommendations:

      (1) DLC and LEAP are mainly good in detection, not tracking. The authors should compare their ADPT algorithm with idtracker.ai, ByteTrack, and other advanced tracking algorithms, including recent track-anything algorithms.

      (2) DeepPoseKit is outdated and no longer maintained; a comparison with the T-REX algorithm would be more appropriate.

      We appreciate the reviewer's suggestion for a more comprehensive comparison and acknowledge the importance of including these advanced tracking algorithms. However, we have not yet found suitable publicly available datasets for such comparative testing. We appreciate this insight and will consider incorporating T-REX into future comparisons.

      (3) The authors primarily compared their performance using custom data. A systematic comparison with published data, such as the dataset reported in the paper "Multi-animal pose estimation, identification, and tracking with DeepLabCut," is necessary. A detailed comparison of the performances between ADPT and DLC is required.

      In the previous version of our manuscript, we included the SLEAP single-fly public dataset and the OMS_dataset from OpenMonkeyStudio for performance comparisons. We recognize that these datasets were not comprehensive. In this revision, we have added the marmoset dataset from "Multi-animal pose estimation, identification, and tracking with DeepLabCut" and a customized homecage social mice dataset to enhance our comparative analysis of multi-animal pose estimation performance. Our comprehensive comparison reveals that ADPT outperforms both DLC and SLEAP, as discussed in the Results section under "ADPT can be adapted for end-to-end pose estimation and identification of freely social animals" (Figure 1, see lines 303-332).

      (4) Given the focus on biological studies, an easy-to-use interface and introduction are essential.

      In this revision, we have not only developed a GUI for ADPT but also included a more detailed tutorial. This can be accessed at https://github.com/tangguoling/ADPT-TOOLBOX

      Reviewer #2:

      The authors present a new model for animal pose estimation. The core feature they highlight is the model's stability compared to existing models in terms of keypoint drift. The authors test this model across a range of new and existing datasets. The authors also test the model with two mice in the same arena. For the single animal datasets the authors show a decrease in sudden jumps in keypoint detection and the number of undetected keypoints compared with DeepLabCut and SLEAP. Overall average accuracy, as measured by root mean squared error, generally shows similar but sometimes superior performance to DeepLabCut and better performance compared to SLEAP. The authors confusingly don't quantify the performance of pose estimation in the multi (two) animal case instead focusing on detecting individual identity. This multi-animal model is not compared with the model performance of the multi-animal mode of DeepLabCut or SLEAP.

      We appreciate the reviewer's thoughtful assessment of our manuscript. Our study focuses on addressing the issue of keypoint drift prevalent in animal pose estimation methods like DeepLabCut and SLEAP. During the model design process, we discovered that the structure of our model also enhances performance in identifying multiple animals. Consequently, we included some results related to multi-animal identity recognition in our manuscript.

      We are currently working to broaden the applicability of ADPT for multi-animal pose estimation and identity recognition. Given that our manuscript emphasizes pose estimation, we have added a comparison of anti-drift performance in multi-animal scenarios in this revision. This quantifies ADPT's capability to mitigate drift in multi-animal pose estimation.

      Using our custom Homecage social mice dataset, we compared ADPT with DeepLabCut and SLEAP. The results indicate that ADPT achieves more accurate anti-drift pose estimation for two mice, with superior keypoint detection accuracy. Furthermore, we also evaluated pose estimation accuracy on the publicly available marmoset dataset, where ADPT outperformed both DeepLabCut and SLEAP. These findings are discussed in the Results section under "ADPT can be adapted for end-to-end pose estimation and identification of freely social animals."

      The first is a tendency to make unsubstantiated claims that suggest either model performance that is untested or misrepresents the presented data, or suggest excessively large gaps in current SOTA capabilities. One obvious example is in the abstract when the authors state ADPT "significantly outperforms the existing deep-learning methods, such as DeepLabCut, SLEAP, and DeepPoseKit." All tests in the rest of the paper, however, only discuss performance with DeepLabCut and SLEAP, not DeepPoseKit. At this point, there are many animal pose estimation models so it's fine they didn't compare against DeepPoseKit, but they shouldn't act like they did.

      We appreciate the reviewer's feedback regarding unsubstantiated claims in our manuscript. Upon careful review, we acknowledge that our previous revisions inadvertently included statements that may misrepresent our model's performance. In particular, we have revised the abstract to eliminate the mention of DeepPoseKit, as our comparisons focused exclusively on DeepLabCut and SLEAP.

      In addition to this correction, we have thoroughly reviewed the entire manuscript to address other instances of ambiguity and ensure that our claims are well-supported by the data presented. Thank you for bringing this to our attention; we are committed to maintaining the integrity of our claims throughout the paper.

      In terms of making claims that seem to stretch the gaps in the current state of the field, the paper makes some seemingly odd and uncited statements like "Concerns about the safety of deep learning have largely limited the application of deep learning-based tools in behavioral analysis and slowed down the development of ethology" and "So far, deep learning pose estimation has not achieved the reliability of classical kinematic gait analysis" without specifying which classical gait analysis is being referred to. Certainly, existing tools like DeepLabCut and SLEAP are already widely cited and used for research.

      In this revision, we have carefully reviewed the entire manuscript and addressed the instances of seemingly odd and unsubstantiated claims. Specifically, we have changed the wording "largely limited" to "limited" to ensure accuracy and clarity. Additionally, we thoroughly reviewed the citation list to ensure proper attribution, incorporating references such as "A deep learning-based toolbox for Automated Limb Motion Analysis (ALMA) in murine models of neurological disorders" to better substantiate our claims and provide a clearer context.

      We have also added an additional section to comprehensively discuss the applications of widely-used tools like DeepLabCut and SLEAP in behavioral research. This new section elaborates on the challenges and limitations researchers encounter when applying these methods, highlighting both their significant contributions and the areas where improvements are still needed.

      The other main weakness in the paper is the validation of the multi-animal pose estimation. The core point of the paper is pose estimation and anti-drift performance and yet there is no validation of either of these things relating to multi-animal video. All that is quantified is the ability to track individual identity with a relatively limited dataset of 10 mice IDs with only two in the same arena (and see note about train and validation splits below). While individual tracking is an important task, that literature is not engaged with (i.e. papers like Walter and Couzin, eLife, 2021: https://doi.org/10.7554/eLife.64000) and the results in this paper aren't novel compared to that field's state of the art. On the other hand, while multi-animal pose estimation is also an important problem the paper doesn't engage with those results either. The two methods already used for comparison in the paper, SLEAP and DeepPoseKit, already have multi-animal models and multi-animal annotated datasets but none of that is tested or engaged with in the paper. The paper notes many existing approaches are two-step methods, but, for practitioners, the difference is not enough to warrant a lack of comparison.

      We appreciate the reviewer's insights regarding the validation of multi-animal pose estimation in our paper. While our primary focus has been on pose estimation and anti-drift performance, we recognize the importance of validating these aspects within the context of multi-animal videos.

      In this revision, we have included a comparison of ADPT's anti-drift performance in multi-animal pose estimation, utilizing our custom Homecage social mouse dataset (Figure 1A). Our findings indicate that ADPT achieves more accurate pose estimation for two mice while significantly reducing keypoint drift, outperforming both DeepLabCut and SLEAP (see lines 311-322). We trained each model three times, and this figure presents the results from one of those training sessions. We calculated the average RMSE between predictions and manual labels: ADPT achieved an average RMSE of 15.8 ± 0.59 pixels, while DeepLabCut (DLC) and SLEAP recorded RMSEs of 113.19 ± 42.75 pixels and 94.76 ± 1.95 pixels, respectively (Figure 1C). ADPT achieved an accuracy of 6.35 ± 0.14 pixels based on the DLC evaluation metric across all body parts of the mice, while DLC reached 7.49 ± 0.2 pixels (Figure 1D). ADPT achieved 8.33 ± 0.19 pixels using the SLEAP evaluation metric across all body parts of the mice, compared to SLEAP’s 9.82 ± 0.57 pixels (Figure 1E).
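
      As an illustration of how figures of this kind are typically computed, the sketch below calculates a per-run RMSE between predicted and manually labelled keypoints and then summarizes several training runs as mean and spread. It rests on assumptions not stated in the response: RMSE is taken over Euclidean keypoint distances in pixels, the "±" is treated as a sample standard deviation across the three runs, and all function names and numbers in the code are hypothetical rather than the authors' data or pipeline.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixels


def rmse(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Root mean squared Euclidean distance between matched keypoints."""
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, truth)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_runs(values: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent training runs."""
    return statistics.mean(values), statistics.stdev(values)


# Toy example: two keypoints, one predicted 5 px away from its label.
labels = [(0.0, 0.0), (10.0, 10.0)]
predictions = [(3.0, 4.0), (10.0, 10.0)]
print("RMSE for one run:", rmse(predictions, labels))

# Hypothetical per-run RMSE values for three training runs of one model.
mean_rmse, sd_rmse = summarize_runs([1.0, 2.0, 3.0])
print(f"{mean_rmse:.2f} ± {sd_rmse:.2f} pixels")
```

      Because a few large keypoint jumps dominate a squared-error average, an RMSE of this form is sensitive to drift, which may be why it separates the methods more strongly than the DLC and SLEAP evaluation metrics quoted above.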

      Furthermore, we have conducted pose estimation accuracy evaluations on the publicly available marmoset dataset from DeepLabCut, where ADPT also demonstrated superior performance compared to DeepLabCut and SLEAP. These results can be found in the "ADPT can be adapted for end-to-end pose estimation and identification of freely social animals" section of the Results (see lines 323-329).

      We acknowledge the existing literature on multi-animal tracking, such as the work by Walter and Couzin (2021). While individual tracking is crucial, our primary focus lies in the effective tracking of animal poses and minimizing drift during this process. This dual emphasis on pose tracking and anti-drift performance distinguishes our work and aligns with ongoing advancements in the field. Engaging with this literature helps place our results within the broader tracking field: while our findings may overlap with existing methods, the focus on improving tracking stability and reducing drift is a distinct contribution. Thank you for your valuable feedback, which has helped us improve the robustness of our manuscript.

      The authors state that "The evaluation of our social tracking capability was performed by visualizing the predicted video data (see supplement Videos 3 and 4)." While the authors report success maintaining mouse ID, when one actually watches the key points in the video of the two mice (only a single minute was used for validation) the pose estimation is relatively poor with tails rarely being detected and many pose issues when the mice get close to each other.

      We acknowledge that there are indeed challenges in pose estimation, particularly when the two mice get close to each other, leading to tracking failures and infrequent detection of tails in the predicted videos. The reasons for these issues can be summarized as follows:

      Lack of Training Data from Real Social Scenarios: The training data used for the social tracking assessment were primarily derived from the Mix-up Social Animal Dataset, which does not fully capture the complexities of real social interactions. In future work, we plan to incorporate a blend of real social data and the Mix-up data for model training. Specifically, we aim to annotate images where two animals are in close proximity or interacting to enhance the model's understanding of genuine social behaviors.

      Challenges in Tail Tracking in Social Contexts: Tracking the tails of mice in social situations remains a significant challenge. To validate this, we have added an assessment of tracking performance in real social settings using homecage data. Our findings indicate that using annotated data from real environments significantly improves tail tracking accuracy, as demonstrated in the supplementary video.

      We appreciate your feedback, which highlights critical areas for improvement in our model.

      Finally, particularly in the methods section, there were a number of places where what was actually done wasn't clear.

      We have carefully reviewed and revised the corresponding parts to clarify the previously incomprehensible statements. Thank you for your valuable feedback, which has helped enhance the clarity of our methods.

      For example in describing the network architecture, the authors say "Subsequently, network separately process these features in three branches, compute features at scale of one-fourth, one-eight and one-sixteenth, and generate one-eight scale features using convolution layer or deconvolution layer." Does only the one-eight branch have deconvolution or do the other branches also?

      We apologize for the confusion this has caused. Upon reviewing our manuscript, we identified an error in the diagram. In the revised version, we have clarified that the model samples feature maps at multiple resolutions and ultimately integrates them at the 1/8 resolution for feature fusion. Specifically, the 1/4 feature map from ResNet50's stack 2 is processed through max-pooling and convolution to generate a 1/8 feature map. Additionally, the 1/4 feature map from ResNet50's stack 2 is also transformed into a 1/8 feature map using a convolution operation with a stride of 2. Finally, both the input and output of the transformer are at the 1/16 resolution, which can be trained on a 2080Ti GPU. The 1/16 feature map is then upsampled to produce the final 1/8 feature map. We have updated the manuscript to reflect these changes, and we also modified the model architecture diagram for better clarity.
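
      The resolution bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of how the branches arrive at the shared 1/8 scale, not the authors' network code; it assumes input sizes divisible by 16 so that no rounding or cropping is needed, and the function names are hypothetical.

```python
def downsampled_size(height, width, stride):
    """Feature-map size after downsampling by `stride` (assumes exact divisibility)."""
    if height % stride or width % stride:
        raise ValueError("this sketch assumes sizes divisible by the stride")
    return height // stride, width // stride


def branches_at_one_eighth(height, width):
    """Trace how the 1/4 and 1/16 feature maps reach the shared 1/8 resolution."""
    quarter = downsampled_size(height, width, 4)
    sixteenth = downsampled_size(height, width, 16)
    return {
        # Resolution at which the features are fused.
        "target": downsampled_size(height, width, 8),
        # 1/4 map -> stride-2 step (max-pooling or strided convolution).
        "from_quarter": (quarter[0] // 2, quarter[1] // 2),
        # 1/16 transformer output -> 2x upsampling.
        "from_sixteenth": (sixteenth[0] * 2, sixteenth[1] * 2),
    }
```

      For a hypothetical 480 x 640 input, every entry is (60, 80), which is the condition the fusion step relies on.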

      Similarly, for the speed test, the authors say "Here we evaluate the inference speed of ADPT. We compared it with DeepLabCut and SLEAP on mouse videos at 1288 x 964 resolution", but in the methods section they say "The image inputs of ADPT were resized to a size that can be trained on the computer. For mouse images, it was reduced to half of the original size." Were different image sizes used for training and validation? Or Did ADPT not use 1288 x 964 resolution images as input which would obviously have major implications for the speed comparison?

      For our inference speed evaluation, all models, including ADPT, used images with a resolution of 1288 x 964. In ADPT's processing pipeline, the first layer is a resizing layer designed to compress the images to a scale determined by the global scale parameter. For the mouse images, we set the global scale to 0.5, allowing our GPU to handle the data at that resolution during transformer training.

      We recorded the time taken by ADPT to process the entire 15-minute mouse video, which included the time taken for the resizing operation, and subsequently calculated the frames per second (FPS). We have clarified this process in the manuscript, particularly in the "Network Architecture" section, where we specify: "Initially, ADPT will resize the images to a scale (a hyperparameter, consistent with the global scale in the DLC configuration)."
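
      A minimal Python sketch of this kind of timing is given below. It assumes that the resizing step is counted inside the timed loop, as the response describes; `resize` and `infer` are hypothetical stand-ins for the model's resizing layer and forward pass, and the injectable `clock` argument exists only so the function can be checked without a real video.

```python
import time


def scaled_size(width, height, global_scale):
    """Image size after applying the global scale, e.g. 1288 x 964 at 0.5."""
    return int(width * global_scale), int(height * global_scale)


def measure_fps(frames, resize, infer, clock=time.perf_counter):
    """Frames per second over a whole video, with resizing included in the timing."""
    start = clock()
    count = 0
    for frame in frames:
        infer(resize(frame))  # resize time is deliberately part of the measurement
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

      Timing the resize together with inference keeps the comparison fair when every method receives the same full-resolution frames.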

      Similarly, for the individual ID experiments, the authors say "In this experiment, we used videos featuring different identified mice, allocating 80% of the data for model training and the remaining 20% for accuracy validation." Were frames from each video randomly assigned to the training or validation sets? Frames from the same video are very correlated (two frames could be just 1/30th of a second different from each other), and so if training and validation frames are interspersed with each other validation performance doesn't indicate much about performance on more realistic use cases (i.e. using models trained during the first part of an experiment to maintain ids throughout the rest of it.)

      In our study, we actually utilized the first 80% of frames from each video for model training and the remaining 20% for testing the model's ID tracking accuracy. We have revised the relevant description in the manuscript to clarify this process. The updated description can be found in the "Datasets" section under "Mouse Videos of Different Individuals."
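
      The split described above can be illustrated with a short Python sketch. It assumes frames are indexed in recording order and that the cut point is rounded down; the function name `chronological_split` is hypothetical and not taken from the manuscript.

```python
def chronological_split(num_frames, train_fraction=0.8):
    """Split frame indices so that all training frames precede all test frames."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(num_frames * train_fraction)
    return range(0, cut), range(cut, num_frames)
```

      Unlike a random per-frame split, this keeps near-duplicate neighboring frames from landing on both sides of the split, which is the concern the reviewer raised.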

    1. eLife Assessment

      This useful manuscript reports on the crystal structures of two glycosaminoglycan (GAG) lyases from the PL35 family, along with in vitro enzyme activity assays and comprehensive structure-guided mutagenesis. The authors have addressed key concerns by incorporating additional docking analyses, validating the role of His188 in alginate degradation, and providing ICP-MS data to examine Mn²⁺ binding. While these improvements enhance the study, the study is incomplete due to the lack of enzyme-substrate complex structures and reliance on modeling which still limit mechanistic insight. Nonetheless, the revised manuscript presents a more complete analysis that will be of interest to specialists in carbohydrate-active enzymes.

    2. Reviewer #1 (Public review):

      Summary:

      This study aims to uncover molecular and structural details underlying the broad substrate specificity of glycosaminoglycan lyases belonging to a specific family (PL35). They determined the crystal structures of two such enzymes, conducted in vitro enzyme activity assays, and a thorough structure-guided mutagenesis campaign to interrogate the role of specific residues. They made progress towards achieving their aims and I appreciate the attempt of the authors to address my initial comments on the paper.

      Impact on the field:

      I expect this work will have limited impact on the field, although it does stand on its own as a solid piece of structure-function analysis.

      Strengths:

      The major strengths of the study were the combination of structure and enzyme activity assays, comprehensive structural analysis, as well as a thorough structure-guided mutagenesis campaign.

      Weaknesses:

      (Before revision) -the authors claim to have done an ICP-MS experiment to show Mn2+ binds to their enzyme, but did not present the data. The authors could have used the anomalous scattering properties of Mn2+ at the synchrotron to determine the presence and location of this cation (i.e. fluorescence spectra, and/or anomalous data collection at the Mn2+ absorption peak).

      *comment after revision: I appreciate that the authors included this data now, and it looks fine.

      (Before revision) -the authors have an over-reliance on molecular docking for understanding the position of substrates bound to the enzyme. The docking analysis performed was cursory at best; Autodock Vina is a fine program but more rigorous software could have been chosen, as well as molecular dynamics simulations. As well the authors do not use any substrate/product-bound structures from the broader PL enzyme family to guide the placement of the substrates in the GAGases, and interpret the molecular docking models.

      *comment after revision: the authors used another docking program, which is fine, but did not do any MD analysis or comment on why not. Also maybe it is just me but I still do not see a figure explicitly showing an overlay/superposition of the docking results with crystal structures of similar enzymes with similar ligands. The authors do have a statement in this regard but I believe a figure (e.g. an additional panel on S2) would be very helpful to the reader.

      (Before revision)-the conclusion that the structures of GAGase II and VII are most similar to the structures of alginate lyases (Table 2 data), and the authors' reliance on DALI, are both questioned. DALI uses a global alignment algorithm, which when used for multi-domain enzymes such as these tends to result in sub-optimal alignment of active site residues, particularly if the active site is formed between the two domains as is the case here. The authors should evaluate local alignment methods focused on optimization of the superposition of a single domain; these methods may result in a more appropriate alignment of the active site residues, and different alignment statistics. This may influence the overall conclusion of the evolutionary history of these PL35 enzymes.

      *comment after revision: I'm not sure the authors understood my suggestion as the reply reiterates the original conclusions. I suggest local structural alignment of *only* the toroid and antiparallel β-sheet domains, not global alignment of both domains, as this would improve the accuracy of the structural similarity conclusions.

      (Before revision)-the data on the GAGase III residue His188 is not well interpreted; substitution of this residue clearly impacts HA and HS hydrolysis as well. The data on the impact on alginate hydrolysis is weak, which could be due to the fact that the WT enzyme has poor activity against alginate to start with.

      *comment after revision: I appreciate that the authors used higher amounts of H188A variants and still do not see activity on alginate, which strengthens the conclusions regarding this substrate. However this variant also has decreased activity against HS (Figure 5C) and thus H188 appears to be important for more substrates than just alginate. The discussion section should be updated accordingly.

      (Before revision)-the authors did not use the words "homology", "homologous", or "homolog" correctly (these terms mean the subjects have a known evolutionary relationship, which may or may not be known in the contexts the authors used these targets); the words "similarity" and "similar" are recommended to be used instead.

      *comment after revision: I thank the authors for addressing this.

      (Before revision)-the authors discuss a "shorter" cavity in GAGases, which does not make sense, and is not supported by any figure or analysis. I recommend a figure with a surface representation of the various enzymes of interest, with dimensions of the cavity labeled (as a supplemental figure). The authors also do not specifically define what subsites are in the context of this family of enzymes, nor do they specifically label or indicate the location of the subsites on the figures of the GAGase II and IV enzyme structures.

      *comment after revision: I thank the authors for improving their figures and text description on this point.

    3. Reviewer #3 (Public review):

      Summary:

      The authors previously characterized the substrate specificity of several polysaccharide lyases from family PL35 (CAZy) and discovered their unusually broad substrate specificity, being able to degrade three types of GAGs belonging to HA, CS, and HS classes.

      In this study they determined the 3D structures of two lyases from this family and identified several residues essential for substrate degradation. Comparison with lyases from other PL families but having the same fold allowed them to propose an Asn, Tyr and His as essential for catalysis. One of the characterized lyases can also degrade alginate and they established a specific His residue as necessary for activity toward this substrate but not sufficient by itself.

      Attempts to obtain crystals with substrate or products were unsuccessful, therefore the authors resorted to modeling substrate into the determined structures. The obtained models led them to propose a catalytic mechanism that generally reflects the previously proposed mechanism for lyases with this fold.

      Unfortunately, they have no definitive explanation for a broad specificity for the PL35 lyases but suggest that it is related to a shorter substrate binding cleft with a large open space on the nonreducing end of the substrate.

      Strengths:

      The determination of 3D structure of two PL35 lyases allows comparing them to other lyases with similar fold. The structures show a shorter substrate binding cleft that might be the reason for broader substrate specificity. Essential roles of several residues in catalysis and/or substrate binding were established by mutagenesis.

      Weaknesses:

      The main weakness is the lack of the structures of an enzyme-substrate/product complex. While the determined structures confirm the predicted two domain fold with a helical toroid domain and a double beta-sheet domain, the explanation for the broad specificity is lacking, except for suggestion that it has to do with a shorter substrate binding cleft. The enzymatic mechanism is hypothesized based on models rather than supported by experimentally determined structure of the complex.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study aims to uncover molecular and structural details underlying the broad substrate specificity of glycosaminoglycan lyases belonging to a specific family (PL35). They determined the crystal structures of two such enzymes, conducted in vitro enzyme activity assays, and a thorough structure-guided mutagenesis campaign to interrogate the role of specific residues. They made progress towards achieving their aims but I see significant holes in data that need to be determined and in the authors' analyses.

      Impact on the field:

      I expect this work will have a limited impact on the field, although, with additional experimental work and better analysis, this paper will be able to stand on its own as a solid piece of structure-function analysis.

      Strengths:

      The major strengths of the study were the combination of structure and enzyme activity assays, comprehensive structural analysis, as well as a thorough structure-guided mutagenesis campaign.

      Weaknesses:

      There were several weaknesses, particularly:

      (1) The authors claim to have done an ICP-MS experiment to show Mn2+ binds to their enzyme but did not present the data. The authors could have used the anomalous scattering properties of Mn2+ at the synchrotron to determine the presence and location of this cation (i.e. fluorescence spectra, and/or anomalous data collection at the Mn2+ absorption peak).

      Thank you for your kind comment and suggestion. Many studies have used ICP-MS to detect metal ions within proteins (doi: 10.1016/j.jbc.2023.103047; doi: 10.1074/jbc.RA119.011790), so we used this method to determine the type of atoms within GAGases. In the revised manuscript, the ICP-MS data are presented in “Supplemental Table S1”.

      (2) The authors have an over-reliance on molecular docking for understanding the position of substrates bound to the enzyme. The docking analysis performed was cursory at best; Autodock Vina is a fine program but more rigorous software could have been chosen, as well as molecular dynamics simulations. As well the authors do not use any substrate/product-bound structures from the broader PL enzyme family to guide the placement of the substrates in the GAGases, and interpret the molecular docking models.

      Thank you for your kind comments. The interaction between the enzyme and ligand should be confirmed by resolving the structure of the enzyme-ligand complex. Unfortunately, we tried to prepare co-crystals of GAGases with various oligosaccharide substrates but ultimately failed. Thus, we used docking with Autodock Vina to explain the catalytic mechanism of polysaccharide lyases, although this method may be questionable. In the revised manuscript, we predicted the substrate binding site of GAGase II using Caver Web 1.2 and, in parallel, performed molecular docking near the substrate binding site using Molecular Operating Environment (MOE) to verify the accuracy of the docking results (Figure 6, Supplemental Figure S4). In addition, a series of enzyme-substrate complex structures of identified PL family enzymes with structural similarities to the GAGases are shown in Supplemental Figure S2. The positions of the catalytic cavities and the substrate binding modes in these structures are similar to those in the molecular docking results, which further supports the reliability of our docking results.

      (3) The conclusion that the structures of GAGase II and VII are most similar to the structures of alginate lyases (Table 2 data), and the authors' reliance on DALI, are both questioned. DALI uses a global alignment algorithm, which when used for multi-domain enzymes such as these tends to result in sub-optimal alignment of active site residues, particularly if the active site is formed between the two domains as is the case here. The authors should evaluate local alignment methods focused on the optimization of the superposition of a single domain; these methods may result in a more appropriate alignment of the active site residues and different alignment statistics. This may influence the overall conclusion of the evolutionary history of these PL35 enzymes.

      Thank you for your kind question. As you suggested, multiple structural alignment assays were carried out for the (α/α)ₙ toroid and the antiparallel β-sheet domain, respectively, based on the structures of GAGs/alginate lyases from PL5, PL8, PL12, PL15, PL17, PL21, PL23, PL36, PL38 and PL39 families. The results showed that the overall structure of GAGases is more similar to that of PL15, PL17 and PL39 family alginate lyases, which have an (α/α)₆ toroid and an antiparallel β-sheet domain (Table 3). In terms of the toroid and antiparallel β-sheet domains, most of them have an (α/α)₆ toroid and an antiparallel β-sheet as shown in Table 3. We also noticed that GAGases possess an (α/α)₆ toroid structure rather than an (α/α)₇ toroid structure, and revised the relevant statement in the manuscript.

      (4) The data on the GAGase III residue His188 is not well interpreted; substitution of this residue clearly impacts HA and HS hydrolysis as well. The data on the impact on alginate hydrolysis is weak, which could be due to the fact that the WT enzyme has poor activity against alginate to start with.

      Thank you very much for your helpful comments and questions. To test your suggestion that the weak impact on alginate hydrolysis could be due to the poor activity of wild-type GAGase III, we degraded alginate using different enzyme amounts (3 to 30 μg) and analyzed the degradation products. The results showed that the alginate-degrading activity of GAGase III-H188A and GAGase III-H188N was abolished, even at a quite high ratio of the mutated enzyme to substrate such as 30 μg enzyme to 30 μg substrate (Supplemental Figure S3A), while their GAG-degrading activity was only partially affected, indicating that this residue plays a more important role in the digestion of alginate than of other substrates. Unfortunately, we were unable to confer the alginate-degrading ability of GAGase III on GAGase II through the N191H mutation. Therefore, we suggest that His188 plays a key role in the specificity of alginate degradation by GAGase III, but that other determinants also contribute to this process. We will try more methods to obtain the structure of enzyme-substrate co-crystals and explain its substrate-selective mechanism in future studies.

      (5) The authors did not use the words "homology", "homologous", or "homolog" correctly (these terms mean the subjects have a known evolutionary relationship, which may or may not be known in the contexts the authors used these targets); the words "similarity" and "similar" are recommended to be used instead.

      Thank you for your helpful suggestions. We have revised the relevant part of the description in the manuscript.

      (6) The authors discuss a "shorter" cavity in GAGases, which does not make sense and is not supported by any figure or analysis. I recommend a figure with a surface representation of the various enzymes of interest, with dimensions of the cavity labeled (as a supplemental figure). The authors also do not specifically define what subsites are in the context of this family of enzymes, nor do they specifically label or indicate the location of the subsites on the figures of the GAGase II and IV enzyme structures.

      Thank you for your helpful suggestions. Figures (Supplemental Figure S2) with surface representations of the GAGase II and some structurally similar GAGs/alginate lyases with the dimensions of the cavity labeled, were added to the supplementary data as you suggested. Considering the correlation between enzyme specificity and substrate binding sites, we speculated that a shorter substrate binding cavity might allow the enzyme to accommodate a wider variety of substrates, resulting in a smaller restriction of the catalytic cavity to substrate binding, although this speculation needs to be verified by the resolution of the crystal structure of the enzyme-substrate complexes.

      Reviewer #2 (Public review):

      Summary:

      Wei et al. present the X-ray crystallographic structures of two PL35 family glycosaminoglycan (GAG) lyases that display a broad substrate specificity. The structural data show that there is a high degree of structural homology between these enzymes and GAGases that have previously been structurally characterized. Central to this are the N-terminal (α/α)7 toroid domain and the C-terminal two-layered β-sheet domain. Structural alignment of these novel PL35 lyases with previously deposited structures shows a highly conserved triplet of residues at the heart of the active sites. Docking studies identified potentially important residues for substrate binding and turnover, and subsequent site-directed mutagenesis paired with enzymatic assays confirmed the importance of many of these residues. A third PL35 GAGase that is able to turn over alginate was not crystallized, but a predicted model showed a conserved active site Asn was mutated to a His, which could potentially explain its ability to act on alginate. Mutation of the His into either Ala or Asn abrogated its activity on alginate, providing supporting evidence for the importance of the His. Finally, a catalytic mechanism is proposed for the activity of the PL35 lyases. Overall, the authors used an appropriate set of methods to investigate their claims, and the data largely support their conclusions. These results will likely provide a platform for further studies into the broad substrate specificity of PL35 lyases, as well as for studies into the evolutionary origins of these unique enzymes.

      Strengths:

      The crystallographic data are of very high quality, and the use of modern structural prediction tools to allow for comparison of GAGase III to GAGase II/GAGase VII was nice to see. The authors were comprehensive in their comparison of the PL35 lyases to those in other families. The use of molecular docking to identify key residues and the use of site-directed mutagenesis to investigate substrate specificity was good, especially going the extra distance to mutate the conserved Asn to His in GAGase II and GAGase VII.

      Weaknesses:

      The structural models simply are not complete. A cursory look at the electron density and the models show that there are many positive density peaks that have not had anything modelled into them. The electron density also does not support the placement of a Mn2+ in the model. The authors indicate that ICP-MS was done to identify the metal, but no ICP-MS data is presented in the main text or supplementary. I believe the authors put too much emphasis on the possibility of GAGase III representing an evolutionary intermediate between GAG lyases and alginate lyases based on a single Asn to His mutation in the active site, and I don't believe that enough time was spent discussing how this "more open and shorter" catalytic cavity would necessarily mean that the enzyme could accommodate a broader set of substrates. Finally, the proposed mechanism does not bring the enzyme back to its starting state.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Minor points:

      (1) The number of significant digits used in Table 1 and Figure 3 legend are not justified. The authors should use a maximum of 2 significant digits.

      Thank you for your kind suggestion. We have verified the relevant data and retained two significant digits.

      (2) The authors should use the words "mutant" or "mutation" only when discussing DNA, but when discussing protein, the words "variant" and "substitution" should be used instead as these are more appropriate.

      Thank you for your helpful suggestions. We have revised the relevant description in the manuscript as you suggested.

      (3) Lines 102-110 are a long, run-on sentence that should be split into shorter sentences. Similarly, lines 367-378 should be split into shorter sentences.

      Thank you for your suggestions. In the revised manuscript, the long sentences in lines 102-110 and 367-378 have been rewritten into shorter ones.

      (4) Lines 174-175: His, Tyr, Glu, and Trp are not positively charged residues and this wording should be changed.

      Thank you for your suggestions. We have revised the relevant description in the manuscript as you suggested.

      (5) Lines 423-426 require a reference.

      Thank you for your suggestion. We have provided the reference at the right position and revised the relevant description in the manuscript as you suggested.

      (6) Grammar/language:

      -line 90 - change "should emerge" to "likely emerged"

      -line 145 - delete "Finally"

      -line 264 - delete "their"

      -line 265 - delete "active sites"

      -line 265-266 - change to "To confirm this hypothesis, site-directed mutagenesis followed by enzyme activity assay was performed"

      -line 311 - change "residue in the catalytic cavity of GAGase III, which.." to "residue in its catalytic cavity, which..."

      -line 318 - change "affect" to "affected"

      -line 323 - change to "degrading activity of GAGase II remains to be determined outside of the His188 residue"

      -line 345 - delete "assays"

      -line 359 - change to "evidence"

      -line 397 - change "folds" to "3D fold"

      -line 420 - change to "share similar catalytic sites"

      -lines 411, 433 - change "conversed" to "conserved"

      -line 441 - change to "Mutational analysis showed that the His188.."

      -line 450 - delete "which"

      Thank you for your suggestions. The grammatical errors have been corrected in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      Major Concerns

      The electron density in your model clearly does not support the placement of a Mn ion. In the GAGase II structure, the placement of the Mn and the placement of waters around it still results in two density peaks of > 12 rmsd. The manuscript suggests that ICP-MS was done but the results of this are not shown anywhere. Please include your ICP-MS data. I see the structures have already been deposited, and if they have been deposited unchanged, please see if you can modify them to actually finish building the models. I don't find your data in Figure 2B particularly convincing that Mn is necessarily important for activity.

      Thank you for your kind comments. ICP-MS is a common method for detecting metal ions within proteins (doi: 10.1016/j.jbc.2023.103047; doi: 10.1074/jbc.RA119.011790), and thus we used it to determine the type of atoms within GAGases in this study. In the revised manuscript, the ICP-MS data are presented in “Supplemental Table S1”, and they clearly show that the content of Mn²⁺, unlike that of the other ions, is much higher in the test sample than in the negative control, suggesting the involvement of Mn²⁺ in the protein. We agree that the addition of Mn²⁺, like that of the other tested metal ions, does not strongly promote the activity of GAGase II, but the addition of EDTA significantly inhibited the enzyme activity (Figure 2), indicating that a metal ion such as Mn²⁺ is necessary for the function of GAGases. Whether the metal ion participates in the catalytic reaction or only stabilizes the structure of the enzyme remains to be explored in future studies.

      Minor Concerns

      (1) Please include CC1/2 in your Table 1.

      Thank you for your kind suggestions. CC1/2 parameters have been added in the revised manuscript (Table 1).

      (2) If possible please include SDS-PAGE gel images of your purified proteins. Particularly for the point mutations. Ideally, you would have done SEC on your mutants to show that the reduction in activity is not due to aggregation/misfolding, but at the very least I would like to see that you have similar levels of purity.

      Thank you for your kind suggestions. As you suggested, we have added SDS-PAGE gel images of purified GAGase II, GAGase III, GAGase VII, and their mutant enzymes to the supplementary data. As shown in Figure S5, site-directed mutagenesis did not affect the soluble expression levels of GAGase II, GAGase III or GAGase VII, indicating that the reduction in activity is not due to aggregation or misfolding. Due to the large number of variants, we used crude enzyme for the activity assay of substrate binding sites, while for some key catalytic residues, we purified the corresponding mutant enzymes and then verified their activities by HPLC.

      (3) When referring to your structural predictions, it is not appropriate to say that you used Robetta. Your reference is correct though - you should say that the structures were predicted using RoseTTAfold.

      Thank you for your helpful suggestions. We have revised the relevant description in the manuscript.

      (4) If possible expand on how the shorter/more open active site cavity would result in broader substrate specificity.

      Thank you for your kind comment. In the revised manuscript, surface representations of GAGase II and several representative, structurally similar GAG/alginate lyases, with the dimensions of the cavity labeled, have been added to the supplementary data (Supplemental Figure S2). Considering the correlation between enzyme specificity and substrate binding sites, we speculate that a shorter substrate binding cavity imposes fewer restrictions on substrate binding and might therefore allow the enzyme to accommodate a wider variety of substrates. Unfortunately, we did not succeed in obtaining co-crystals of GAGases with any of the substrates. In future studies, we will try to explain the mechanism of substrate selectivity by growing and solving crystals of enzyme-substrate complexes or by other means.

      (5) I would put less emphasis on His188 in GAGase III being a strong indicator that this protein represents an evolutionary intermediate between alginate lyases and GAGases.

      Thank you for your comment. The His<sup>188</sup> residue, which is unique compared to other GAGases, is essential for the alginate-degrading activity of GAGase III. Regarding why GAGases are thought to represent a possible evolutionary intermediate between alginate lyases and GAG lyases, phylogenetic analysis demonstrated that GAGases show considerable homology with some identified GAG lyases and alginate lyases (DOI: 10.1016/j.jbc.2024.107466). The similarity in primary structure between some GAG lyases, alginate lyases, and GAGases suggests structural similarities, which are further supported by this study. As structure determines function, structural similarity is often used as a key criterion when studying the evolution of proteins, and GAGase III, which shows significant GAG- and alginate-degrading activity, supports this speculation. Of course, our analysis of the evolutionary relationship between GAGases and identified GAG lyases and alginate lyases, based on structural comparison, is an attempt using existing methods. The conclusions we have drawn remain a hypothesis that requires further evidence to support and validate.

    1. eLife Assessment

      This important work advances our understanding of how the SARS-CoV-2 Nsp16 protein is regulated by host E3 ligases to promote viral mRNA capping. Support for the overall claims in the revised manuscript is convincing. This work will be of interest to those working in host-viral interactions and the role of the ubiquitin-proteasome system in viral replication.

    2. Reviewer #1 (Public review):

      In this study, Tian et al. explore the role of ubiquitination of non-structural protein 16 (nsp16) in the SARS-CoV-2 life cycle. nsp16, in conjunction with nsp10, performs the final step of viral mRNA capping through its 2'-O-methylase activity. This modification allows the virus to evade host immune responses and protects its mRNA from degradation. The authors demonstrate that nsp16 undergoes ubiquitination and subsequent degradation by the host E3 ubiquitin ligases UBR5 and MARCHF7 via the ubiquitin-proteasome system (UPS). Specifically, UBR5 and MARCHF7 mediate nsp16 degradation through K48- and K27-linked ubiquitination, respectively. Notably, degradation of nsp16 by either UBR5 or MARCHF7 operates independently, with both mechanisms effectively inhibiting SARS-CoV-2 replication in vitro and in vivo. Furthermore, UBR5 and MARCHF7 exhibit broad-spectrum antiviral activity by targeting nsp16 variants from various SARS-CoV-2 strains. This research advances our understanding of how nsp16 ubiquitination impacts viral replication and highlights potential targets for developing broadly effective antiviral therapies.

      Strengths:

      The proposed study is of significant interest to the virology community because it aims to elucidate the biological role of ubiquitination in coronavirus proteins and its impact on the viral life cycle. Understanding these mechanisms will address broadly applicable questions about coronavirus biology and enhance our overall knowledge of ubiquitination's diverse functions in cell biology. Employing in vivo studies is a strength.

      Weaknesses:

      Minor comments:

      Figure 5A: The authors should ensure that the figure is properly labeled to clearly distinguish between the IP (Immunoprecipitation) panel and the input panel.

    3. Reviewer #3 (Public review):

      Summary:

      The manuscript "SARS-CoV-2 nsp16 is regulated by host E3 ubiquitin ligases, UBR5 and MARCHF7" is an interesting work by Tian et al. describing the degradation/stability of NSP16 of SARS-CoV-2 via K48- and K27-linked ubiquitination and proteasomal degradation. The authors have demonstrated that UBR5 and MARCHF7, both E3 ubiquitin ligases, bring about the ubiquitination of NSP16. The concept and the experimental approach used to test the hypothesis look ok. The in vivo data look ok with the controls. Overall, the manuscript is good.

      Strengths:

      The study identified important E3 ligases (MARCHF7 and UBR5) that can ubiquitinate NSP16, an important viral factor.

      Comments on revisions:

      I have gone through the revised form of the manuscript thoroughly. The authors have addressed all of my concerns. To me, the experimental approach convincingly shows that the host E3 ubiquitin ligases (UBR5 and MARCHF7) ubiquitinate NSP16 and mark it for proteasomal degradation via K48- and K27-linkage. The authors have presented the final figure (Fig. 8) in a convincing manner, opening a new window to explore the mechanism of vRNA capping by NSP16.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Major comments:

      (1) In Figure 1 the authors could reference and use NSP8 (PMID: 38275298) and Nucleocapsid (PMID: 37185839) in their experiments as positive controls.

      Thank you for your suggestion! In Figure 1A, during our screening of SARS-CoV-2 nsp proteins regulated by MG132, we confirmed that nsp8 can also be restored by MG132. This finding indicates that nsp8 is degraded via the proteasome pathway and can therefore serve as a positive control for the experiment. It has been reported that nsp8 undergoes degradation via the ubiquitin-proteasome pathway following its ubiquitination mediated by TRIM22. We have added the description at line 115 in the manuscript.

      (2) The data indicating that NSP16 is ubiquitinated come from overexpression systems, and it is possible that NSP16 ubiquitination only occurs in expression contexts, not during coronavirus infection. If NSP16 ubiquitination can't be measured in the context of infection, it is unclear how we can make any conclusions. The authors need to demonstrate the ubiquitination of NSP16 in the context of viral infection.

      We greatly appreciate the reviewer's suggestion and have incorporated the corresponding experimental results. As shown in Figure 5A, co-IP experiments using an antibody against endogenous nsp16 were conducted following infection with the SARS-CoV-2 Wuhan strain. These experiments confirmed that the nsp16 protein encoded by the virus undergoes ubiquitination in infected cells. This finding demonstrates the ubiquitination of nsp16 in a biological context, thereby supporting the conclusions drawn from our overexpression experiments.

      (3) In Figure 4, adding controls will strengthen the authors' conclusion.

      a) Is it possible to observe ubiquitination of NSP16 by transfecting in NSP16-FLAG tagged, immunoprecipitate NSP16, run a western blot, and probe for endogenous ubiquitin?

      b) Can the authors please include an empty vector control as well as WT ubiquitin in these panels for comparison?

      c) In addition, why are the ubiquitination patterns different in the IP panels of D and E vs B? Without an empty vector control, it is challenging to conclude what the background is.

      Thank you for your valuable suggestions! We have made the following changes and additions in response to your comments:

      a) We have conducted the experiments as per the reviewer's suggestion. Figure 3B shows the result. Co-IP experiments were performed, and endogenous ubiquitination of nsp16 was observed using the endogenous ubiquitin antibody.

      b) We apologize for previously focusing solely on presenting multiple ubiquitin mutants on a single panel of nsp16 IP without considering the inclusion of an empty vector control and WT ubiquitin. The experiment has been redesigned and conducted, and the results are now presented in Figures 3E and 3F.

      c) The differences in the ubiquitination patterns observed between the IP panels in Figures 3E and 3F compared to 3C may be due to the use of different plasmids and antibodies and to differences in exposure depth. To address this, we have standardized the plasmids in the figure and included an empty vector control as a negative control to clarify the background signal.

      (4) Overexpression of the ubiquitin mutants may have an indirect effect on protein homeostasis. The authors can also utilize linkage-specific antibodies in their studies to elucidate the ubiquitin linkage associated with NSP16 ubiquitination. K63-linkage Specific Polyubiquitin (D7A11) Rabbit mAb, 5621S, and K48-linkage Specific Polyubiquitin (D9D5) Rabbit mAb, 8081S from Cell Signaling Technologies?

      We greatly appreciate the reviewer's excellent suggestion! Using linkage-specific antibodies to elucidate the ubiquitin linkage associated with nsp16 ubiquitination would indeed provide more direct evidence. However, due to the long lead time for obtaining these antibodies, we plan to conduct further verification in future experiments.

      (5) The authors discussed the subcellular localization of overexpressed NSP16- showing the localization of NSP16 in the context of viral infection would strengthen the study. If this is challenging, can the authors express NSP16 along with the co-factor NSP10 and examine its subcellular localization?

      Thank you for your suggestion! During viral infection, we observed the ubiquitination of the nsp16 protein through co-IP experiments, indicating that the presence of nsp10 does not influence the regulation of nsp16 ubiquitination by MARCHF7 or UBR5 (Figure 5A). Therefore, we believe that investigating the co-localization of nsp10 and nsp16 would not provide additional value to our results. Additionally, through a literature review, we found studies that have already examined the localization of nsp10 and nsp16 following viral infection. These studies revealed that nsp10 was located in the cytoplasm, while nsp16 can be detected in both the nucleus and cytoplasm (PMID: 33080218; PMID: 34452352). This observation is consistent with the localization of nsp16 that we observed in our overexpression experiments.

      (6) a) In Figure 3A, the authors should note that the interaction of NSP16 appears weak with UBR5. The authors should confirm that the interaction of NSP16 and the E3 ligases is relevant in the context of viral infection.

      b) In Figure 3B, the scale bars should be labeled in at least one panel, as well as in the legend.

      c) The authors discussed nuclear localization of MARCHF7, UBR5, and NSP16, therefore a control with a nuclear stain should be included in this figure to enhance the study.

      d) Some panels look overexposed while others are blurry, which decreases the robustness of the interaction as the authors stated in line 191. To strengthen the results of Figure 3, consider GST purification and in vitro, cell-free binding assays to confirm a direct interaction between nsp16 and the E3 ligases.

      Thank you for the reviewer’s thoughtful suggestions! We have made the following changes and adjustments based on your recommendations:

      a) On the interaction between nsp16 and UBR5:

      The interaction between nsp16 and UBR5 appears to be weak, possibly due to the large size of the UBR5 protein (300 kDa). As a result, there are challenges in presenting the experimental results, including difficulties in both expression and protein level detection. To further confirm the relevance of the interaction between nsp16 and the E3 ligases in the context of viral infection, we have performed experiments, and the results are presented in Figure 5A.

      b) On scale bars:

      The issue regarding the scale bars in Figure 4 has been addressed, and we have now included them in the figure legend for clarity (Line 885).

      c) On nuclear localization control:

      For the localization of MARCHF7, UBR5, and nsp16 in Figure 4C, given that both MARCHF7 and UBR5 are tagged with CFP, DAPI staining would result in spectral overlap. However, we conducted co-localization experiments for MARCHF7 or UBR5 with nsp16 in Figure 4—figure supplements 1E and 1F, where DAPI staining was included to illustrate the localization of these three proteins. Our experiments showed that while these proteins are present in both the nucleus and cytoplasm, they are predominantly localized in the cytoplasm.

      d) On validation of direct interaction:

      We attempted GST purification and in vitro cell-free binding assays to verify the direct interaction between nsp16 and the E3 ligases. However, UBR5 and MARCHF7 are both large proteins, with UBR5 being particularly large, which significantly increased the difficulty of purification. Additionally, we faced challenges in purifying nsp16, as the purified nsp16 protein tended to aggregate. We will continue to optimize purification techniques and conditions in future experiments.

      We appreciate your valuable comments, which have greatly contributed to improving our experiments and conclusions.

      (7) To confirm the knockdown of the E3 ligases by siRNA, the authors should use western blotting to show the presence/absence/decrease of the protein levels in addition to mRNA levels by RT-PCR. The authors have the lysates, and they have shown that the antibodies for MARCHF7 and UBR5 work; including these blots throughout the manuscript would therefore help substantiate the authors' conclusions.

      Thank you for the reviewer’s valuable suggestion! We have validated the knockdown efficiency at the protein level for the experiments involving siRNA knockdown. Corresponding Western blot images are now included in the relevant experiments to substantiate our conclusions, in addition to the RT-PCR data, including Figures 2, 4 and 5.

      (8) In the overexpression studies of the E3 ligases with viral infection in Figure 5, the authors should include the catalytic mutants for the E3 ligases with the nsp16 gradient experiment. This would strengthen the conclusion of the studies.

      Thank you for the reviewer’s suggestion! We have conducted the relevant experiments based on your recommendation, and the corresponding data are presented in the Figure 6—figure supplements 2A-H. These results strengthen the conclusions of our study.

      (9) Figure 5: For C and F, for a better comparison of the efficacy against the 2 strains, the authors should use the same scale. This could benefit from a kinetics experiment.

      Thank you for the reviewer’s suggestion! We have made revisions in Figures 5E and 5H in response to your recommendation.

      (10) Is there a synergistic effect of double E3 knockdown on viral replication?

      Thank you for the reviewer’s question! In Figure 5—figure supplement 1A-B, we knocked down MARCHF7 and UBR5 individually and simultaneously, followed by infection with SARS-CoV-2 transmissible virus-like particles. The results revealed that simultaneous knockdown further enhances viral replication, demonstrating a synergistic effect.
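
      One way to make the comparison between single and double knockdown explicit is to compare the observed double-knockdown fold change with the value expected if the two knockdowns acted independently. The sketch below assumes a multiplicative null model, which is our illustrative choice and not a method stated in the response; the function names and all numbers are hypothetical.

```python
def expected_combined_fold(fold_a: float, fold_b: float) -> float:
    """Expected fold change in replication if two knockdowns act independently.

    Assumes a multiplicative null model (an illustrative choice, not taken
    from the manuscript). Fold changes are relative to control siRNA.
    """
    return fold_a * fold_b


def interaction_ratio(fold_a: float, fold_b: float, fold_ab: float) -> float:
    """Observed double-knockdown fold change divided by the independent expectation.

    A ratio above 1 is consistent with synergy under this null model,
    a ratio near 1 with independent (multiplicative) effects.
    """
    return fold_ab / expected_combined_fold(fold_a, fold_b)


# Hypothetical fold changes, for illustration only.
ratio = interaction_ratio(fold_a=2.0, fold_b=3.0, fold_ab=9.0)
print(f"interaction ratio: {ratio:.2f}")
```

      Under this null model, a double knockdown that merely exceeds each single knockdown is not enough to indicate synergy; the observed value has to exceed the product of the single-knockdown fold changes.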

      (11) In lines 98-100 the authors state "This dual targeting by MARCHF7 and UBR5 impairs the 2'-O-MTase activity of nsp16, blocking the conversion of cap-0 to cap-1 at the 5 'end of viral RNA, ultimately exhibiting potent antiviral activity against SARS-CoV-2". The authors did not examine the 2'-O-MTase activity of nsp16. The authors should rephrase this or provide the data if this experiment was done.

      Thank you for the reviewer’s valuable suggestion! Based on your comment, we have revised the ambiguous wording located in lines 100-104.

      (12) In the discussion, the authors reported that elucidating a specific lysine residue (s) that is ubiquitinated was challenging and stated that they generated multiple mutants including truncated mutants, and wrote "data not shown". The authors need to include this data as supplementary.

      Thank you for the reviewer’s suggestion! Based on your comment, we have included the data regarding the specific lysine residue(s) that is ubiquitinated, along with the truncated mutants, as supplementary data (Appendix-figure S2).

      (13) In Figure 7, the authors showed a copy number of SARS CoV-2 E in lung tissue. The authors should show viral titers using either the plaque assay or the TCID50 assay.

      Thank you for the reviewer’s suggestion! Based on your comment, we measured the TCID50 of the virus in the lung tissue homogenates, and the results are presented in Figure 7D.

      Minor comments:

      (1) Line 76: while many E3 ubiquitin ligases directly recognize and bind to their target substrates, cullin-RING ligases directly bind an adaptor, which binds a substrate receptor and/or the substrate directly, while the RING-box protein binds a different surface of the cullin and is also not directly interacting with substrate.

      Thank you for the reviewer’s valuable suggestion! Based on your comment, we have revised the ambiguous wording in line 76.

      (2) Line 161: having introduced the suggestion that NSP16 is ubiquitinated by these ligases, consider moving Figure 4 to the Figure 3 spot.

      Based on your comment, we have rearranged the order of the figures and moved Figure 4 to the Figure 3 spot.

      (3) Figure 2: Can the authors please do +/- MG132 for each siRNA? It is possible that the lanes where we don't see NSP16 were because there was no NSP16 expressed, OR it was degraded, MG132 would confirm one or the other.

      Thank you for the reviewer’s suggestion! Based on your comment, we have redesigned the experiment and included the MG132 treatment for each siRNA. The results are presented in Figure 2A.

      (4) Line 165: The authors write "As confirmed by MS, both Myc-tagged MARCHF7 and endogenous UBR5 interact with nsp16, as seen in the Co-IP experiment" should be the reverse, MS suggests NSP16-E3 interaction, the co-ip confirms this.

      Based on your comment, we have revised the wording in line 183 to ensure accuracy. MS suggests the interaction between nsp16 and the E3 ligases, while the Co-IP experiment confirms this interaction.

      (5) Line 178: the cited paper doesn't clearly show NSP16 nuclear localization, nor do the authors of said paper claim that they found it there. It is cytoplasmic. Additionally, said paper used overexpression, and it is unclear if NSP16 is nuclear in the context of viral infection.

      Thank you for the reviewer’s suggestion! The referenced paper states, "As can be seen in the Supplementary Fig. S2, the viral proteins are either cytoplasmic (NSP2, NSP3C, NSP4, NSP8, Spike, M, N, ORF3a, ORF3b, ORF6, ORF7a, ORF7b, ORF8, ORF9b, and ORF10) or both nuclear and cytoplasmic (NSP1, NSP3N, NSP5, NSP6, NSP7, NSP9, NSP10, NSP12, NSP13, NSP14, NSP15, NSP16, E, and ORF9a)," indicating that nsp16 is localized in both the nucleus and cytoplasm. Upon reviewing the literature, we found that the paper (PMID: 33080218) reports the distribution of nsp16 protein following viral infection. The results indicate that nsp16 is present in both the nucleus and cytoplasm, although the authors of the referenced paper claim that nsp16 was located in the nucleus.

      (6) Line 197: in addition to the 7 lysine residues, ubiquitin can also form linear N-terminal linkages.

      Thank you for the reviewer’s suggestion! Linear N-terminal ubiquitination, with its distinct linkage and substrate recognition mechanism, is typically mediated by a complex consisting of the E3 ubiquitin ligases HOIL-1 and HOIP, and differs from classical ubiquitination. Therefore, this type of ubiquitin chain was not investigated in our experiments.

      (7) Line 202: Authors state "Interestingly, all single-lysine Ub mutants promoted nsp16 ubiquitylation to varying degrees, indicating a complex polyubiquitin chain structure on nsp16 potentially regulated by multiple E3 ligases". However, not all the mutants. K33 isn't supported by the blot.

      Thank you for pointing that out! Indeed, we made an error in our description. The K33 mutant did not promote nsp16 ubiquitylation, and we have corrected this in the manuscript accordingly in line 173.

      (8) Line 204: consider including "E2-E3 ligase pairs" for RING ligases the E2 determines the linkage type see: Cell Research (2016) 26:423-440.

      Thank you for your suggestion! We have included the term "E2-E3 ligase pairs" in the article in line 176.

      (9) Line 235: The authors used the real virus, so the inclusion of the BSL2 virus here is extraneous; it doesn't add anything. The authors can consider removing it.

      Thank you for your suggestion! In our experiments, we performed simultaneous knockdown of two E3 ligases, so we believe this data is relevant and should not be removed.

      (10) Line 238: Authors state: "led to a significant increase in SARS-CoV-2 levels compared to the control group". What is meant by "levels?"

      Thank you for your careful reading. We have updated "levels" to "replication" as suggested to clarify the meaning in line 237.

      (11) Line 245: increased titers. This could be improved for specificity by saying, 1-log increase for example.

      Thank you for the reviewer's valuable suggestions. We have made the necessary changes and specified "increased titers" as a "1-log increase" in lines 249 and 261.
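
      For readers unfamiliar with the convention, a "1-log increase" corresponds to a 10-fold increase in titer. The short sketch below shows the arithmetic; the function name and the titer values are hypothetical and are not data from the manuscript.

```python
import math


def log10_increase(titer_control: float, titer_treated: float) -> float:
    """Return the change in titer expressed in log10 units.

    A value of 1.0 means a 10-fold increase, 2.0 a 100-fold increase.
    Both titers must be positive and in the same units (e.g. TCID50/mL).
    """
    if titer_control <= 0 or titer_treated <= 0:
        raise ValueError("titers must be positive")
    return math.log10(titer_treated / titer_control)


# Hypothetical titers, for illustration only.
change = log10_increase(titer_control=1e4, titer_treated=1e5)
print(f"{change:.1f}-log increase")
```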

      (12) Line 249: in Figure 5H again, the authors are showing relative mRNA levels. Ideally should show protein levels by western blot.

      Thank you for the reviewer's suggestion! We have performed protein-level detection of the knockdown efficiency for the samples, and the bands have been placed in the corresponding positions in Figure 5I.

      (13) Line 259: "strongly linked to their ability to modulate..." This appears to be an overextension of the data. The data show nsp16 levels can compensate for E3 overexpression, but not that the E3 ligases are modulating this activity. We can infer this from previous experiments. Perhaps increasing the NSP12 levels would also have the same effect as they don't show that this is specific to NSP16. What about a catalytically dead E3?

      Thank you for the reviewer's thoughtful suggestion. We have revised the wording accordingly and designed the viral-related experiments with E3 enzyme activity mutants in Figure 6 supplement 2.

      (14) Figure 6: In panel H the MW for UBR5 is incorrect, should be around 300kDa.

      Thank you for the reviewer's detailed suggestions. We have made the necessary revisions in Figure 6H.

      (15) Line 267: "suggesting a more conserved sequence". What are the authors referring to? More conserved than what? This section would benefit from a discussion of which residues are mutated. Are they potential Ub sites, which could point to differential degradation by the E3s as due to more ubiquitination? Or rather to more efficient interaction with the E3? Is this conserved in related CoVs: original SARS and MERS, for instance?

      Thank you for the reviewer’s detailed suggestions. In this context, by “conservation,” we refer to the relative conservation of nsp16 proteins across different subtypes of the Omicron variant. We found that most of the mutation sites contained only 1 to 2 mutations. Additionally, we have constructed and validated multiple-mutant nsp16 proteins, which are still degraded by MARCHF7 or UBR5. Given the ongoing prevalence of the Omicron variant, we aim to explore the broad-spectrum degradation and antiviral effects of these two E3 ligases. While it would be ideal if these experiments could aid in identifying the ubiquitination sites, we have not yet identified any mutant forms that escape degradation. We also compared the nsp16 proteins of several other coronaviruses (such as human coronaviruses 229E, HKU1, MERS-CoV, NL63, OC43, and SARS-CoV-1), and found that these viruses' nsp16 proteins are not highly conserved. As a result, we have not further investigated whether MARCHF7 or UBR5 regulate the nsp16 proteins of these viruses.

      (16) Line 347: 2C of what virus?

      Thank you for the reviewer’s careful reading. We have made the necessary additions to address this point in line 357.

      (17) Line 890: "Scale bars, 25 mm". Should it be 25nm?

      Thank you for your feedback! We realized there was an error in the unit labeling and have corrected the relevant sections in line 904. We appreciate your careful reading.

      Reviewer #2 (Recommendations for the authors):

      (1) In Figure 6, the authors found that increasing amounts of nsp16 restored the replication of SARS-CoV-2 in the presence of MARCHF7 or UBR5. The authors should discuss the possibility that nsp16 may stimulate viral replication regardless of these E3 ligases, or provide evidence to further clarify this.

      Thank you for your thoughtful suggestion! Given the strong functionality of nsp16 itself, your consideration is very comprehensive. In Figure 6—figure supplement 2A–H, we conducted transfection experiments with E3 activity-deficient proteins and reintroduced nsp16. The results showed that, in the absence of active MARCHF7 or UBR5 antiviral function, overexpression of nsp16 did not promote viral replication, although the RNA levels of the M protein slightly increased. Therefore, in our experiments, excess nsp16 did not significantly stimulate viral replication.

      (2) In Figure 7, the in vivo data supports the function of both E3 ligases to reduce viral infectivity. Is it possible that tail vein injection of naked plasmid DNA may stimulate the innate immune system, e.g., induce IFN as a DNA vaccine, which may contribute to the inhibitory effect? The authors are suggested to discuss or address it.

      Upon reviewing the relevant literature, we found that the hydrodynamic gene delivery (HGD) method using naked DNA is both highly efficient and associated with a low risk of triggering immune responses or oncogenesis. Studies have shown that HGD only weakly activates host immunity (reference: 37111597), which is less of a concern compared to other gene delivery methods. Although some studies have reported strong immune responses following the injection of naked DNA (e.g., Otc cDNA) in human trials, it is noteworthy that no such responses were observed in 17 other participants. This suggests that the immune reactions observed in some cases may be due to individual variability or limitations in animal models, which may not fully translate to human trials.

      Based on these findings, we believe that the antiviral effects observed in our study are primarily attributable to the intrinsic properties and functions of the E3 ligases. Furthermore, it has been reported that mice and non-human primates exhibit significantly greater resistance to innate immune activation compared to humans. This highlights the challenges in translating these findings into effective antiviral therapeutics and underscores the need for further research in this area. We have incorporated the requested discussion into the manuscript in lines 393-410.

      (3) The authors should include some of the key data from the supplementary figures in the main text, such as the data showing that UBR5 and MARCHF7 mediate broad-spectrum degradation of nsp16 variants and that SARS-CoV-2 infection decreases UBR5 and MARCHF7 expression, which would make it easier for readers to follow.

      Thank you for your valuable suggestion regarding the organization of our manuscript. In response to your feedback, we have moved the study on nsp16 variants to the Figure 6—figure supplement 3. Additionally, the data showing changes in UBR5 and MARCHF7 levels following viral infection have been added as supplementary data in Figure 6—figure supplement 4.

      (4) The diagrammatic sketches in Figures 1E, S1A and B, 7A, and 8 had low resolutions. Please change them to higher resolutions. Moreover, please state the licensing rights of these diagrammatic sketches.

      Thank you for your detailed review! In response to your comment, we have improved the resolution of Figures 1E, S1A and B, 7A, and 8. Additionally, we have specified the drawing tools and source websites in the figure legends (lines 794, 813, 999, and 1013). We have also obtained the necessary licenses for each diagram.

      Figure 1E: Created in BioRender. Li, Z. (2025) https://BioRender.com/h43f612

      Figure S1B: Created in BioRender. Li, Z. (2025) https://BioRender.com/b98t559

      Figure 7A: Created in BioRender. Li, Z. (2025) https://BioRender.com/e76g512

      Figure 8: Created in BioRender. Li, Z. (2025) https://BioRender.com/o84p897

      (5) The authors suggested that both UBR5 and MARCHF7 had a function in triggering the degradation of NSP16, however, the expression of UBR5 but not MARCHF7 was shown to be associated with the severity of clinical symptoms. Further, why did the host evolve 2 kinds of E3 ligases to adjust only 1 viral target? Please discuss them.

      Thank you for your insightful comments. We acknowledge that the limited number of patients with varying degrees of illness in our study could potentially mask some of the observed phenomena. Additionally, individual variability may also play a significant role, which highlights the challenges in translating findings from animal models to human trials.

      Regarding the presence of two E3 ligases targeting the same substrate, we view this as part of an evolutionary arms race between the host and the virus. Viruses evolve mechanisms to counteract the host’s antiviral responses, while the host, in turn, develops multiple pathways and strategies to combat viral infection. This dynamic may explain why multiple E3 ligases regulate the levels of the same factor, reflecting the host’s complex and redundant antiviral defense mechanisms. We have incorporated the requested discussion into the manuscript in lines 359-362.

      (6) Please standardize the symbol size of the bar charts in the same figure, just like in Figures 1D and 5.

      Thank you for your constructive suggestion. We have standardized the symbol sizes of the bar charts in the figure as per your recommendation, ensuring consistency across all panels.

      (7) The use of English could be improved.

      Thank you for your feedback regarding the language. We have carefully reviewed the manuscript and made revisions to improve the clarity and fluency of the English.

      Reviewer #3 (Recommendations for the authors):

      Major points:

      (1) In Figure 1: The expression levels of NSP6, 10, 11, and 12 are weak. Include a higher exposure blot (right next to these blots, marked as higher exposure) to show the expression of these plasmids. Here, the NSP12 plasmid has no expression, so it is difficult to conclude the effect of MG132 from this blot. It would be appropriate to show the molecular weight of each gene fragment since some of the plasmids have multiple bands. Verify the densitometric analysis; the NSP4 (+/- MG132) blot and the densitometric analysis do not correlate. Figure 1B: It is recommended to include an appropriate control (media only) for NH4Cl. The DMSO control serves well for the drugs, not for ammonium chloride. In Figure 1C, how did the authors arrive at the 15-hour time point? The correlation does not appear as the authors claim. Where is the 15-hour sampling time point for MG132 or CHX chase? The experimental approach to screen the E2/E3 Ub ligase is appreciated.

      Thank you for your valuable feedback! Regarding your questions, we have made the following revisions:

      On the expression of nsp6, nsp10, nsp11, and nsp12 in Figure 1:

      We have replaced the blots for nsp10, nsp11, and nsp12 with higher exposure blots. However, due to the strong expression of NSP14, we were unable to generate a higher exposure blot for nsp6. Based on the current exposure, it is clear that nsp6 is not regulated by the proteasome. Additionally, in the high-exposure blot for nsp12, we were able to observe its expression and found that this protein is weakly regulated by MG132. Following your suggestion, we have labeled the molecular weights of the proteins in the figure.

      On the densitometric analysis of nsp4 protein:

      We recalculated the densitometric analysis for nsp4 and found no issues. Although the band intensities do not show large changes, the relative fold changes appear more pronounced because we normalized the data using GAPDH as an internal control. We have added a detailed description to the figure legend.
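To illustrate why loading-control normalization can make a fold change look larger than the raw band intensities suggest, here is a minimal sketch. The function name and the intensity values are hypothetical and are not taken from the manuscript; it only assumes the common convention of dividing the target band by the GAPDH band before comparing conditions.

```python
def normalized_fold_change(target_treated, loading_treated,
                           target_control, loading_control):
    """Fold change of a target band after normalizing each lane to its loading control.

    All arguments are band intensities (arbitrary units), e.g. as measured in ImageJ.
    """
    treated_ratio = target_treated / loading_treated   # target relative to GAPDH, treated lane
    control_ratio = target_control / loading_control   # target relative to GAPDH, control lane
    return treated_ratio / control_ratio


# Made-up intensities: the raw target band rises only from 100 to 120,
# but the GAPDH band in the treated lane is weaker (80 vs. 100).
fold = normalized_fold_change(120.0, 80.0, 100.0, 100.0)
print(f"normalized fold change: {fold:.2f}")
```

In this hypothetical case the raw increase is 1.2-fold, while the normalized value is larger because the treated lane carried less loading control.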

      On the NH4Cl control:

      In this experiment, ammonium chloride was dissolved in DMSO. We reviewed the solubility data and found that ammonium chloride has a solubility of 50 mg/ml in DMSO, which is sufficient to reach the concentrations used in our experiment. While the solubility is higher in water, we believe that DMSO is an appropriate solvent for this compound in our context.

      On the 15-hour time point in Figure 1C:

      Regarding the 15-hour time point mentioned in Figure 1C, we did not collect samples at that time. We performed semi-quantitative analysis of protein levels at different time points using ImageJ and estimated the half-life time point based on the half-life calculation formula. Thank you for your suggestion; we will clarify this in the figure legend.
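The response does not state which half-life formula was used. As a hedged illustration only, the sketch below assumes first-order (exponential) decay, fits a straight line to the log-transformed band intensities, and returns ln(2)/k. The function name and the data are hypothetical. It shows how an estimated half-life can fall at a time that was never sampled.

```python
import math


def estimate_half_life(times_h, intensities):
    """Estimate a half-life (hours) assuming I(t) = I0 * exp(-k * t).

    Fits ln(I) against t by ordinary least squares and returns ln(2) / k.
    """
    logs = [math.log(v) for v in intensities]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    k = -(sxy / sxx)  # decay rate is the negative slope of the fitted line
    if k <= 0:
        raise ValueError("intensities do not decay over time")
    return math.log(2) / k


# Made-up relative intensities from a chase experiment sampled at 0, 4, 8 and 12 h.
times = [0, 4, 8, 12]
signal = [1.0, 0.83, 0.69, 0.57]
print(f"estimated half-life: {estimate_half_life(times, signal):.1f} h")
```

With data like these, the fitted half-life lies beyond the last sampled time point, so it is an extrapolation from the fitted curve rather than a measured sample.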

      Once again, thank you for your thoughtful review and constructive suggestions. We have made the necessary revisions and improvements to the figures based on your feedback.

      (2) In Figure 2: I do not find a reason to include DMSO control in the siRNAs for E2/E3 Ub. Please justify why it is necessary. It is requested to include WB for the siRNA-treated samples. It is strongly recommended to show the WB data for siRNA-treated samples because you are showing siRNA treatment of MARCHF7 in shUBR5 cells and vice versa. However, if antibodies for corresponding targets are not available, qPCR can be shown in graphical representation in supplementary data indicating the siRNA target region and qPCR target. Show a graphical representation of domains/ deleted regions of MARCHF7 and UBR5.

      Thank you for your valuable feedback! We have addressed your concerns as follows:

      On the inclusion of the DMSO control group:

      The DMSO group was initially included as a control for the MG132-treated group. By comparing with the MG132 group, we aimed to observe whether nsp16 levels were restored by MG132 treatment. Additionally, in siRNA knockdown experiments, the DMSO group was included to compare nsp16 protein levels after knockdown with those in the NC group, as well as to assess differences in nsp16 restoration between MG132 treatment and factor knockdown. However, we acknowledge some issues in the control design. To address this, we have redesigned and conducted the experiments with improved controls (Figure 2A).

      On validating knockdown efficiency:

      We have included Western blot data for UBR5 and MARCHF7 knockdown efficiencies. For other factors where specific antibodies were unavailable, we followed your suggestion and provided graphical representations in the Appendix-figure S1, illustrating the siRNA target regions and qPCR target sites to confirm knockdown specificity and efficiency.

      (3) In Figure 4 A: Write details on how this IP was done. What was the transfection time of this plasmid? Is the transfection time different from that of NSP16 in Figure 1A which shows a significant degradation of NSP16? Please discuss this in detail. It is recommended that this IP be done in +/- MG132. Since you have used siRNA and performed an IP, it is recommended to repeat the IP (with +/- MG132) using the MARCHF7 and UBR5 plasmids.

      Thank you for your detailed review and suggestions! We have addressed your concerns as follows:

      On the specific protocol for the co-IP in Figure 3A:

      The detailed protocol for the immunoprecipitation (IP) experiment is as follows: on day 1, cells were plated, and on day 2, we co-transfected nsp16 and Ub expression plasmids. After 32 hours of transfection, we treated the cells with MG132 for 16 hours, then harvested the cells for IP. We included MG132 treatment in all ubiquitination IP experiments because, without MG132, nsp16 would be degraded, preventing us from observing changes in ubiquitination levels. We apologize for not clearly labeling this in the figure, and we have made the necessary modifications.

      On the use of MG132 and NSP16 degradation:

      Following your suggestion, we have clarified the use of MG132 in the IP experiments, which differs from the degradation of nsp16 shown in Figure 1A. In Figure 1A, we show the degradation of nsp16 in the absence of MG132 treatment.

      On the overexpression of UBR5 and MARCHF7:

      The effect of overexpressing UBR5 or MARCHF7 on ubiquitination has been validated in Figure 4 supplement 2. In these experiments, we explored the effect of UBR5 activity domain inactivation on nsp16 ubiquitination, as well as the effect of MARCHF7 truncation on nsp16 ubiquitination modification. In these experiments, overexpression of the wild-type E3 ligases was also included, and the results yielded the same conclusions as those from the E3 knockdown experiments, thereby validating the robustness of our findings.

      (4) In Figure 4C: Appropriate controls are missing. The authors claim NSP16 is ubiquitinated and degraded by UBR5 and MARCHF7 via K27 and K48 chains. There is no NSP16 Only control. We cannot compare the NSP16 without an NSP16 transfection. I will suggest the authors repeat these individual controls in both the presence and absence of MG132.

      Thank you for your careful review and valuable suggestion! In response to your comment, we have redesigned the experiment and added a control group without nsp16 transfection. We have repeated the validation in the presence of MG132. Without MG132 treatment, nsp16 is degraded, leading to very low protein levels, making it difficult to observe the phenomenon. We have updated the figure accordingly and made the necessary adjustments based on your suggestion (Figure 3E-F).

      (5) In my opinion, Figure 8 needs modification. It is requested to show the levels of strand-specific viral mRNA under UBR5 and MARCHF7 knock-down in +/- MG132. This figure should also be supported by WB indicating the level of NSP16 (capping activity) and any of the viral proteins. This may validate that if the capping activity is lost, viral translation is affected and hence there is a reduction in virus titre. Alternatively, the figure can be modified by putting a sub-heading box over 7mGppA-RNA section and marking it as a future direction/ hypothesis.

      Thank you for your thorough and thoughtful review! Regarding the modification of Figure 8, we completely agree with your suggestion. Currently, examining the impact of viral RNA cap modification is technically challenging for us. Therefore, we have followed your advice and marked the investigation of how nsp16 degradation affects viral RNA cap structures as a future direction/hypothesis in the schematic of Figure 8. This revision helps provide direction for future experiments and enhances the clarity of the figure. Thank you for your thoughtful consideration and valuable suggestion!

      Minor points:

      (1) Figure 2A: Align NSP16 Blot to actin.

      Thank you for your constructive feedback! We have redesigned the experiment and included an MG132 treatment group in Figure 2A. Consequently, the figure has been revised comprehensively, and the nsp16 blot has been aligned with tubulin.

      (2) Figure 2C: It is recommended to properly align the lanes where the pLKO and shRNA labelling are overlapping.

      Thank you for your thoughtful suggestion! We have revised Figure 2C based on your recommendation to ensure that the pLKO and shRNA labeling no longer overlap. We sincerely apologize for any confusion this may have caused and appreciate your understanding and support.

      (3) Just a curious question, what happens if we silence both UBR5 and MARCHF7 and check for virus titre? This is an additional work, but if the authors do not agree, it is ok.

      Thank you for your valuable suggestion! Regarding your question about silencing both UBR5 and MARCHF7, we indeed attempted to generate knockout cell lines, but unfortunately, we were not successful at this stage. We plan to explore alternative methods to establish stable knockout cell lines in our future experiments. Meanwhile, as shown in Figure 5 supplement 1, we have performed experiments where both UBR5 and MARCHF7 were knocked down simultaneously, followed by infection with virus-like particles. The results indicate that dual knockdown further enhances viral replication. These findings may partially address your question. Thank you again for your insightful suggestion!

    1. eLife Assessment

      This study provides a valuable contribution to our understanding of causal inference in visual perception. The evidence provided through multiple well-designed psychophysical experiments is convincing. The current study targets very specific visual features of launch events; future work will be able to build on this to study the implementation of causal inference in general.

    2. Reviewer #1 (Public review):

      Summary:

      The authors investigated causal inference in the visual domain through a set of carefully designed experiments, and sound statistical analysis. They suggest the early visual system has a crucial contribution to computations supporting causal inference.

      Strengths:

      (1) I believe the authors target an important problem (causal inference) with carefully chosen tools and methods. Their analysis rightly implies the specialization of visual routines for causal inference and the crucial contribution of early visual systems to perform this computation. I believe this is a novel contribution and their data and analysis are in the right direction.

      (2) Authors sufficiently discuss the alternative perspective to causal inference.

      (3) The authors also expand the discussions beyond pure psychophysics and also include neural aspects.

      Weaknesses:

      I would not call them weaknesses, perhaps a different perspective:

      (1) The authors argue for a mere bottom-up contribution of early sensory areas to causal inference. Certainly, as the authors suggested, early sensory areas make a crucial contribution, and the authors extend this to other possibilities in their discussion (though mainly for more complex scenarios). I would say that, even in simple cases, we can still consider the effect of top-down processes. This particularly makes sense in light of recent studies, which progressively suggest that perception is an active process that also weighs in top-down cognitive contributions strongly. For instance, the simplest cases of perception have been conceptualized along this line (Martin, Solms, and Sterzer 2021), as have some visual illusions (Safavi and Dayan 2022) and other extensions (Kay et al. 2023). Thus, I believe it would be helpful to extend the discussion on the top-down and cognitive contributions of causal inference (of course that can also be hinted at, based on recent developments). Even adaptation, which is central in this study, can be influenced by top-down factors (Keller et al. 2017).

      Lastly, I hope the authors find this review helpful. I generally want to try to end all of my reviews with areas of the paper I liked because I think this should be part of the feedback. Certainly, there were many in this manuscript as well (clever questions, experimental design and statistical analysis) that I had to highlight further. I congratulate the authors again on their manuscript and hope they will find it helpful.

      Bibliography

      Aller, Mate, and Uta Noppeney. 2018. "To Integrate or Not to Integrate: Temporal Dynamics of Bayesian Causal Inference." bioRxiv, December, 504118.

      Cao, Yinan, Christopher Summerfield, Hame Park, Bruno Lucio Giordano, and Christoph Kayser. 2019. "Causal Inference in the Multisensory Brain." Neuron 102 (5): 1076-87.e8.

      Coen, Philip, Timothy P. H. Sit, Miles J. Wells, Matteo Carandini, and Kenneth D. Harris. 2021. "The Role of Frontal Cortex in Multisensory Decisions." bioRxiv, April. Cold Spring Harbor Laboratory, 2021.04.26.441250.

      Kay, Kendrick, Kathryn Bonnen, Rachel N. Denison, Mike J. Arcaro, and David L. Barack. 2023. "Tasks and Their Role in Visual Neuroscience." Neuron 111 (11). Elsevier: 1697-1713.

      Keller, Andreas J, Rachael Houlton, Björn M Kampa, Nicholas A Lesica, Thomas D Mrsic-Flogel, Georg B Keller, and Fritjof Helmchen. 2017. "Stimulus Relevance Modulates Contrast Adaptation in Visual Cortex." eLife 6. eLife Sciences Publications, Ltd: e21589.

      Kording, K. P., U. Beierholm, W. J. Ma, S. Quartz, J. B. Tenenbaum, and L. Shams. 2007. "Causal Inference in Multisensory Perception." PLoS One 2: e943.

      Martin, Joshua M., Mark Solms, and Philipp Sterzer. 2021. "Useful Misrepresentation: Perception as Embodied Proactive Inference." Trends Neurosci. 44 (8): 619-28.

      Safavi, Shervin, and Peter Dayan. 2022. "Multistability, Perceptual Value, and Internal Foraging." Neuron, August.

      Shams, L. 2012. "Early Integration and Bayesian Causal Inference in Multisensory Perception." In The Neural Bases of Multisensory Processes, edited by M. M. Murray and M. T. Wallace. Frontiers in Neuroscience. Boca Raton (FL).

      Shams, Ladan, and Ulrik Beierholm. 2022. "Bayesian Causal Inference: A Unifying Neuroscience Theory." Neuroscience & Biobehavioral Reviews 137 (June): 104619.

    3. Reviewer #2 (Public review):

      This paper seeks to determine whether the human visual system's sensitivity to causal interactions is tuned to specific parameters of a causal launching event, using visual adaptation methods. The three parameters the author investigates in this paper are the direction of motion in the event, the speed of the objects in the event, and surface features or identity of the objects in the event (in particular, having two objects of different color).

      The key method, visual adaptation to causal launching, has now been demonstrated by at least three separate groups and seems to be a robust phenomenon. Adaptation is a strong indicator of a visual process that is tuned to a specific feature of the environment, in this case launching interactions. Whereas other studies have focused on retinotopically-specific adaptation (i.e., whether the adaptation effect is restricted to the same test location on the retina as the adaptation stream was presented to), this one focuses on feature-specificity.

      The first experiment replicates the adaptation effect for launching events as well as the lack of an adaptation effect for a minimally different non-causal 'slip' event. However, it also finds that the adaptation effect does not work for launching events whose direction of motion differs by more than 30 degrees from the direction of the test event. The interpretation is that the system that is being adapted is sensitive to the direction of this event, which is an interesting and somewhat puzzling result given the methods used in previous studies, which have used random directions of motion for both adaptation and test events.

      The obvious interpretation would be that past studies have simply adapted to launching in every direction, but that in itself says something about the nature of this direction-specificity: it is not working through opposed detectors. For example, in something like the waterfall illusion adaptation effect, where extended exposure to downward motion leads to illusory upward motion on neutral-motion stimuli, the effect simply doesn't work if motion in two opposed directions is shown (i.e., you don't see illusory motion in both directions, you just see nothing). The fact that adaptation to launching in multiple directions doesn't seem to cancel out the adaptation effect in past work raises interesting questions about how directionality is being coded in the underlying process. In addition, one limitation of the current method is that it's not clear whether the motion-direction-specificity is also itself retinotopically-specific, that is, if one retinotopic location were adapted to launching in one direction and a different retinotopic location adapted to launching in the opposite direction, would each test location show the adaptation effect only for events in the direction presented at that location?

      The second experiment tests whether the adaptation effect is similarly sensitive to differences in speed. The short answer is no; adaptation events at one speed affect test events at another. Furthermore, this is not surprising given that Kominsky & Scholl (2020) showed adaptation transfer between events with differences in speeds of the individual objects in the event (whereas all events in this experiment used symmetrical speeds). This experiment is still novel and it establishes that the speed-insensitivity of these adaptation effects is fairly general, but I would certainly have been surprised if it had turned out any other way.

      The third experiment tests color (as a marker of object identity), and pits it against motion direction. The results demonstrate that adaptation to red-launching-green generates an adaptation effect for green-launching-red, provided they are moving in roughly the same direction, which provides a nice internal replication of Experiment 1 in addition to showing that the adaptation effect is not sensitive to object identity. This result forms an interesting contrast with the infant causal perception literature. Multiple papers (starting with Leslie & Keeble, 1987) have found that 6-8-month-old infants are sensitive to reversals in causal roles exactly like the ones used in this experiment. The success of adaptation transfer suggests, very clearly, that this sensitivity is not based only on perceptual processing, or at least not on the same processing that we access with this adaptation procedure. It implies that infants may be going beyond the underlying perceptual processes and inferring genuine causal content. This is also not the first time the adaptation paradigm has diverged from infant findings: Kominsky & Scholl (2020) found a divergence with the object speed differences as well, as infants categorize these events based on whether the speed ratio (agent:patient) is physically plausible (Kominsky et al., 2017), while the adaptation effect transfers from physically implausible events to physically plausible ones. This only goes to show that these adaptation effects don't exhaustively capture the mechanisms of early-emerging causal event representation.

      One overarching point about the analyses to take into consideration: The authors use a Bayesian psychometric curve-fitting approach to estimate a point of subjective equality (PSE) in different blocks for each individual participant based on a model with strong priors about the shape of the function and its asymptotic endpoints, and this PSE is the primary DV across all of the studies. As discussed in Kominsky & Scholl (2020), this approach has certain limitations, notably that it can generate nonsensical PSEs when confronted with relatively extreme response patterns. The authors mentioned that this happened once in Experiment 3, and that participant had to be replaced. An alternate approach is simply to measure the proportion of 'pass' reports overall to determine if there is an adaptation effect. The results here do not change based on which analytical strategy is used, which ultimately just goes to show that the effects are very robust.
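The two analysis strategies contrasted above can be sketched roughly as follows. This is not the authors' Bayesian procedure. It is a simplified stand-in that fits a two-parameter logistic function by grid-search maximum likelihood, with no priors and no lapse parameters, and then computes the overall proportion of 'pass' reports. All names and counts are hypothetical.

```python
import math


def logistic(x, pse, scale):
    """Probability of a 'pass' report at disc overlap x."""
    return 1.0 / (1.0 + math.exp(-(x - pse) / scale))


def fit_pse(overlaps, n_pass, n_trials, pse_grid, scale_grid):
    """Return the (pse, scale) pair on the grid that maximizes the binomial log-likelihood."""
    best = None
    for pse in pse_grid:
        for scale in scale_grid:
            ll = 0.0
            for x, k, n in zip(overlaps, n_pass, n_trials):
                # Clamp to avoid log(0) at extreme predicted probabilities.
                p = min(max(logistic(x, pse, scale), 1e-9), 1.0 - 1e-9)
                ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
            if best is None or ll > best[0]:
                best = (ll, pse, scale)
    return best[1], best[2]


def proportion_pass(n_pass, n_trials):
    """Model-free alternative: overall proportion of 'pass' reports."""
    return sum(n_pass) / sum(n_trials)


# Made-up counts of 'pass' reports out of 20 trials at five overlap levels.
overlaps = [0.0, 0.25, 0.5, 0.75, 1.0]
n_pass = [1, 5, 10, 15, 19]
n_trials = [20] * 5
pse_grid = [i / 100 for i in range(101)]
scale_grid = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3]

pse, scale = fit_pse(overlaps, n_pass, n_trials, pse_grid, scale_grid)
print(f"PSE: {pse:.2f}, overall proportion 'pass': {proportion_pass(n_pass, n_trials):.2f}")
```

The proportion measure needs no curve fit, so it cannot produce an out-of-range PSE when a participant's responses are nearly all 'launch' or nearly all 'pass'.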

      In general, this paper adds further evidence for something like a 'launching' detector in the visual system, but beyond that it specifies some interesting questions for future work about how exactly such a detector might function.

      Kominsky, J. F., & Scholl, B. J. (2020). Retinotopic adaptation reveals distinct categories of causal perception. Cognition, 203, 104339. https://doi.org/10.1016/j.cognition.2020.104339

      Kominsky, J. F., Strickland, B., Wertz, A. E., Elsner, C., Wynn, K., & Keil, F. C. (2017). Categories and Constraints in Causal Perception. Psychological Science, 28(11), 1649-1662. https://doi.org/10.1177/0956797617719930

      Leslie, A. M., & Keeble, S. (1987). Do six-month-old infants perceive causality? Cognition, 25(3), 265-288. https://doi.org/10.1016/S0010-0277(87)80006-9

    4. Reviewer #3 (Public review):

      Summary:

      This paper presents evidence from three behavioral experiments that causal impressions of "launching events", in which one object is perceived to cause another object to move, depend on motion direction-selective processing. Specifically, the work uses an adaptation paradigm (Rolfs et al., 2013), presenting repetitive patterns of events matching certain features to a single retinal location, then measuring subsequent perceptual reports of a test display in which the degree of overlap between two discs was varied, and participants could respond "launch" or "pass". The three experiments report results of adapting to motion direction, motion speed and "object identity", and examine how the psychometric curves for causal reports shift in these conditions depending on the similarity of adapter and test. While causality reports in the test display were selective for motion direction (Experiment 1), they were not selective for adapter-test speed differences (Experiment 2) nor for changes in object identity induced via color swap (Experiment 3). These results support the notion of a biological implementation of causality perception in the visual system, possibly even independently of computations of object identity.

      Strengths:

      The setup of the research question and hypotheses are exceptional. The authors thoroughly discuss relevant literature to clearly link their launch/pass paradigm to impressions of causality, strengthening their hypothesis and conclusions. The experiments are carefully performed (appropriate equipment, careful control of eye movements). The slip adaptor is a really nice control condition and effectively mitigates the need to control for motion direction with a drifting grating or similar. Participants were measured with sufficient precision, and a power curve analysis was conducted to determine the sample size. Data analysis and statistical quantification is appropriate. Data and analysis code will be shared on publication, in keeping with open science principles. The paper is concise and well written.

      Weaknesses:

      I would like to emphasise that in the employed paradigm and in previously conducted similar studies, the only report options are "launch" or "pass". As pointed out in the authors' reply, the adaptation to launches seems to be a highly specific process and is likely a consequence of the causal interaction between the objects. I would nonetheless be interested to see which of the stimulus features driving the adaptation effect observed here are relevant/irrelevant to subjective causal impressions in an experiment.

      References:

      Rolfs, M., Dambacher, M., & Cavanagh, P. (2013). Visual Adaptation of the Perception of Causality. Current Biology, 23(3), 250-254. https://doi.org/10.1016/j.cub.2012.12.017

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: 

      The authors investigated causal inference in the visual domain through a set of carefully designed experiments, and sound statistical analysis. They suggest the early visual system has a crucial contribution to computations supporting causal inference. 

      Strengths: 

      I believe the authors target an important problem (causal inference) with carefully chosen tools and methods. Their analysis rightly implies the specialization of visual routines for causal inference and the crucial contribution of early visual systems to perform this computation. I believe this is a novel contribution and their data and analysis are in the right direction. 

      Weaknesses: 

      In my humble opinion, a few aspects deserve more attention: 

      (1) Causal inference (or causal detection) in the brain should be quite fundamental and quite important for human cognition/perception. Thus, the underlying computation and neural substrate might not be limited to the visual system (I don't mean the authors did claim that). In fact, to the best of my knowledge, multisensory integration is one of the best-studied perceptual phenomena that has been conceptualized as a causal inference problem.

      Assuming the causal inference in those studies (Shams 2012; Shams and Beierholm 2022; Kording et al. 2007; Aller and Noppeney 2018; Cao et al. 2019) (and many more e.g., by Shams and colleagues), and the current study might share some attributes, one expects some findings in those domains are transferable (at least to some degree) here as well. Most importantly, underlying neural correlates that have been suggested based on animal studies and invasive recording that has been already studied, might be relevant here as well.

      Perhaps the most relevant one is the recent work from the Harris group on mice (Coen et al. 2021). I should emphasize, that I don't claim they are necessarily relevant, but they can be relevant given their common roots in the problem of causal inference in the brain. This is a critical topic that the authors may want to discuss in their manuscript. 

      We thank the reviewer. We addressed this point of the public review in our reply to the reviewer’s suggestions (and add it here again for convenience). The literature on the role of occipital, parietal and frontal brain areas in causal inference is also addressed in the response to point 3 of the public review.

      “We used visual adaptation to carve out a bottom-up visual routine for detecting causal interactions in form of launching events. However, we know that more complex behaviors of perceiving causal relations can result from integrating information across space (e.g., in causal capture; Scholl & Nakayama, 2002), across time (postdictive influence; Choi & Scholl, 2006), and across sensory modalities (Sekuler, Sekuler, & Lau, 1997). Bayesian causal inference has been particularly successful as a normative framework to account for multisensory integration (Körding et al., 2007; Shams & Beierholm, 2022). In that framework, the evidence for a common-cause hypothesis is competing with the evidence for an independent-causes hypothesis (Shams & Beierholm, 2022). The task in our experiments could be similarly formulated as two competing hypotheses for the second disc’s movement (i.e., the movement was caused by the first disc vs. the movement occurred autonomously). This framework also emphasizes the distributed nature of the neural implementation for solving such inferences, showing the contributions of parietal and frontal areas in addition to sensory processing (for review see Shams & Beierholm, 2022). Moreover, even visual adaptation to contrast in mouse primary visual cortex is influenced by top-down factors such as behavioral relevance— suggesting a complex implementation of the observed adaptation results (Keller et al. 2017). The present experiments, however, presented purely visual events that do not require an integration across processing domains. Thus, the outcome of our suggested visual routine can provide initial evidence from within the visual system for a causal relation in the environment that may then be integrated with signals from other domains (e.g., auditory signals). Determining exactly how the perception of causality relates to mechanisms of causal inference and the neural implementation thereof is an exciting avenue for future research. 
Note, however, that perceived causality can be distinguished from judged causality: Even when participants are aware that a third variable (e.g., a color change) is the best predictor of the movement of the second disc in launching events, they still perceive the first disc as causing the movement of the second disc (Schlottmann & Shanks, 1992).”
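The two-hypothesis formulation mentioned in the excerpt above can be made concrete with Bayes' rule. The sketch below is only an illustration of that general framework, not a model from the manuscript. The function name, the likelihood values and the prior are hypothetical.

```python
def posterior_caused(likelihood_caused, likelihood_autonomous, prior_caused=0.5):
    """Posterior probability that the second disc's movement was caused by the first disc.

    likelihood_caused:     P(observed event | movement was caused by the first disc)
    likelihood_autonomous: P(observed event | movement occurred autonomously)
    prior_caused:          prior probability of the 'caused' hypothesis
    """
    evidence_caused = likelihood_caused * prior_caused
    evidence_autonomous = likelihood_autonomous * (1.0 - prior_caused)
    return evidence_caused / (evidence_caused + evidence_autonomous)


# Made-up likelihoods for a test event with little overlap between the two discs.
print(f"P(caused | event) = {posterior_caused(0.8, 0.2):.2f}")
```

In this reading, each hypothesis is weighted by how well it predicts the observed event and by its prior, and the two weights are normalized to sum to one.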

      (2) If I understood correctly, the authors are arguing pro a mere bottom-up contribution of early sensory areas for causal inference (for instance, when they wrote "the specialization of visual routines for the perception of causality at the level of individual motion directions raises the possibility that this function is located surprisingly early in the visual system *as opposed to a higher-level visual computation*."). Certainly, as the authors suggested, early sensory areas have a crucial contribution; however, it may not be limited to that. Recent studies progressively suggest perception as an active process that also weighs in strongly the top-down cognitive contributions. For instance, the most simple cases of perception have been conceptualized along this line (Martin, Solms, and Sterzer 2021) and even some visual illusions (Safavi and Dayan 2022), and other extensions (Kay et al. 2023). Thus, I believe it would be helpful to extend the discussion on the top-down and cognitive contributions of causal inference (of course that can also be hinted at, based on recent developments). Even adaptation, which is central in this study, can be influenced by top-down factors (Keller et al. 2017). I believe, based on other work of Rolfs and colleagues, this is also aligned with their overall perspective on vision.

      Indeed, we assessed bottom-up contributions to the perception of a causal relation. We agree with the reviewer that in more complex situations, for instance, in the presence of contextual influences or additional auditory signals, the perception of a causal relation may not be limited to bottom-up vision. While we had acknowledged this in the original manuscript (see excerpts below), we now make it even more explicit:

      “[…] we know that more complex behaviors of perceiving causal relations can result from integrating information across space (e.g., in causal capture; Scholl & Nakayama, 2002), across time (postdictive influence; Choi & Scholl, 2006), and across sensory modalities (Sekuler, Sekuler, & Lau, 1997).”

      “[…] Neurophysiological studies support the view of distributed neural processing underlying sensory causal interactions with the visual system playing a major role.”

      “[…] Interestingly, single cell recordings in area F5 of the primate brain revealed that motor areas are contributing to the perception of causality (Caggiano et al., 2016; Rolfs, 2016), emphasizing the distributed nature of the computations underlying causal interactions. This finding also stresses that the detection, and the prediction, of causality is essential for processes outside sensory systems (e.g., for understanding others’ actions, for navigating, and for avoiding collisions). The neurophysiology subserving causal inference further extends the candidate cortical areas that might contribute to the detection of causal relations, emphasizing the role of the frontal cortex for the flexible integration of multisensory representations (Cao et al., 2019; Coen et al., 2023).”

      However, there is also ample evidence that the perception of a simple causal relation, as we studied it in our experiments, escapes top-down cognitive influences. The perception of causality in launching events is described as automatic and irresistible, meaning that participants have the spontaneous impression of a causal relation and typically do not voluntarily switch between a causal and a noncausal percept. This irresistibility has led several authors to discuss a modular organization underlying the detection of such events (Michotte, 1963; Scholl & Tremoulet, 2000). This view is further supported by a study that experimentally manipulated the contingencies between the movement of the two discs (Schlottmann & Shanks, 1992). In one condition the authors created a launching event where the second disc’s movement was perfectly correlated with a color change, but only sometimes coincided with the first disc’s movement offset. Nevertheless, participants reported seeing that the first disc caused the movement of the second disc (regardless of the stronger statistical relationship with the color change). However, when asked to make conscious causal judgments, participants were aware of the color change as the true cause of the second disc’s motion, thereby recognizing its more reliable correlation. This study strongly suggests that perceived and judged causality (i.e., cognitive causal inference) can be dissociated (Schlottmann & Shanks, 1992). We have added this reference in the revised manuscript. Overall, we argue that our study focused on a visual routine that could be implemented in a simple bottom-up fashion, but we acknowledge throughout the manuscript that in a more complex situation (e.g., integrating information from other sensory domains) the implementation could be realized in a more distributed fashion, including top-down influences as in multisensory integration. However, it is important to stress that these potential top-down influences would be automatic and should not be confused with voluntary cognitive influences.

      “Note, however, that perceived causality can be distinguished from judged causality (Schlottmann & Shanks, 1992). Even when participants are aware that a third variable (e.g., a color change) is the best predictor of the movement of the second disc in launching events, they still perceive the first disc as causing the movement of the second disc (Schlottmann & Shanks, 1992).”

      (3) The authors rightly implicate the neural substrate of causal inference in the early sensory system. Given their study is pure psychophysics, a more elaborate discussion based on other studies that used brain measurements is needed (in my opinion) to put into perspective this conclusion. In particular, as I mentioned in the first point, the authors mainly discuss the potential neural substrate of early vision, however much has been done about the role of higher-tier cortical areas in causal inference e.g., see (Cao et al. 2019; Coen et al. 2021). 

      In the revised manuscript, we addressed the limitations of a purely psychophysical approach and acknowledged alternative implementations in the Discussion section.

      “Note that, while the present findings demonstrate direction-selectivity, it remains unclear where exactly that visual routine is located. As pointed out, it is also possible that the visual routine is located higher up in the visual system (or distributed across multiple levels) and is only using a directional-selective population response as input.”

      Moreover, we also cite the two suggested papers when referring to the role of cortical areas in causal inference (Cao et al., 2019; Coen et al., 2023):

      “Neurophysiological studies support the view of distributed neural processing underlying sensory causal interactions with the visual system playing a major role. Imaging studies in particular revealed a network for the perception of causality that is also involved in action observation (Blakemore et al., 2003; Fonlupt, 2003; Fugelsang et al., 2005; Roser et al., 2005). The fact that visual adaptation of causality occurs in a retinotopic reference frame emphasizes the role of retinotopically organized areas within that network (e.g., V5 and the superior temporal sulcus). Interestingly, single cell recordings in area F5 of the primate brain revealed that motor areas are contributing to the perception of causality (Caggiano et al., 2016; Rolfs, 2016), emphasizing the distributed nature of the computations underlying causal interactions, and also stressing that the detection, and the prediction, of causality is essential for processes outside purely sensory systems (e.g., for understanding others’ actions, for navigating, and for avoiding collisions). The neurophysiological underpinnings of causal inference further extend the candidate cortical areas that might contribute to the detection of causal relations, emphasizing the role of the frontal cortex for the flexible integration of multisensory representations (Cao et al., 2019; Coen et al., 2023).”

      There were many areas in this manuscript that I liked: clever questions, experimental design, and statistical analysis.

      Thank you so much.

      Reviewer #1 (Recommendations for the authors):

      I congratulate the authors again on their manuscript and hope they will find my review helpful. Most of my notes are suggestions to the authors, and I hope will help them to improve the manuscript. None are intended to devalue their (interesting) work. 

      We would like to thank the reviewer for their thoughtful and encouraging comments.

      In the following, I use the pX-lY template to refer to a particular location in the manuscript: page number X (pX) and line number Y (lY).

      Major concerns and suggestions 

      - I would suggest simplifying the abstract and significance statement or putting more background in it. It's hard (at least for me) to understand if one is not familiar with the task used in this study. 

      We followed the reviewer’s suggestion and added more background in the beginning of the abstract. 

      We made the following changes:

      “Detecting causal relations structures our perception of events in the world. Here, we determined for visual interactions whether generalized (i.e., feature-invariant) or specialized (i.e., feature-selective) visual routines underlie the perception of causality. To this end, we applied a visual adaptation protocol to assess the adaptability of specific features in classical launching events of simple geometric shapes. We asked observers to report whether they observed a launch or a pass in ambiguous test events (i.e., the overlap between two discs varied from trial to trial). After prolonged exposure to causal launch events (the adaptor) defined by a particular set of features (i.e., a particular motion direction, motion speed, or feature conjunction), observers were less likely to see causal launches in subsequent ambiguous test events than before adaptation. Crucially, adaptation was contingent on the causal impression in launches as demonstrated by a lack of adaptation in non-causal control events. We assessed whether this negative aftereffect transfers to test events with a new set of feature values that were not presented during adaptation. Processing in specialized (as opposed to generalized) visual routines predicts that the transfer of visual adaptation depends on the feature-similarity of the adaptor and the test event. We show that negative aftereffects do not transfer to unadapted launch directions but do transfer to launch events of different speed. Finally, we used colored discs to assign distinct feature-based identities to the launching and the launched stimulus. We found that the adaptation transferred across colors if the test event had the same motion direction as the adaptor. In summary, visual adaptation allowed us to carve out a visual feature space underlying the perception of causality and revealed specialized visual routines that are tuned to a launch’s motion direction.”

      - The authors highlight the importance of studying causal inference and understanding the underlying mechanisms by probing adaptation, however, their introduction justifying that is, in my humble opinion, quite short. Perhaps in the cited paper, this is discussed extensively, but I'd suggest providing some elaboration in the manuscript. Otherwise, the study would be very specific to certain visual phenomena, rather than general mechanisms.  

      We have carefully considered the reviewer’s set of comments and concerns (e.g., the role of top-down influences, the contributions of the frontal cortex, and illustration of the computational level). They all appear to share the theme that the reviewer looks at our study from the perspective of Bayesian inference. We conducted the current study in the tradition of classical phenomena in the field of the perception of causality (following Michotte, 1963, and as reviewed in Scholl & Tremoulet, 2000), which aims to uncover the relevant visual parameters and rules for detecting causal relations in the visual domain. Indeed, we think that a causal inference perspective promises many new insights into the mechanisms underlying the classical phenomena described for the perception of causality. In the revised manuscript, we therefore discuss causal inference and how it relates to the current study. We now emphasize that a) we used visual adaptation to reveal the bottom-up processes that allow for the detection of a causal interaction in the visual domain, b) the perception of causality also integrates signals from other domains (which we do not study here), and c) the neural substrates underlying the perception of causality might be best described by a distributed network. By discussing Bayesian causal inference, we point out promising avenues for future research that may bridge the fields of the perception of causality and Bayesian causal inference. However, we also emphasize that perceived causality and judged causality can be dissociated (Schlottmann & Shanks, 1992).

      We added the following discussion:

      “We used visual adaptation to carve out a bottom-up visual routine for detecting causal interactions in the form of launching events. However, we know that more complex behaviors of perceiving causal relations can result from integrating information across space (e.g., in causal capture; Scholl & Nakayama, 2002), across time (postdictive influence; Choi & Scholl, 2006), and across sensory modalities (Sekuler, Sekuler, & Lau, 1997). Bayesian causal inference has been particularly successful as a normative framework to account for multisensory integration (Körding et al., 2007; Shams & Beierholm, 2022). In that framework, the evidence for a common-cause hypothesis is competing with the evidence for an independent-causes hypothesis (Shams & Beierholm, 2022). The task in our experiments could be similarly formulated as two competing hypotheses for the second disc’s movement (i.e., the movement was caused by the first disc vs. the second disc did not move). This framework also emphasizes the distributed nature of the neural implementation for solving such inferences, showing the contributions of parietal and frontal areas in addition to sensory processing (for review see Shams & Beierholm, 2022). Moreover, even visual adaptation to contrast in mouse primary visual cortex is influenced by top-down factors such as behavioral relevance, suggesting a complex implementation of the observed adaptation results (Keller et al. 2017). The present experiments, however, presented purely visual events that do not require an integration across processing domains. Thus, the outcome of our suggested visual routine can provide initial evidence from within the visual system for a causal relation in the environment that may then be integrated with signals from other domains (e.g., auditory signals). Determining exactly how the perception of causality relates to mechanisms of causal inference and the neural implementation thereof is an exciting avenue for future research.
      Note, however, that perceived causality can be distinguished from judged causality: Even when participants are aware that a third variable (e.g., a color change) is the best predictor of the movement of the second disc in launching events, they still perceive the first disc as causing the movement of the second disc (Schlottmann & Shanks, 1992).”
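
The competition between a common-cause and an independent-causes hypothesis described in the passage above can be illustrated with a minimal Python sketch of Bayes' rule. The function name, the likelihood values, and the flat prior are hypothetical choices made for illustration only; they are not taken from the manuscript or from the cited papers.

```python
def posterior_common_cause(like_common, like_independent, prior_common=0.5):
    """Posterior probability of the common-cause hypothesis via Bayes' rule.

    like_common      -- likelihood of the observation under a common cause
    like_independent -- likelihood under independent causes
    prior_common     -- prior probability of a common cause (assumed flat here)
    """
    evidence = (like_common * prior_common
                + like_independent * (1.0 - prior_common))
    if evidence == 0.0:
        raise ValueError("observation has zero probability under both hypotheses")
    return like_common * prior_common / evidence


# Hypothetical likelihoods: the observation is four times more probable
# under the common-cause hypothesis than under independent causes.
p_common = posterior_common_cause(0.8, 0.2)
```

With a flat prior the posterior simply follows the likelihood ratio; lowering `prior_common` shifts the result toward the independent-causes hypothesis.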

      - I'd suggest setting the context at the outset: your study of causal inference in the brain specifically targets the visual domain. If you like, connect it better in the discussion to general ideas about causal inference in the brain (like the works by Ladan Shams and colleagues).

      We would like to thank the reviewer for this comment. We followed the reviewer’s suggestion and made clear from the beginning that this paper is about the detection of causal relations in the visual domain. In the revised manuscript we write:

      “Here, we will study the mechanisms underlying the computations of causal interactions in the visual domain by capitalizing on visual adaptation of causality (Kominsky & Scholl, 2020; Rolfs et al., 2013). Adaptation is a powerful behavioral tool for discovering and dissecting a visual mechanism (Kohn, 2007; Webster, 2015) that provides an intriguing testing ground for the perceptual roots of causality.”

      As described in our reply to the previous comment, we now also discuss the ideas about causal inference.

      - To better illustrate the implication of your study on the computational level, I'd suggest putting it in the context of recent approaches to perception (point 2 of my public review). I think this is also aligned with the comment of Reviewer#3 on your line 32 (recommendation for authors).  

      In the revised manuscript, we now discuss the role of top-down influences in causal inference when addressing point 2 of the reviewer’s public review.

      Minor concerns and suggestions 

      - On p2-l3, I'd suggest providing a few examples of generalized and/or specialized visual routines (given the importance of the abstract). I only got it halfway through the introduction.

      We thank the reviewer for highlighting the need to better introduce the concept of a visual routine. We have chosen the term visual routine to emphasize that we locate the part of the mechanism that is affected by the adaptation in our experiments in the visual system. At the same time, the concept leaves space with respect to the extent to which the mechanism further involves mid- and higher-level processes. In the revised manuscript, we now refer to Ullman (1987) who introduced the concept of a visual routine—the idea of a modular operation that sequentially processes spatial and feature information. Moreover, we refer to the concept of attentional sprites (Cavanagh, Labianca, & Thornton, 2001)—attention-based visual routines that allow the visual system to semi-independently handle complex visual tasks (e.g., identifying biological motion).

      We add the following footnote to the introduction:

      “We use the term visual routine here to highlight that our adaptation experiments can reveal a causality detection mechanism that resides in the visual system. At the same time, calling it a routine emphasizes similarities with a local, semi-independent operation (e.g., the recognition of familiar motion patterns; see also Ullman, 1987; Cavanagh, Labianca, & Thornton, 2001) that can engage mid- and higher-level processes (e.g., during causal capture, Scholl & Nakayama, 2002; or multisensory integration, Körding et al., 2007).”

      In the abstract we now write:

      “Here, we determined for visual interactions whether generalized (i.e., feature-invariant) or specialized (i.e., feature-selective) visual routines underlie the perception of causality.”

      - On p4-l31, I'd suggest mentioning the Matlab version. I have experienced differences across different versions of Matlab (minor but still ...). 

      We added the Matlab version.

      - On p6-l46, the OSF link (which contains the data and code) is missing.

      Thank you. We made the OSF repository public and added the link to the revised manuscript.

      We added the following information to the revised manuscript.

      “The data analysis code has been deposited at the Open Science Framework and is publicly available at https://osf.io/x947m/.”

      Reviewer #2 (Public Review):

      This paper seeks to determine whether the human visual system's sensitivity to causal interactions is tuned to specific parameters of a causal launching event, using visual adaptation methods. The three parameters the authors investigate in this paper are the direction of motion in the event, the speed of the objects in the event, and the surface features or identity of the objects in the event (in particular, having two objects of different colors). The key method, visual adaptation to causal launching, has now been demonstrated by at least three separate groups and seems to be a robust phenomenon. Adaptation is a strong indicator of a visual process that is tuned to a specific feature of the environment, in this case launching interactions. Whereas other studies have focused on retinotopically specific adaptation (i.e., whether the adaptation effect is restricted to the same test location on the retina as the adaptation stream was presented to), this one focuses on feature specificity. 

      The first experiment replicates the adaptation effect for launching events as well as the lack of an adaptation effect for a minimally different non-causal 'slip' event. However, it also finds that the adaptation effect does not work for launching events whose direction of motion differs by more than 30 degrees from the direction of the test event. The interpretation is that the system that is being adapted is sensitive to the direction of this event, which is an interesting and somewhat puzzling result given the methods used in previous studies, which have used random directions of motion for both adaptation and test events.

      The obvious interpretation would be that past studies have simply adapted to launching in every direction, but that in itself says something about the nature of this direction-specificity: it is not working through opposed detectors. For example, in something like the waterfall illusion adaptation effect, where extended exposure to downward motion leads to illusory upward motion on neutral-motion stimuli, the effect simply doesn't work if motion in two opposed directions is shown (i.e., you don't see illusory motion in both directions, you just see nothing). The fact that adaptation to launching in multiple directions doesn't seem to cancel out the adaptation effect in past work raises interesting questions about how directionality is being coded in the underlying process. 

      We would like to thank the reviewer for that thoughtful comment. We added the described implication to the manuscript:

      “While the present study demonstrates direction-selectivity for the detection of launches, previous adaptation protocols demonstrated successful adaptation using adaptors with random motion direction (Rolfs et al., 2013; Kominsky & Scholl, 2020). These results therefore suggest independent direction-specific routines, in which adaptation to launches in one direction does not counteract an adaptation to launches in the opposite direction (as for example in opponent color coding).”

      In addition, one limitation of the current method is that it's not clear whether the motion direction-specificity is also itself retinotopically-specific, that is, if one retinotopic location were adapted to launching in one direction and a different retinotopic location adapted to launching in the opposite direction, would each test location show the adaptation effect only for events in the direction presented at that location? 

      This is an interesting idea! Because previous adaptation studies consistently showed retinotopic adaptation of causality, we would not expect to find transfer of directional tuning for launches to other locations. We agree that the suggested experiment on testing the reference frame of directional specificity constitutes an interesting future test of our findings.

      The second experiment tests whether the adaptation effect is similarly sensitive to differences in speed. The short answer is no; adaptation events at one speed affect test events at another. Furthermore, this is not surprising given that Kominsky & Scholl (2020) showed adaptation transfer between events with differences in speeds of the individual objects in the event (whereas all events in this experiment used symmetrical speeds). This experiment is still novel and it establishes that the speed-insensitivity of these adaptation effects is fairly general, but I would certainly have been surprised if it had turned out any other way. 

      We thank the reviewer for highlighting the link to an experiment reported in Kominsky & Scholl (2020). We report the finding of that experiment now in the revised manuscript.

      We added the following paragraph in the discussion:

      “For instance, we demonstrated a transfer of adaptation across speed for symmetrical speed ratios. This result complements a previous finding that reported that the adaptation to triggering events (with an asymmetric speed ratio of 1:3) resulted in significant retinotopic adaptation of ambiguous (launching) test events of different speed ratios (i.e., test events with a speed ratio of 1:1 and of 1:3; Kominsky & Scholl, 2020).”

      The third experiment tests color (as a marker of object identity), and pits it against motion direction. The results demonstrate that adaptation to red-launching-green generates an adaptation effect for green-launching-red, provided they are moving in roughly the same direction, which provides a nice internal replication of Experiment 1 in addition to showing that the adaptation effect is not sensitive to object identity. This result forms an interesting contrast with the infant causal perception literature. Multiple papers (starting with Leslie & Keeble, 1987) have found that 6-8-month-old infants are sensitive to reversals in causal roles exactly like the ones used in this experiment. The success of adaptation transfer suggests, very clearly, that this sensitivity is not based only on perceptual processing, or at least not on the same processing that we access with this adaptation procedure. It implies that infants may be going beyond the underlying perceptual processes and inferring genuine causal content. This is also not the first time the adaptation paradigm has diverged from infant findings: Kominsky & Scholl (2020) found a divergence with the object speed differences as well, as infants categorize these events based on whether the speed ratio (agent:patient) is physically plausible (Kominsky et al., 2017), while the adaptation effect transfers from physically implausible events to physically plausible ones. This only goes to show that these adaptation effects don't exhaustively capture the mechanisms of early-emerging causal event representation. 

      We would like to thank the reviewer for highlighting the similarities (and differences) to the seminal study by Leslie and Keeble (1987). We included a discussion with respect to that paper in the revised manuscript. Indeed, that study showed a recovery from habituation to launches after reversal of the launching events. In their study, the reversal condition resulted in a change of two aspects, 1) motion direction and 2) a change of what color is linked to either cause (i.e., agent) or effect (i.e., patient). Our study, based on visual adaptation in adults, suggests that switching the two colors is not necessary for a recovery from the habituation, provided the motion direction is reversed. Importantly, the reversal of the motion direction only affected the perception of causality after adapting to launches (but not to slip events), which is consistent with Leslie and Keeble’s (1987) finding that the effect of a reversal is contingent on habituation/adaptation to a causal relationship (and is not observed for non-causal delayed launches). Based on our findings, we predict that switching colors without changing the event’s motion direction would not result in a recovery from habituation. Obviously, for infants, color may play a more important role for establishing an object identity than it does for adults, which could explain potential differences. We also agree with the reviewer’s point that the adaptation protocol might tap into different mechanisms than revealed by habituation studies in infants (e.g., Kominsky et al., 2017 vs. Kominsky & Scholl, 2020).

      We revised the manuscript accordingly when discussing the role of direction selectivity in our study:

      “Habituation studies in six-month-old infants also demonstrated that the reversal of a launch resulted in a recovery from habituation to launches (while a non-causal control condition of delayed-launches did not; Leslie & Keeble, 1987). In their study, the reversal of motion direction was accompanied by a reversal of the color assignment to the cause-effect relationship. In contrast, our findings suggest that, in adults, color does not play a major role in the detection of a launch. Future studies should further delineate similarities and differences obtained from adaptation studies in adults and habituation studies in children (e.g., Kominsky et al., 2017; Kominsky & Scholl, 2020).”

      One overarching point about the analyses to take into consideration: The authors use a Bayesian psychometric curve-fitting approach to estimate a point of subjective equality (PSE) in different blocks for each individual participant based on a model with strong priors about the shape of the function and its asymptotic endpoints, and this PSE is the primary DV across all of the studies. As discussed in Kominsky & Scholl (2020), this approach has certain limitations, notably that it can generate nonsensical PSEs when confronted with relatively extreme response patterns. The authors mentioned that this happened once in Experiment 3 and that a participant had to be replaced. An alternate approach is simply to measure the proportion of 'pass' reports overall to determine if there is an adaptation effect. I don't think this alternate analysis strategy would greatly change the results of this particular experiment, but it is robust against this kind of self-selection for effects that fit in the bounds specified by the model, and may therefore be worth including in a supplemental section or as part of the repository to better capture the individual variability in this effect. 

      We largely agree with these points. Indeed, we adopted the non-parametric analysis for a recent series of experiments in which the psychometric curves were more variable (Ohl & Rolfs, Vision Sciences Society Meeting 2024). In the present study, however, the model fits were very convincing. In Figures S1, S2 and S3 we show the model fits for each individual observer and condition on top of the mean proportion of launch reports. The inferential statistics based on the points of subjective equality, therefore, allowed us to report our findings very concisely.
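
The model-free alternative raised in this exchange (comparing the overall proportion of 'pass' reports across blocks) could be sketched as follows. This is a minimal Python illustration under assumed conventions: responses are coded as the strings "launch" and "pass", and the function names and example data are hypothetical rather than taken from the shared analysis code.

```python
def pass_proportion(responses):
    """Share of 'pass' reports in one block of 'launch'/'pass' responses."""
    if not responses:
        raise ValueError("block contains no responses")
    return sum(r == "pass" for r in responses) / len(responses)


def adaptation_effect(before, after):
    """Change in the proportion of 'pass' reports from the pre-adaptation
    block to the post-adaptation block; positive values mean fewer
    launches were reported after adaptation."""
    return pass_proportion(after) - pass_proportion(before)


# Hypothetical blocks of ten test trials each.
before = ["launch"] * 6 + ["pass"] * 4
after = ["launch"] * 3 + ["pass"] * 7
effect = adaptation_effect(before, after)
```

Because this measure makes no assumption about the shape of the psychometric function, it remains defined for response patterns that a bounded curve fit cannot accommodate.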

      In general, this paper adds further evidence for something like a 'launching' detector in the visual system, but beyond that, it specifies some interesting questions for future work about how exactly such a detector might function. 

      We thank the reviewer for this positive overall assessment.

      Reviewer #2 (Recommendations for the authors):

      Generally, the paper is great. The questions I raised in the public review don't need to be answered at this time, but they're exciting directions for future work. 

      We would like to thank the reviewer for the encouraging comments and thoughtful ideas on how to improve the manuscript.

      I would have liked to see a little more description of the model parameters in the text of the paper itself just so readers know what assumptions are going into the PSE estimation. 

      We followed the reviewer’s suggestion and added more information regarding the parameter space (i.e., ranges of possible parameters of the logistic model) that we used for obtaining the model fits. 

      Specifically, we added the following information in the manuscript:

      “For model fitting, we constrained the range of possible estimates for each parameter of the logistic model. The lower asymptote for the proportion of reported launches was constrained to be in the range 0–0.75, and the upper asymptote in the range 0.25–1. The intercept of the logistic model was constrained to be in the range 1–15, and the slope was constrained to be in the range –20 to –1.”
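
To make the constraints above concrete, here is a minimal Python sketch of a bounded four-parameter logistic and its point of subjective equality (PSE). Only the parameter ranges come from the quoted text. The functional form, the coding of overlap as a proportion between 0 and 1, the coarse grid search (standing in for whatever estimation procedure the authors used), and all names are assumptions made for illustration.

```python
import math
from itertools import product

# Ranges quoted in the response; everything else in this sketch is assumed.
BOUNDS = {
    "lower": (0.0, 0.75),      # lower asymptote of launch reports
    "upper": (0.25, 1.0),      # upper asymptote of launch reports
    "intercept": (1.0, 15.0),
    "slope": (-20.0, -1.0),
}


def p_launch(overlap, lower, upper, intercept, slope):
    """Assumed form: asymptotes rescale a logistic of intercept + slope * overlap."""
    core = 1.0 / (1.0 + math.exp(-(intercept + slope * overlap)))
    return lower + (upper - lower) * core


def pse(intercept, slope):
    """Overlap at which the curve lies halfway between its two asymptotes."""
    return -intercept / slope


def grid(lo, hi, n):
    """n evenly spaced values from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def fit(overlaps, n_launch, n_trials, steps=8):
    """Coarse maximum-likelihood grid search restricted to BOUNDS."""
    names = ("lower", "upper", "intercept", "slope")
    axes = [grid(*BOUNDS[name], steps) for name in names]
    best, best_ll = None, -math.inf
    for lower, upper, intercept, slope in product(*axes):
        if upper <= lower:
            continue  # the asymptotes must be ordered
        ll = 0.0
        for x, k, n in zip(overlaps, n_launch, n_trials):
            p = p_launch(x, lower, upper, intercept, slope)
            p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep the logarithms finite
            ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
        if ll > best_ll:
            best, best_ll = (lower, upper, intercept, slope), ll
    return best


# Hypothetical data: launch counts out of 20 trials at five overlap levels.
params = fit([0.0, 0.25, 0.5, 0.75, 1.0], [20, 18, 10, 2, 0], [20] * 5)
estimated_pse = pse(params[2], params[3])
```

Restricting the search to the stated ranges is what keeps the estimated PSE interpretable, and it is also why extreme response patterns can fall outside what such a model is able to describe.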

      The models provided very good fits, as can be appreciated from the fits per individual and experimental condition, which we provide in response to the public comments. Please note that all data and analysis scripts are available at the Open Science Framework (https://osf.io/x947m/).

      I also have a recommendation about Figure 1b: Color-code "Feature A", "Feature B", and "Feature C" and match those colors with the object identity/speed/direction text. I get what the figure is trying to convey but to a naive reader there's a lot going on and it's hard to interpret. 

      We followed the reviewer’s suggestion and revised the visualization accordingly.

      If you have space, figures showing the adaptation and corresponding test events for each experimental manipulation would also be great, particularly since the naming scheme of the conditions is (necessarily) not entirely consistent across experiments. It would be a lot of little figures, I know, but to people who haven't spent as long staring at these displays as we have, they're hard to envision based on description alone. 

      We followed the reviewer’s recommendation and added a visualization of the adaptor and the test events for the different experiments in Figure 2.

      Reviewer #3 (Public Review):

      We thank the reviewer for their thoughtful comments, which we carefully addressed to improve the revised manuscript. 

      Summary: 

      This paper presents evidence from three behavioral experiments that causal impressions of "launching events", in which one object is perceived to cause another object to move, depend on motion direction-selective processing. Specifically, the work uses an adaptation paradigm (Rolfs et al., 2013), presenting repetitive patterns of events matching certain features to a single retinal location, then measuring subsequent perceptual reports of a test display in which the degree of overlap between two discs was varied, and participants could respond "launch" or "pass". The three experiments report results of adapting to motion direction, motion speed, and "object identity", and examine how the psychometric curves for causal reports shift in these conditions depending on the similarity of the adapter and test. While causality reports in the test display were selective for motion direction (Experiment 1), they were not selective for adapter-test speed differences (Experiment 2) nor for changes in object identity induced via color swap (Experiment 3). These results support the notion that causal perception is computed (in part) at relatively early stages of sensory processing, possibly even independently of or prior to computations of object identity.

      Strengths: 

      The setup of the research question and hypotheses is exceptional. The experiments are carefully performed (appropriate equipment, and careful control of eye movements). The slip adaptor is a really nice control condition and effectively mitigates the need to control motion direction with a drifting grating or similar. Participants were measured with sufficient precision, and a power curve analysis was conducted to determine the sample size. Data analysis and statistical quantification are appropriate. Data and analysis code are shared on publication, in keeping with open science principles. The paper is concise and well-written. 

      Weaknesses: 

      The biggest uncertainty I have in interpreting the results is the relationship between the task and the assumption that the results tell us about causality impressions. The experimental logic assumes that "pass" reports are always non-causal impressions and "launch" reports are always causal impressions. This logic is inherited from Rolfs et al (2013) and Kominsky & Scholl (2020), who assert rather than measure this. However, other evidence suggests that this assumption might not be solid (Bechlivanidis et al., 2019). Specifically, "[our experiments] reveal strong causal impressions upon first encounter with collision-like sequences that the literature typically labels "non-causal"" (Bechlivanidis et al., 2019) -- including a condition that is similar to the current "pass". It is therefore possible that participants' "pass" reports could also involve causal experiences. 

      We agree with the reviewer that our study assumes that the launch-pass dichotomy can be mapped onto a dimension of causal to non-causal impressions. Please note that the choice for this launch-pass task format was intentional. We consider it an advantage that subjects do not have to report causal vs non-causal impressions directly, as it allows us to avoid the often-criticized decision biases that come with asking participants about their causal impression (Joynson, 1971; for a discussion see Choi & Scholl, 2006). This obviously comes at the cost that participants did not directly report their causal impression in our experiments. There is, however, evidence that increasing overlap between the discs monotonically decreases the causal impression when directly asking participants to report their causal impression (Scholl & Nakayama, 2004). We believe, therefore, that the assumption of a mapping between launches-to-passes and causal-to-noncausal is well-justified. At the same time, the expressed concern emphasizes the need to develop further, possibly implicit measures for causal impressions (see Völter & Huber, 2021).

      However, as pointed out by the reviewer, a recent paper demonstrated that on first encounter participants can have impressions in response to a pass event that are different from clearly non-causal impressions (Bechlivanidis et al., 2019). As demonstrated in the same paper, displaying a canonical launch decreased the impression of causality when seeing pass events in subsequent trials. In our study, participants completed an entire training session before running the main experiments. It is therefore reasonable to expect that participants observed passes as non-causal events given the presence of clear causal references. Nevertheless, we now acknowledge this concern directly in the revised manuscript.

      We added the following paragraph to the discussion:

      “In our study, we assessed causal perception by asking observers to report whether they observed a launch or a pass in events of varying ambiguity. This method assumes that launches and passes can be mapped onto a dimension that ranges from causal to non-causal impressions. It has been questioned whether pass events are a natural representative of noncausal events: Observers often report high impressions of causality upon first exposure to pass events, which then decreased after seeing a canonical launch (Bechlivanidis, Schlottmann, & Lagnado, 2019). In our study, therefore, participants completed a separate session that included canonical launches before starting the main experiment.”

      Furthermore, since the only report options are "launch" or "pass", it is also possible that "launch" reports are not indications of "I experienced a causal event" but rather "I did not experience a pass event". It seems possible to me that different adaptation transfer effects (e.g. selectivity to motion direction, speed, or color-swapping) change the way that participants interpret the task, or the uncertainty of their impression. For example, it could be that adaptation increases the likelihood of experiencing a "pass" event in a direction-selective manner, without changing causal impressions. Increases of "pass" impressions (or at least, uncertainty around what was experienced) would produce a leftward shift in the PSE as reported in Experiment 1, but this does not necessarily mean that experiences of causal events changed. Thus, changes in the PSEs between the conditions in the different experiments may not directly reflect changes in causal impressions. I would like the authors to clarify the extent to which these concerns call their conclusions into question. 

      Indeed, PSE shifts are subject to cognitive influences and can even be voluntarily shifted (Morgan et al., 2012). We believe that decision biases (e.g., reporting the presence of launch before adaptation vs. reporting the absence of a pass after the adaptation) are unlikely to explain the high specificity of aftereffects observed in the current study. While such aftereffects are very typical of visual processing (Webster, 2015), it is unclear how a mechanism that increases the likelihood of perceiving a pass could account for the retinotopy of adaptation to launches (Rolfs et al., 2013) or the recently reported selective transfer of adaptation for only some causal categories (Kominsky et al., 2020). The latter authors revealed a transfer of adaptation from triggering to launching, but not from entraining events to launching. Based on these arguments, we decided not to include this point in the revised manuscript.
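
      To make the notion of a PSE shift concrete, the following minimal sketch fits a logistic psychometric function to the proportion of "launch" reports as a function of disc overlap and reads off the PSE as the overlap at which "launch" and "pass" reports are equally likely. It is an illustration only: the fitting routine, the function names (`fit_logistic`, `pse`), and all response counts are hypothetical and are not taken from the study, whose actual fitting procedure is not described in this exchange.

```python
import math

def fit_logistic(x, k, n, iterations=25):
    """Maximum-likelihood fit of P(launch) = 1 / (1 + exp(-(a + b*x))).

    x: disc overlap per condition, k: number of "launch" reports,
    n: number of trials. Uses plain Newton steps on the two parameters.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, ki, ni in zip(x, k, n):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = ni * p * (1.0 - p)   # binomial variance weight
            r = ki - ni * p          # residual (observed - expected)
            g_a += r
            g_b += r * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        # Solve the 2x2 Newton system for the parameter update.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def pse(a, b):
    """Overlap at which P(launch) = 0.5."""
    return -a / b

# Hypothetical data: 20 trials per overlap level.
overlap = [0.0, 0.25, 0.5, 0.75, 1.0]
trials = [20] * 5
launch_before = [18, 14, 10, 6, 2]   # before adaptation
launch_after = [14, 10, 6, 2, 1]     # after adaptation: fewer "launch" reports

pse_before = pse(*fit_logistic(overlap, launch_before, trials))
pse_after = pse(*fit_logistic(overlap, launch_after, trials))
shift = pse_after - pse_before       # negative value = leftward shift
```

      In this toy example the post-adaptation curve crosses 50% at a smaller overlap, which is the leftward PSE shift discussed above. Note that the fit itself cannot distinguish a perceptual change from a change in response criterion; that distinction rests on the specificity arguments given in the response.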

      Leaving these concerns aside, I am also left wondering about the functional significance of these specialised mechanisms. Why would direction matter but speed and object identity not? Surely object identity, in particular, should be relevant to real-world interpretations and inputs of these visual routines? Is color simply too weak an identity? 

      We agree that it would be beneficial to have mechanisms in place that are specific for certain object identities. Overall, our results fit very well with established claims that only spatiotemporal parameters mediate the perception of causality (Michotte, 1963; Leslie, 1984; Scholl & Tremoulet, 2000). We have now explicitly listed these references again in the revised manuscript. It is important to note that an understanding of a causal relation could suffice to track identity information based purely on spatiotemporal contingencies, neglecting distinguishing surface features.

      We revised the manuscript and state:

      “Our findings therefore provide additional support for the claim that an event’s spatiotemporal parameters mediate the perception of causality (Michotte, 1963; Leslie, 1984; Scholl & Tremoulet, 2000).”

      Moreover, we think our findings of directional selectivity have functional relevance. First, direction-selective detection of collisions allows for an adaptation that occurs separately for each direction. That means that the visual system can calibrate these visual routines for detecting causal interactions in response to real-world statistics that reflect differences in directions. For instance, due to gravity, objects will simply fall to the ground. Causal relations such as launches are likely to be more frequent in horizontal directions, along a stable ground. Second, we think that causal visual events are action-relevant, that is, acting on (potentially) causal events promises an advantage (e.g., avoiding a collision, or quickly catching an object that has been pushed away). The faster we can detect such causal interactions, the faster we can react to them. Direction-selective motion signals are available in the first stages of visual processing. Visual routines that are based on these direction-selective motion signals promise to enable such fast computations. Please note, however, that while our present findings demonstrate direction-selectivity, they do not pinpoint where exactly that visual routine is located. It is quite possible that the visual routine is located higher up in the visual system, relying on a direction-selective population response as input.

      We added these points to the discussion of the functional relevance: 

      “We suggest that at least two functional benefits result from a specialized visual routine for detecting causality. First, a direction-selective detection of launches allows adaptation to occur separately for each direction. That means that the visual system can automatically calibrate the sensitivity of these visual routines in response to real-world statistics. For instance, while falling objects drop vertically towards the ground, causal relations such as launches are common in horizontal directions moving along a stable ground. Second, we think that causal visual events are action-relevant, and the faster we can detect such causal interactions, the faster we can react to them. Direction-selective motion signals are available very early on in the visual system. Visual routines that are based on these direction-selective motion signals may enable faster detection. While our present findings demonstrate direction-selectivity, they do not pinpoint where exactly that visual routine is located. It is possible that the visual routine is located higher up in the visual system (or distributed across multiple levels), relying on a direction-selective population response as input.”

      Reviewer #3 (Recommendations for the authors):

      - The concept of "visual routines" is used without introduction; for a general-interest audience it might be good to include a definition and reference(s) (e.g. Ullman.). 

      Thank you very much for highlighting that point. We have chosen the term visual routine to emphasize that we locate the part of the mechanism that is affected by the adaptation in our experiments in the visual system, but at the same time it leaves space regarding the extent to which the mechanism further involves mid- and higher-level processes. The term thus has a clear reference to a visual routine by Ullman (1987). We have now addressed what we mean by visual routine, and we also included the reference in the revised manuscript.

      We added the following footnote to the introduction:

      “We use the term visual routine here to highlight that our adaptation experiments can reveal a causality detection mechanism that resides in the visual system. At the same time, calling it a routine emphasizes similarities with a local, semi-independent operation (e.g., the recognition of familiar motion patterns; see also Ullman, 1987; Cavanagh, Labianca, & Thornton, 2001) that can engage mid- and higher-level processes (e.g., during causal capture, Scholl & Nakayama, 2002; or multisensory integration, Körding et al., 2007).”

      - I would appreciate slightly more description of the phenomenology of the WW adaptors: is this Michotte's "entraining" event? Does it look like one disc shunts the other?  

      The stimulus differs from Michotte's entrainment event in both spatiotemporal parameters and phenomenology. We added videos for the launch, pass and slip events as Supplementary Material.

      Moreover, we described the slip event in the methods section:

      “In two additional sessions, we presented slip events as adaptors to control that the adaptation was specific for the impression of causality in the launching events. Slip events are designed to match the launching events in as many physical properties as possible while producing a very different, non-causal phenomenology. In slip events, the first peripheral disc also moves towards a stationary disc. In contrast to launching events, however, the first disc passes the stationary disc and stops only when it is adjacent to the opposite edge of the stationary disc. While slip events do not elicit a causal impression, they have the same number of objects and motion onsets, the same motion direction and speed, as well as the same spatial area of the event as launches.”

      In the revised manuscript, we also added more information on the slip event at the beginning of the results section. Importantly, the stimulus typically produces the impression of two independent movements and thus serves as a non-causal control condition in our study. Only anecdotally, some observers (not involved in this study) who saw the stimulus spontaneously described their phenomenology of seeing a slip event as a double step or a discus throw.

      We added the following description to the results section:

      “Moreover, we compared the visual adaptation to launches to a (non-causal) control condition in which we presented slip events as adaptor. In a slip event, the initially moving disc passes completely over the stationary disc, stops immediately on the other side, and then the initially stationary disc begins to move in the same direction without delay. Thus, the two movements are presented consecutively without a temporal gap. This stimulus typically produces the impression of two independent (non-causal) movements.”

      - In general more illustrations of the different conditions (similar to Figure 1c but for the different experimental conditions and adaptors) might be helpful for skim readers.  

      We followed the reviewer’s recommendation and added a visualization of the adaptor and the test events for the different experiments in Figure 2.

      - Were the luminances of the red and green balls in experiment 3 matched? Were participants checked for color anomalous vision?  

      Yes, we checked for color anomalous vision using the color test Tafeln zur Prüfung des Farbensinnes/Farbensehens (Kuchenbecker & Broschmann, 2016). We added that information to the manuscript. The red and green discs were not matched for luminance. We measured the luminance after the experiment (21 cd/m<sup>2</sup> for the green disc and 6 cd/m<sup>2</sup> for the red disc). Please note, that the differences in luminance should not pose a problem for the interpretation of the results, as we see a transfer of the adaptation across the two different colors.

      We added the following information to the manuscript:

      “The red and green discs were not matched for luminance. Measurements obtained after the experiments yielded a luminance of 21 cd/m<sup>2</sup> for the green disc and 6 cd/m<sup>2</sup> for the red disc.”

      “All observers had normal or corrected-to-normal vision and color vision as assessed using the color test Tafeln zur Prüfung des Farbensinnes/Farbensehens (Kuchenbecker & Broschmann, 2016).”

      - Relationship of this work to the paper by Arnold et al., (2015). That paper suggested that some effects of adaptation of launching events could be explained by an adaptation of object shape, not by causality per se. It is superficially difficult to see how one could explain the present results from the perspective of object "squishiness" -- why would this be direction selective? In other words, the present results taken at face value call the "squishiness" explanation into question. The authors could consider an explanation to reconcile these findings in their discussion. 

      Indeed, the paper by Arnold and colleagues (2015) suggested that a contact-launch adaptor could lead to a squishiness aftereffect, arguing that the object elasticity changed in response to the adaptation. Importantly, the same study found an object-centered adaptation effect rather than a retinotopic adaptation effect. However, the retinotopic nature of the negative aftereffect as used in our study has been repeatedly replicated (for instance Kominsky & Scholl, 2020). Thus, the divergent results of Arnold and colleagues may have resulted from differences in the task (i.e., observers had to judge whether they perceived a soft vs. hard bounce), or the stimuli (i.e., bounces of a disc and a wedge, and the discs moving on a circular trajectory). It would be important to replicate these results first and then determine whether their squishiness effect would be direction-selective as well. We now acknowledge the study by Arnold and colleagues in the discussion:

      “The adaptation of causality is spatially specific to the retinotopic coordinates of the adapting stimulus (Kominsky & Scholl, 2020; Rolfs et al., 2013; for an object-centered elasticity aftereffect using a related stimulus on a circular motion path, see Arnold et al., 2015), suggesting that the detection of causal interactions is implemented locally in visual space.”

      - Line 32: "showing that a specialized visual routine for launching events exists even within separate motion direction channels". This doesn't necessarily mean the routine is within each separate direction channel, only that the output of the mechanism depends on the population response over motion direction. The critical motion computation could be quite high level -- e.g. global pattern motion in MST. Please clarify the claim. 

      We agree with the reviewer that it is also possible that critical parts of the visual routine could simply use the aggregated population response over motion direction at higher levels of processing. We acknowledge this possibility in the discussion of the functional relevance of the proposed mechanism and when suggesting that a distributed brain network may contribute to the perception of causality.

      We would like to highlight the following two revised paragraphs.

      “[…] Second, we think that causal visual events are action-relevant, and the faster we can detect such causal interactions, the faster we can react to them. Direction-selective motion signals are available very early on in the visual system. Visual routines that are based on these direction-selective motion signals may enable faster detection. While our present findings demonstrate direction-selectivity, they do not pinpoint where exactly that visual routine is located. It is possible that the visual routine is located higher up in the visual system (or distributed across multiple levels), relying on a direction-selective population response as input.”

      Moreover, when discussing the neurophysiological literature we write:

      “Interestingly, single cell recordings in area F5 of the primate brain revealed that motor areas are contributing to the perception of causality (Caggiano et al., 2016; Rolfs, 2016), emphasizing the distributed nature of the computations underlying causal interactions. This finding also stresses that the detection, and the prediction, of causality is essential for processes outside purely sensory systems (e.g., for understanding other’s actions, for navigating, and for avoiding collisions).”

      -  p. 10 line 30: typo "particual".  

      Done.

      -  p. 10 line 37: "This findings rules out (...)" should be singular "This finding rules out (...)". 

      Done.

      -  Spelling error throughout: "underly" should be "underlie". 

      Done.

      -  p.11 line 29: "emerges fast and automatic" should be "automatically". 

      Done.

    1. eLife Assessment

      Cichlid fishes have attracted attention from a wide range of biologists because of their extensive species diversification at the ecological and phenotypic levels. In this important study, the authors have partially revealed the mechanism behind lip thickening in cichlid fishes, which has evolved independently across three lakes in Africa. To explore this phenomenon, the authors used histological comparison, proteomics, and transcriptomics, all of which are well suited for their objectives. With compelling evidence, this contribution provides insights into parallel evolution in polygenic traits and holds significant value for the field.

    2. Reviewer #1 (Public review):

      Summary:

      Machii et al. reported a possible molecular mechanism underlying the parallel evolution of lip hypertrophy in African cichlids. The multifaceted approach taken in this manuscript is highly valued, as it uses histology, proteomics, and transcriptomics to reveal how phylogenetically distinct thick-lips have evolved in parallel. Findings from histology and proteomics connected to wnt signaling through the transcriptome are very exciting.

      Strengths:

      There is consistency between the results and it is possible to make a strong argument from the results.

      Comments on revised version:

      The issues I pointed out in the previous review have been carefully answered, and all issues have been addressed. The main points of the manuscript are clear, and the conclusions are easy to understand. The enlarged lips are a notable example of convergent evolution in African cichlids.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1:

      Weaknesses:

      The authors do not discuss based on genomic information; the genomes of the cichlids from the three lakes have been decoded and are therefore available. However, indeed, the species in Lake Tanganyika and Lake Malawi/Victoria are genetically distant from each other, so a comparative genome analysis would not have yielded the results presented here. I recommend adding such a discussion to the Discussion.

      We appreciate your comment. We added a discussion of the genomic aspect of parallel evolution.

      Line 386-393: “From a genomic perspective, several studies have investigated the genetic basis of hypertrophied lip cichlids (Masonick et al., 2023; Nakamura et al., 2021). Importantly, some Wnt pathway-related genes (tcf4 and daam2) and ECM-related genes (postna, col12a1a, and col12a1b) have been found to be under positive selection in cichlids with hypertrophied lips of Lake Victoria (see Nakamura et al., 2021 Table S3). For future research, examining whether these genes are under selection in other lakes is crucial to understand the genetic mechanisms underlying the parallel evolution of hypertrophied lips.”

      Minor comments:

      Line 30, the Wnt --> the genes in Wnt

      We appreciate your comment. According to the comment, we corrected the sentence.

      Line 30: “the Wnt signaling pathway” -> “the genes in Wnt signaling pathway”

      Line 42-44, "It is considered that the same direction of natural selection drives phenotypic changes among species since it is unlikely that these complex phenotypes have been acquired repeatedly just by neutral evolution". How about "Since it is unlikely that such a complex phenotype was acquired repeatedly by neutral evolution alone, the same direction of natural selection among species is likely to drive the parallel phenotypic change."?

      We agree with your suggestion and corrected the sentence in our manuscript.

      Line 42-44: “It is considered that the same direction of natural selection drives phenotypic changes among species since it is unlikely that these complex phenotypes have been acquired repeatedly just by neutral evolution”

      “Since it is unlikely that such a complex phenotype was acquired repeatedly by neutral evolution alone, the same direction of natural selection among species is likely to drive the parallel phenotypic change”

      Line 60, polygenic --> likely to be polygenic

      We appreciate your comment. Indeed, it is better to weaken the wording.

      Line 60: “most traits are polygenic” -> “most traits are likely to be polygenic”

      Line 91, the Wnt --> the genes in Wnt

      We appreciate your correction. Last paragraph of introduction has been corrected according to the suggestion of Reviewer 2 (Q1).

      Line 230, NovaSeq --> Illumina NovaSeq

      We appreciate your correction.

      Line 222: “NovaSeq 6000” -> “Illumina NovaSeq 6000”

      Line 231 "mRNA Library Prep Kit". Please add a company name.

      We appreciate your correction. We added the company’s information.

      Line 223: “a TruSeq stranded mRNA Library Prep Kit.” -> “a TruSeq stranded mRNA Library Prep Kit (Illumina)”

      Line 267, as for the tip of hypertrophied lips, could you add and point out which part is the tip?

      We dissected the hypertrophied lips into two halves, an anterior half (tip) and a posterior half (base). We added a sentence to this effect in the materials and methods section.

      Line 156-158: “The lips of H. chilotes were analyzed separately for the base and tip.” -> “The lips of H. chilotes were dissected in two half anterior (tip) and half posterior (base), which are analyzed separately.”

      Line 272, "133 proteins upregulated and 5 proteins downregulated" in hypertrophied lip or normal lip?

      We appreciate your correction. We added the sentence as follows.

      Line 264: “133 proteins upregulated and 5 proteins downregulated”

      “133 proteins upregulated and 5 proteins downregulated in the hypertrophied lip”

      Line 274, "hypertrophied lips" means tip of hypertrophied lips?

      We appreciate your correction. We corrected the sentence as follows.

      Line 266: “hypertrophied lips are abundant” -> “tip of hypertrophied lips is abundant”

      Line 277, Did you perform multiple testing correction for statistical significance?

      We appreciate your comment about multiple testing corrections. In our exploratory proteomics analysis, we did not apply a multiple testing correction, so as not to miss biologically important candidates given the limited sample size (n=3). We have now calculated corrected p-values using the Benjamini-Hochberg method (Author response image 1, right). The result suggested that almost the same proteoglycans and related proteins that we focused on are highly accumulated in the hypertrophied lips under a milder threshold (significance level of 0.1).

      Author response image 1.

      Thus, our main conclusions remain unchanged even with the correction applied. However, the overall balance of the volcano plot is not visually appealing (Author response image 1, right).

      It is important to note that we selected the Top 20 proteins based on fold change rather than statistical significance. In addition, our proteomic findings are consistent with our histological and transcriptome data, which provides biological validation from several angles. While we understand the potential benefits of multiple testing correction, our current approach without it still offers valuable and fair data from which to propose hypotheses on the molecular mechanisms of lip hypertrophy in cichlids. Therefore, we would like to keep the original figure without multiple testing correction. We greatly appreciate the reviewer's understanding.
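
      For readers unfamiliar with the Benjamini-Hochberg adjustment referred to above, a minimal sketch follows. It is not the authors' analysis code: the function name `benjamini_hochberg` and the p-values are hypothetical, and only the significance level of 0.1 is taken from the response.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five proteins.
p_raw = [0.001, 0.008, 0.039, 0.041, 0.60]
p_adj = benjamini_hochberg(p_raw)

# Proteins retained at the significance level of 0.1 mentioned above.
significant = [p <= 0.1 for p in p_adj]
```

      Each raw p-value is scaled by the number of tests divided by its rank, so small p-values among many tests are penalized more strongly than in an uncorrected analysis. With only three replicates per group, this can remove candidates that are nonetheless biologically plausible, which is the trade-off the response describes.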

      Line 349-351, "The results of the enrichment analysis suggested that the genes that were categorized into both canonical and non-canonical Wnt signaling pathways, were highly expressed in the hypertrophied lips of juvenile and adult cichlids."

      The wnt category was enriched by analyzing the highly expressed genes, so isn't it natural that the wnt category is highly expressed?

      Did you mean to say as in the following sentence?

      "Enrichment of genes categorized in the canonical and noncanonical Wnt signaling pathways suggested that high expression of genes in the Wnt signaling pathway is likely to be involved in the hypertrophied lips of juvenile and adult fish."

      Thank you for your comments. We corrected our manuscript as follows.

      Line 341-344: “The results of the enrichment analysis suggested that the genes that were categorized into both canonical and non-canonical Wnt signaling pathways, were highly expressed in the hypertrophied lips of juvenile and adult cichlids.”

      “As a result of enrichment analysis, DEGs were categorized in the canonical and noncanonical Wnt signaling pathways, suggesting that high expression of genes in the Wnt signaling pathway is likely to be involved in the hypertrophied lips of juvenile and adult fish.”

      Line 403-404, "several other pathways may be involved in the development of hypertrophied lips". Do you have any evidence?

      We appreciate your comment regarding possible evidence for the involvement of multiple pathways in hypertrophied lip development. Our statement was based on two main points:

      (1) While we highlighted the Wnt pathway because this pathway is known to increase proteoglycan expression, we cannot exclude the possibility of the involvement of other pathways. For instance, our enrichment analysis in adult cichlids identified VEGF-related pathways, which could contribute to lip hypertrophy by increasing vascularization and nutrient supply to the lip tissue.

      (2) Previous quantitative trait locus (QTL) analysis by Henning et al. (2017) concluded that lip hypertrophy is likely influenced by numerous loci with small additive effects. This indicates that lip hypertrophy is a complex phenotype consisting of multiple genetic factors, some of which probably correspond to different molecular pathways.

      Given these points, we draw a conclusion that emphasizes the importance of the Wnt pathway while also recognizing the potential cooperative interaction of multiple pathways in developing lip hypertrophy. To keep the two statements distinct, we corrected our manuscript as follows.

      Line 398-412: “We uncovered the apparent relationships between hypertrophied lips and the expression profiles of ECM proteins, in particularly proteoglycans. The trends for the overall expression of ECM-related genes were similar across hypertrophied lip species, but we rarely observed a specific gene that was commonly expressed at high or low levels in all three examples of hypertrophied lips across all East African Great Lakes. Furthermore, although we focused primarily on the relationship between the Wnt signaling pathway and lip hypertrophy, several other pathways may be involved in the development of hypertrophied lips. These findings imply that although enlargement of proteoglycan-rich loose connective tissue is common in hypertrophied lips, the developmental pathways to accomplish this are diverse in each lake.”

      “We uncovered the apparent relationships between hypertrophied lips and the expression profiles of ECM proteins, in particularly proteoglycans. The trends for the overall expression of ECM-related genes were similar across hypertrophied lip species, but we rarely observed a specific gene that was commonly expressed at high or low levels in all three examples of hypertrophied lips across all East African Great Lakes. Furthermore, although we focused primarily on the relationship between the Wnt signaling pathway and lip hypertrophy, several other pathways may be involved in the development of hypertrophied lips. For example, our enrichment analysis in adult cichlids identified VEGF-related pathways, which could contribute to lip hypertrophy by increasing vascularization and nutrient supply to the lip tissue. In addition, previous quantitative trait locus (QTL) analysis by Henning et al. (2017) concluded that lip hypertrophy is likely influenced by numerous loci with small additive effects. These lines of data imply that although enlargement of proteoglycan-rich loose connective tissue is common in hypertrophied lips, the developmental pathways to accomplish this are diverse in each lake.”

      Reviewer 2:

      Minor comments:

      Last paragraph of Introduction: Remove the results of this study.

      We appreciate your suggestion. We removed the specific results from the last paragraph.

      “In this study, we comprehensively compared the hypertrophied lips of cichlids across all East African Great Lakes using histology, proteomics, and transcriptomics. Histological and proteomic analyses revealed a distinct microstructure of hypertrophied lips compared to normal lips, and primary candidate proteins were identified. Transcriptome analysis at different developmental stages showed that the genes in Wnt signaling pathway was highly expressed in cichlids with hypertrophied lips at both the juvenile and adult stages. It is noteworthy that the distinct expression profiles observed in the proteome and transcriptome analyses of hypertrophied lips were similar among cichlids from each of the East African Great Lakes. The present study, which integrates comprehensive analyses for cichlids from all East African Great Lakes, provides insight for a better understanding of the molecular basis of a typical example of parallel evolution.”

      Line 87-91: “In this study, we comprehensively compared the hypertrophied and normal lips of cichlids across all East African Great Lakes at various biological levels using histology, proteomics, and transcriptomics. As a result, we showed that a novel key pathway commonly involved in the formation of hypertrophied lips, providing insight into a better understanding of the molecular basis of a typical example of parallel evolution.”

      Line 156: Italicize the scientific names.

      We appreciate your correction.

      Line 148: “M. zebra and O. niloticus” -> “M. zebra and O. niloticus” (italicized)

      Line 261: Remove the period after "Victoria."

      We appreciate your correction.

      Line 253: “Lake Victoria. (Figure 1; Figure S2).” -> “Lake Victoria (Figure 1; Figure S2).”

      Line 416: Remove the period after "tissue."

      We appreciate your correction.

      Line 420: “tissue. (A,B)” -> “tissue (A,B)”

      Line 646: Probably "the anterior side to the left."

      We apologize for our mistake. As you commented, the anterior side is to the left. We corrected our manuscript as follows.

      Line 648: “the anterior side to the right” -> “the anterior side to the left”

      Fig. S2: Based on Fig. 1, the VG stained area appears larger in the Hypertrophied lip species; however, it is the opposite in Fig. S2.

      We appreciate your comments. This is because we calculated the ratio of the VG-stained area to the whole lip area. While the absolute VG-stained area is larger in hypertrophied lips, the proportion of the VG-stained area relative to the total lip area is smaller. Normalizing by the entire lip area allows us to simply compare the degree of lip hypertrophy among species.

    1. eLife Assessment

      This important manuscript investigates the role of olfactory cues in Pieris brassicae larvae, focusing on their interactions with the host plant Brassica oleracea and the parasitoid wasp Cotesia glomerata. The authors' demonstration that impaired olfactory perception reduces caterpillar performance and increases susceptibility to parasitism is solid. These findings highlight the ecological significance of olfaction in mediating feeding behavior and predator avoidance in herbivorous insects.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript focuses on the olfactory system of Pieris brassicae larvae and the importance of olfactory information in their interactions with the host plant Brassica oleracea and the major parasitic wasp Cotesia glomerata. The authors used CRISPR/Cas9 to knockout odorant receptor co-receptors (Orco), and conducted a comparative study on the behavior and olfactory system of the mutant and wild-type larvae. The study found that Orco-expressing olfactory sensory neurons in antennae and maxillary palps of Orco knockout (KO) larvae disappeared, and the number of glomeruli in the brain decreased, which impairs the olfactory detection and primary processing in the brain. Orco KO caterpillars show weight loss and loss of preference for optimal food plants; KO larvae also lost weight when attacked by parasitoids with the ovipositor removed, and mortality increased when attacked by untreated parasitoids. On this basis, the authors further studied the responses of caterpillars to volatiles from plants attacked by the larvae of the same species and volatiles from plants on which the caterpillars were themselves attacked by parasitic wasps. Lack of OR-mediated olfactory inputs prevents caterpillars from finding suitable food sources and from choosing spaces free of enemies.

      Strengths:

      The findings help to understand the important role of olfaction in caterpillar feeding and predator avoidance, highlighting the importance of odorant receptor genes in shaping ecological interactions.

      Weaknesses:

      There are the following major concerns:

      (1) Possible non-targeted effects of Orco knockout using CRISPR/Cas9 should be analyzed and evaluated in Materials and Methods and Results.

      (2) Figure 1E: Only one olfactory receptor neuron was marked in WT. There are at least three olfactory sensilla at the top of the maxillary palp. Therefore, to explain the loss of Orco-expressing neurons in the mutant (Figure 1F), a more rigorous explanation of the photo is required.

      (3) In Figure 1G, H, the four glomeruli are circled by dotted lines: their corresponding relationship between the two figures needs to be further clarified.

      (4) Line 130: The main topic of this study is the olfactory system of larvae, yet the experimental results in this part all concern antennal electrophysiological responses, mating frequency, and egg production of wild-type and Orco KO adults. The authors may consider moving this part to the supplementary files. It would be better to include some data on the olfactory responses of larvae.

      (5) Line 166: The sentences in the text are about the choice test between "healthy plant vs. infested plant", while in Fig 3C, it is "infested plant vs. no plant". The content in the text does not match the figure.

      (6) Lines 174-178: Figure 3A showed that the body weight of Orco KO larvae in the absence of parasitic wasps also decreased compared with that of WT. Therefore, in the experiments of Figure 3A and E, the difference in the body weight of Orco KO larvae in the presence or absence of parasitic wasps without ovipositors should also be compared. The current data cannot determine whether the reduced weight of the KO mutant is due to the Orco knockout or the presence of parasitic wasps.

      (7) Lines 179-181: Figure 3F shows that the survival rate of larvae of Orco KO mutant decreased in the presence of parasitic wasps, and the difference in survival rate of larvae of WT and Orco KO mutant in the absence of parasitic wasps should also be compared. The current data cannot determine whether the reduced survival of the KO mutant is due to the Orco knockout or the presence of parasitic wasps.

      (8) In Figure 4B, why do the compounds tested have no volatiles derived from plants? Cruciferous plants have the well-known mustard bomb. In the behavioral experiments, the larvae responses to ITC compounds were not included, which is suggested to be explained in the discussion section.

      (9) The custom-made setup and the relevant behavioral experiments in Figure 4C need to be described in detail (Line 545).

      (10) Materials and Methods Line 448: 10 μL paraffin oil should be used for negative control.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript investigated the effect of olfactory cues on caterpillar performance and parasitoid avoidance in Pieris brassicae. The authors knocked out Orco to produce caterpillars with significantly reduced olfactory perception. These caterpillars showed reduced performance and increased susceptibility to a parasitoid wasp.

      Strengths:

      This is an impressive piece of work and a well-written manuscript. The authors have used multiple techniques to investigate not only the effect of the loss of olfactory cues on host-parasitoid interactions, but also the mechanisms underlying this.

      Weaknesses:

      I do have one major query regarding this manuscript - I agree that the results of the caterpillar choice tests in a y-maze give weight to the idea that olfactory cues may help them avoid areas with higher numbers of parasitoids. However, the experiments with parasitoids were carried out on a single plant. Given that caterpillars in these experiments were very limited in their potential movement and source of food - how likely is it that avoidance played a role in the results seen from these experiments, as opposed to simply the slower growth of the KO caterpillars extending their period of susceptibility? While the two mechanisms may well both take place in nature - only one suggests a direct role of olfaction in enemy avoidance at this life stage, while the other is an indirect effect, hence the distinction is important.

      My other issue was that determining the sample sizes from the text was sometimes a bit confusing. (This was much clearer from the figures).

      I also couldn't find the test statistics for any of the statistical methods in the main text, or in the supplementary materials.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses:

      There are the following major concerns:

      (1) Possible non-targeted effects of Orco knockout using CRISPR/Cas9 should be analyzed and evaluated in Materials and Methods and Results.

      Thank you for your suggestion. In the Materials and Methods, we mention how we selected the target region and evaluated potential off-target sites by Exonerate and CHOPCHOP. Neither of these methods found potential off-target sites with a more-than-17-nt alignment identity. Therefore, we assumed no off-target effect in our Orco KO. Furthermore, we did not find any developmental differences between WT and KO caterpillars when these were reared on leaf discs in Petri dishes (Fig S4). We will further highlight this information on the off-target evaluation in the Results section of our revised manuscript.

      (2) Figure 1E: Only one olfactory receptor neuron was marked in WT. There are at least three olfactory sensilla at the top of the maxillary palp. Therefore, to explain the loss of Orco-expressing neurons in the mutant (Figure 1F), a more rigorous explanation of the photo is required.

      Thank you for pointing this out. The figure shows only a qualitative comparison between WT and KO and we did not aim to determine the total number of Orco positive neurons in the maxillary palps or antennae of WT and KO caterpillars, but please see our previous work for the neuron numbers in the caterpillar antennae (Wang et al., 2023). We did indeed find more than one neuron in the maxillary palps, but as these were in very different image planes it was not possible to visualize them together. However, we will add a few sentences in the Results and Discussion section to explain the results of the maxillary palp Orco staining.

      (3) In Figure 1G, H, the four glomeruli are circled by dotted lines: their corresponding relationship between the two figures needs to be further clarified.

      Thank you for pointing this out. The four glomeruli in Figure 1G and 1H do not strictly correspond to each other. We circled these glomeruli to highlight them, as they are the best visualized and most clearly shown in this view. In this study, we only counted the number of glomeruli in WT and KO; we did not determine which glomeruli are missing in the KO caterpillar brain. We will further explain this in the figure legend.

      (4) Line 130: The main topic of this study is the olfactory system of larvae, yet the experimental results in this part all concern antennal electrophysiological responses, mating frequency, and egg production of wild-type and Orco KO adults. The authors may consider moving this part to the supplementary files. It would be better to include some data on the olfactory responses of larvae.

      Thank you for your suggestion. We do agree with your suggestion, and we will consider moving this part to the supplementary information. Regarding larval olfactory response, we unfortunately failed to record any spikes using single sensillum recordings due to the difficult nature of the preparation; however, we do believe that this would be an interesting avenue for further research.

      (5) Line 166: The sentences in the text are about the choice test between "healthy plant vs. infested plant", while in Fig 3C, it is "infested plant vs. no plant". The content in the text does not match the figure.

      Thank you for pointing this out. The sentence is “We compared the behaviors of both WT and Orco KO caterpillars in response to clean air, a healthy plant and a caterpillar-infested plant”. We tested these three stimuli in two comparisons: healthy plant vs no plant, infested plant vs no plant. The two comparisons are shown in Figure 3C separately. We will aim to describe this more clearly in the revised version of the manuscript.

      (6) Lines 174-178: Figure 3A showed that the body weight of Orco KO larvae in the absence of parasitic wasps also decreased compared with that of WT. Therefore, in the experiments of Figure 3A and E, the difference in the body weight of Orco KO larvae in the presence or absence of parasitic wasps without ovipositors should also be compared. The current data cannot determine whether the reduced weight of the KO mutant is due to the Orco knockout or the presence of parasitic wasps.

      Thank you for pointing this out. We did not make a comparison between the data of Figures 3A and 3E since the two experiments were not conducted at the same time due to the limited space in our BioSafety Ⅲ greenhouse. We do agree that the weight decrease in Figure 3E is partly due to the reduced caterpillar growth shown in Figure 3A. However, we are confident that the additional decrease in caterpillar weight shown in Figure 3E is mainly driven by the presence of disarmed parasitoids. To be specific, in Figure 3A the average weight is 0.4544 g for WT and 0.4230 g for KO, so KO weight is 93.1% of WT weight. In Figure 3E, the average weight is 0.4273 g for WT and 0.3637 g for KO, so KO weight is 85.1% of WT weight. We will discuss this interaction between caterpillar growth and the effect of the parasitoid attacks more extensively in the revised version of the manuscript.
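
      The percentages quoted in this response can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical and not part of the authors' analysis, and the inputs are the mean weights reported above.

```python
def ko_percent_of_wt(wt_mean_g: float, ko_mean_g: float) -> float:
    """Return the KO mean weight as a percentage of the WT mean weight."""
    return 100.0 * ko_mean_g / wt_mean_g

# Mean weights (g) as reported in the response for Figure 3A and Figure 3E.
fig3a = ko_percent_of_wt(0.4544, 0.4230)  # no parasitoids present
fig3e = ko_percent_of_wt(0.4273, 0.3637)  # disarmed parasitoids present

print(f"Figure 3A: KO is {fig3a:.1f}% of WT")
print(f"Figure 3E: KO is {fig3e:.1f}% of WT")
```

      Because the two experiments were not run at the same time, the gap between the two ratios is descriptive only and is not a statistical test.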

      (7) Lines 179-181: Figure 3F shows that the survival rate of larvae of Orco KO mutant decreased in the presence of parasitic wasps, and the difference in survival rate of larvae of WT and Orco KO mutant in the absence of parasitic wasps should also be compared. The current data cannot determine whether the reduced survival of the KO mutant is due to the Orco knockout or the presence of parasitic wasps.

      We are happy that you highlight this point. When conducting these experiments, we selected groups of caterpillars and carefully placed them on a leaf with minimal disturbance, which minimized injury and mortality. We did test the survival of caterpillars in the absence of parasitoid wasps in the experiment presented in Figure 3A, although this was missing from the manuscript. There is no significant difference in the survival rate of caterpillars between the two genotypes in the absence of wasps (average mortality WT = 8.8 %, average mortality KO = 2.9 %; P = 0.088, Wilcoxon test), so the decreased survival rate is most likely due to the attack of the wasps. We will add this information to the revised version of the manuscript.

      (8) In Figure 4B, why do the compounds tested have no volatiles derived from plants? Cruciferous plants have the well-known mustard bomb. In the behavioral experiments, the larvae responses to ITC compounds were not included, which is suggested to be explained in the discussion section.

      Thank you for the suggestion. We assume you mean Figure 4D/4E instead of Figure 4B. In Figure 4B, many of the identified chemical compounds are essentially plant volatiles, especially those from caterpillar frass and caterpillar spit. In Figure 4D/4E, most of the tested chemicals are derived from plants. We did include several ITCs in the butterfly EAG tests shown in Figure 2A/B; however, because the butterfly antennae did not respond strongly to ITCs, we did not include ITCs in the subsequent larval behavioural tests. Instead, the tested chemicals in Figure 4D/4E either elicit high EAG responses of butterflies or have been identified as significant by VIP scores in the chemical analyses. We will add this explanation to the revised version of our manuscript.

      (9) The custom-made setup and the relevant behavioral experiments in Figure 4C need to be described in detail (Line 545).

      We will add more detailed descriptions for the setup and method in the Materials and Methods.

      (10) Materials and Methods Line 448: 10 μL paraffin oil should be used for negative control.

      Thank you for pointing this out. We used both clean filter paper and clean filter paper with 10 μL paraffin oil as negative controls, but we did not find a significant difference between the two controls. Therefore, in the EAG results of Figure 2A/2B, we presented paraffin oil as one of the tested chemicals. We will re-run our statistical tests with paraffin oil as negative control, although we do not expect any major differences to the previous tests.

      Reviewer #2 (Public review):

      Weaknesses:

      (1) I do have one major query regarding this manuscript - I agree that the results of the caterpillar choice tests in a y-maze give weight to the idea that olfactory cues may help them avoid areas with higher numbers of parasitoids. However, the experiments with parasitoids were carried out on a single plant. Given that caterpillars in these experiments were very limited in their potential movement and source of food - how likely is it that avoidance played a role in the results seen from these experiments, as opposed to simply the slower growth of the KO caterpillars extending their period of susceptibility? While the two mechanisms may well both take place in nature - only one suggests a direct role of olfaction in enemy avoidance at this life stage, while the other is an indirect effect, hence the distinction is important.

      We do agree with your comment that both mechanisms may be at work in nature, and we do address this in the Discussion section. In our study, we did find that wildtype caterpillars were more efficient in locating their food source and did grow faster on full plants than knockout caterpillars. This faster growth will enable wildtype caterpillars to more quickly outgrow the life-stages most vulnerable to the parasitoids (L1 and L2). The olfactory system therefore supports the escape from parasitoids indirectly by enhancing feeding efficiency directly.

      In addition, we show in our Y-tube experiments that WT caterpillars were able to avoid plants where conspecifics are under attack by parasitoids (Figure 3D). Therefore, we speculate that WT caterpillars make use of volatiles from the plant or from conspecifics via their spit or faeces to avoid plants or leaves potentially attracting natural enemies. Knockout caterpillars are unable to use these volatile danger cues and therefore do not avoid plants or leaves that are most attractive to their natural enemies, making KO caterpillars more susceptible and leading to more natural enemy harassment. Through this, olfaction also directly impacts the ability of a caterpillar to find an enemy-free feeding site.

      We think that olfaction supports the enemy avoidance of caterpillars via both these mechanisms, although at different time scales. Unfortunately, our analysis was not detailed enough to discern the relative importance of the two mechanisms we found. However, we feel that this would be an interesting avenue for further research. Moreover, we will sharpen our discussion on the potential importance of the two different mechanisms in the revised version of this manuscript.

      (2) My other issue was that determining the sample sizes from the text was sometimes a bit confusing. (This was much clearer from the figures).

      We will revise how sample sizes are reported in the text to make them clearer.

      (3) I also couldn't find the test statistics for any of the statistical methods in the main text, or in the supplementary materials.

      Thank you for pointing this out. We will provide more detailed test statistics in the main text and in the supplementary materials of the revised version of the manuscript.

    1. eLife Assessment

      This study presents a valuable open-source and cost-effective method for automating the quantification of male aggression and courtship in Drosophila melanogaster. The work as presented provides solid evidence that the use of the behavioral setup that the authors designed - using readily available laboratory equipment and standardised high-performing classifiers they developed using existing software packages - accurately and reliably characterises social behavior in Drosophila. The work will be of interest to Drosophila neurobiologists and particularly to those working on male social behaviors.

    2. Reviewer #1 (Public review):

      The study introduces an open-source, cost-effective method for automating the quantification of male social behaviors in Drosophila melanogaster. It combines machine-learning-based behavioral classifiers developed using JAABA (Janelia Automatic Animal Behavior Annotator) with inexpensive hardware constructed from off-the-shelf components. This approach addresses the limitations of existing methods, which often require expensive hardware and specialized setups. The authors demonstrate that their new "DANCE" classifiers accurately identify aggression (lunges) and courtship behaviors (wing extension, following, circling, attempted copulation, and copulation), closely matching manually annotated ground-truth data. Furthermore, DANCE classifiers outperform existing rule-based methods in accuracy. Finally, the study shows that DANCE classifiers perform as well when used with low-cost experimental hardware as with standard experimental setups across multiple paradigms, including RNAi knockdown of the neuropeptide Dsk and optogenetic silencing of dopaminergic neurons.

      The authors make creative use of existing resources and technology to develop an inexpensive, flexible, and robust experimental tool for the quantitative analysis of Drosophila behavior. A key strength of this work is the thorough benchmarking of both the behavioral classifiers and the experimental hardware against existing methods. In particular, the direct comparison of their low-cost experimental system with established systems across different experimental paradigms is compelling. While JAABA-based classifiers have been previously used to analyze aggression and courtship (Tao et al., J. Neurosci., 2024; Sten et al., Cell, 2023; Chiu et al., Cell, 2021; Isshi et al., eLife, 2020; Duistermars et al., Neuron, 2018), the demonstration that they work as well without expensive experimental hardware opens the door to more low-cost systems for quantitative behavior analysis.

      Although the study provides a detailed evaluation of DANCE classifier performance, its conclusions would be strengthened by a more comprehensive analysis. The authors assess classifier accuracy using a bout-level comparison rather than a frame-level analysis, as employed in previous studies (Kabra et al., Nat Methods, 2013). They define a true positive as any instance where a DANCE-detected bout overlaps with a manually annotated ground-truth bout by at least one frame. This criterion may inflate true positive rates and underestimate false positives, particularly for longer-duration courtship behaviors. For example, a 15-second DANCE-classified wing extension bout that overlaps with ground truth for only one frame would still be considered a true positive. A frame-level performance analysis would help address this possibility.
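
      The concern about the one-frame overlap criterion can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the bouts are hypothetical (start, end) frame pairs with an exclusive end, the 30 frames-per-second rate is assumed only to turn 15 seconds into 450 frames, and the functions are not the DANCE or JAABA scoring code.

```python
def to_frames(bouts):
    """Expand (start, end) bouts, end exclusive, into a set of frame indices."""
    return {f for start, end in bouts for f in range(start, end)}

def bout_level_precision(predicted, truth):
    """Fraction of predicted bouts that overlap ground truth by at least one frame."""
    truth_frames = to_frames(truth)
    hits = sum(
        1 for start, end in predicted
        if any(f in truth_frames for f in range(start, end))
    )
    return hits / len(predicted)

def frame_level_precision(predicted, truth):
    """Fraction of predicted frames that are also ground-truth frames."""
    predicted_frames = to_frames(predicted)
    return len(predicted_frames & to_frames(truth)) / len(predicted_frames)

# Hypothetical case: a 450-frame predicted bout (15 s at an assumed 30 fps)
# that shares exactly one frame (frame 449) with the annotated bout.
predicted = [(0, 450)]
truth = [(449, 500)]

bout_p = bout_level_precision(predicted, truth)
frame_p = frame_level_precision(predicted, truth)
print(bout_p, frame_p)
```

      In this constructed case the bout-level score counts the prediction as fully correct, while the frame-level score shows that only 1 of 450 predicted frames matches the annotation.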

      In summary, this work provides a practical and accessible approach to quantifying Drosophila behavior, reducing the economic barriers to the study of the neural and molecular mechanisms underlying social behavior.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript addresses the development of a low-cost behavioural setup and standardised open-source high-performing classifiers for aggression and courtship behaviour. It does so by using readily available laboratory equipment and previously developed software packages. By comparing the performance of the setup and the classifiers to previously developed ones, this study shows the classifiers' superior performance and the reliability of the low-cost setup in recapitulating previously described effects of different manipulations on aggression and courtship.

      Strengths:

      The newly developed classifiers for lunges, wing extension, attempted copulation, copulation, following, and circling perform better than previously available ones. The behavioural setup developed is low cost and reliably allows analysis of both aggression and courtship behaviour, validated through social experience manipulation (social isolation), gene knockdown (Dsk in Dilp2 neurons) and neuronal inactivation (dopaminergic neurons), manipulations known to affect courtship and aggression.

      Weaknesses:

      Aggression encompasses multiple defined behaviours, yet only lunges were analysed. Moreover, the CADABRA software to which DANCE was compared analyses further aggression behaviours, making their comparisons incomplete. In addition, though DANCE performs better than CADABRA and Divider in classifying lunges in the behavioural setup tested, it did not yield very high recall and F1 scores.

      DANCE is of limited use for neuronal circuit-level enquiries, since mechanisms for intensity and temporally controlled optogenetic manipulations, which are nowadays possible with open-source software and low-cost hardware, were not embedded in its development.

    4. Reviewer #3 (Public review):

      The preprint by Yadav et al. describes a new setup to quantify a number of aggression and mating behaviors in Drosophila melanogaster. The investigation of these behaviors requires the analysis of a large number of videos to identify each kind of behavior displayed by a fly. Several approaches to automate this process have been published before, but each of them has its limitations. The authors set out to develop a new setup that includes very low-cost, easy-to-acquire hardware and open-source machine-learning classifiers to identify and quantify the behavior.

      Strengths:

      (1) The study demonstrates that their cheap, simple, and easy-to-obtain hardware works just as well as custom-made, specialized hardware for analyzing aggression and mating behavior. This enables the setup to be used in a wide range of settings, from research with limited resources to classroom teaching.

      (2) The authors used previously published software to train new classifiers for detecting a range of behaviors related to aggression and mating and to make them freely available. The classifiers are very positively benchmarked against a manually acquired ground truth as well as existing algorithms.

      (3) The study demonstrates the applicability of the setup (hardware and classifiers) to common methods in the field by confirming a number of expected phenotypes with their setup.

      Weaknesses:

      (1) When measuring the performance of the duration-based classifiers, the authors count any bout of behavior as a true positive if it overlaps with a ground-truth positive for only 1 frame, even though the minimum duration of a bout is 10 frames and most bouts are much longer. That way, true positives could contain cases that are almost totally wrong as long as there was an overlap of a single frame. For the mating behaviors that are classified in ongoing bouts, I think performance should be evaluated based on the % of correctly classified frames, not bouts.

      (2) In the methods part, only one of the pre-existing algorithms (MateBook) is described. Given that the comparison with those algorithms is so central to the manuscript, each of them should be briefly explained and the settings used in this study should be described.

      Taken together, this work can greatly facilitate research on aggression and mating in Drosophila. The combination of low-cost, off-the-shelf hardware and open-source, robust software enables researchers with very little funding or technical expertise to contribute to the scientific process and also allows large-scale experiments, for example in classroom teaching with many students, or for systematic screenings.

    1. Reviewer #2 (Public Review):

      In this study, the authors characterize the defensive responses of C. elegans to the predatory Pristionchus species. Drawing parallels to ecological models of predatory imminence and prey refuge theory, they outline various behaviors exhibited by C. elegans when faced with predator threats. They also find that these behaviors can be modulated by the peptide NLP-49 and its receptor SEB-3 in various degrees.

      The conclusions of this paper are mostly well-supported, the writing and the figures are clear and easy to interpret. However, some of the claims need to be better supported and the unique findings of this work should be clarified better in text.

      (1) Previous work by the group (Quach, 2022) showed that Pristionchus adopt a "patrolling strategy" on a lawn with adult C. elegans and this depends on bacterial lawn thickness. Consequently, it may be hypothesized that C. elegans themselves will adopt different predator avoidance strategies depending on predator tactics differing due to lawn variations. The authors have not shown why they selected a particular size and density of bacterial lawn for the experiments in this paper, and should run control experiments with thinner and denser lawns with differing edge densities to make broad arguments about predator avoidance strategies for C. elegans. In addition, C. elegans leaving behavior from bacterial lawns (without predators) is also heavily dependent on the density of bacteria, especially at the edges where it affects oxygen gradients (Bendesky, 2011), and this might alter the baseline leaving rates irrespective of predation threats. The authors also do not mention if all strains or conditions in each figure panel were run as day-matched controls. Given that bacterial densities and ambient conditions can affect C. elegans behavior, especially that of lawn-leaving, it is important to run day-matched controls.

      (2) Both the patch-leaving and feeding in outstretched posture behaviors described here in this study were reported in an earlier paper by the same group (Quach, 2022) as mentioned by the authors in the first section of the results. While they do characterize these further in this study, these are not novel findings of this work.

      (3) For Figures 1F-H, given that animals can reside on the lawn edges as well as the center, bins explored are not a definitive metric of exploration since the animals can decide to patrol the lawn boundary (especially since the lawns have thick edges). The authors should also quantify tracks along the edge from videographic evidence as they have done previously in Figure 5 of Quach, 2022 to get a total measure of distance explored.

      (4) Where were the animals placed in the wide-arena predator-free patch post encounter? It is mentioned that the animal was placed at the center of the arena in lines 220-221. While this makes sense for the narrow-arena, it is unclear how far from the patch animals were positioned for the wide exit arena. Is it the same distance away as the distance of the patch from the center of the narrow exit arena? Please make this clear in the text or in the methods.

      (5) Do exit decisions from the bacterial patch scale with number of bites or is one bite sufficient? Do all bites lead to bite-induced aversive response? This would be important to quantify especially if contextualizing to predatory imminence.

      (6) Why are the threats posed by aversive but non-lethal JU1051 and lethal PS312 evaluated similarly? Did the authors characterize if the number of bites are different for these strains? Can the authors speculate on why this would happen in the discussion?

      (7) The authors indicate that bites from the non-aversive TU445 led to a low number of exits and thus it was consequently excluded from further analysis. If anything, this strain would have provided a good negative control and baseline metrics for other circa-strike and post-encounter behaviors.

      (8) For Figures 3G and H, the reduction in bins explored (bins_none - bins_RS5194) due to the presence of predators should be compared between wildtype and mutants, instead of the difference between none and RS5194 for each strain.

      (9) While the authors argue that baseline speeds of seb-3 are similar to wild type (Figure S3), previous work (Jee, 2012) has shown that seb-3 not only affects speed but also roaming/dwelling states which will significantly affect the exploration metric (bins explored) which the authors use in Figs 3G-H and 4E-F. Control experiments are necessary to avoid this conundrum. Authors should either visualize and quantify tracks (as suggested in 3) or quantify roaming-dwelling in the seb-3 animals in the absence of predator threat.

      (10) While it might be beyond the scope of the study, it would be nice if the authors could speculate on potential sites of actions of NLP-49 in the discussion, especially since it is expressed in a distinct group of neurons.

    2. Reviewer #1 (Public Review):

      Summary:

      In this manuscript, Quach et al. report a detailed investigation into the defense mechanisms of Caenorhabditis elegans in response to predatory threats from Pristionchus pacificus. Based on principles from predatory imminence and prey refuge theories, the authors delineate three defense modes (pre-encounter, post-encounter, and circa-strike) corresponding to increasing levels of threat proximity. These modes are observed in a controlled but naturalistic setup and are quantified by multiple behavioral outputs defined in time and/or space domains allowing nuanced phenotypic assays. The authors demonstrate that C. elegans displays graded defense behavioral responses toward varied lethality of threats and that only life-threatening predators trigger all three defense modes. The study also offers a narrative on the behavioral strategies and underlying molecular regulation, focusing on the roles of SEB-3 receptors and NLP-49 peptides in mediating responses in these defense modes. They found that the interplay between SEB-3 and NLP-49 peptides appears complex, as evidenced by the diverse outcomes when either or both genes are manipulated in various behavioral modes.

      Strengths:

      The paper presents an interesting story, with carefully designed experiments and necessary controls, and novel findings and implications about predator-induced defensive behaviors and underlying molecular regulation in this important model organism. The design of experiments and description of findings are easy to follow and well-motivated. The findings contribute to our understanding of stress response systems and offer broader implications for neuroethological studies across species.

      Weaknesses:

      Although the study is overall well designed and motivated, the paper could benefit from further improvements to some of the methods descriptions and experiment interpretations.

    3. eLife Assessment

      This study presents a valuable finding on predator threat detection in C. elegans and the role of neuropeptide systems in defensive behavioral strategies. The evidence supporting the conclusions is solid, although additional analyses and control experiments would strengthen the claims of the study. Overall, the work is of interest to the C. elegans community as well as neuroethologists and ecologists studying predator-prey interactions.

    1. eLife Assessment

      This useful study reports detailed molecular dynamics (MD) simulations of T-cell receptors (TCRs) in complex with a peptide/MHC complex, for a better understanding of the mechanism of T-cell activation. The MD simulations provide solid evidence supporting that different TCRs can respond mechanically in different ways upon binding to the same pMHC complex. The analyses are systematic and provide testable predictions that can be evaluated by future mutagenesis and force microscopy studies.

    2. Reviewer #1 (Public review):

      Summary:

      This paper describes molecular dynamics simulations (MDS) of the dynamics of two T-cell receptors (TCRs) bound to the same major histocompatibility complex molecule loaded with the same peptide (pMHC). The two TCRs (A6 and B7) bind to the pMHC with similar affinity and kinetics, but employ different residue contacts. The main purpose of the study is to quantify via MDS the differences in the inter- and intra-molecular motions of these complexes, with a specific focus on what the authors describe as catch-bond behavior between the TCRs and pMHC, which could explain how T-cells can discriminate between different peptides in the presence of weak separating force.

      Strengths:

      The authors present extensive simulation data that indicates that, in both complexes, the number of high-occupancy inter-domain contacts initially increases with applied load, which is generally consistent with the authors' conclusion that both complexes exhibit catch-bond behavior, although to different extents. In this way, the paper expands our understanding of peptide discrimination by T-cells. The conclusions of the study are generally well supported by data. Further, the paper makes predictions about the relative strength of the catch-bond response of the two TCRs, which could be tested experimentally through protein mutagenesis and force application in Atomic Force Microscopy.

    3. Reviewer #2 (Public review):

      In this work, Chang-Gonzalez and coworkers follow up on an earlier study on the force-dependence of peptide recognition by a T-cell receptor using all-atom molecular dynamics simulations. In this study, they compare the results of pulling on a TCR-pMHC complex between two different TCRs with the same peptide. A goal of the paper is to determine whether the newly studied B7 TCR has the same load-dependent behavior mechanism shown in the earlier study for A6 TCR. The primary result is that while the unloaded interaction strength is similar, A6 exhibits more force-stabilization.

      This is a detailed study, and establishing the difference between these two systems with and without applied force may establish them as a good reference setup for others who want to study mechanobiological processes if the data were made available, and could give additional molecular details for T-Cell-specialists.

    4. Reviewer #3 (Public review):

      Summary:

      The paper by Chang-Gonzalez et al. is a molecular dynamics (MD) simulation study of the dynamic recognition (load-induced catch bond) by the T cell receptor (TCR) of the complex of peptide antigen (p) and the major histocompatibility complex (pMHC) protein. The methods and simulation protocols are essentially identical to those employed in a previous study by the same group (Chang-Gonzalez et al., eLife 2024). In the current manuscript the authors compare the binding of the same pMHC complex to two different TCRs, B7 and A6 which was investigated in the previous paper. While the binding is more stable for both TCRs under load (of about 10-15 pN) than in the absence of load, the main difference is that B7 shows a smaller amount of stable contacts with the pMHC than A6.

      Strengths:

      The topic is interesting because of the relevance of mechanosensing in biological processes including cellular immunology. The MD simulations provide strong evidence that different TCRs can respond mechanically in a different way upon binding the same pMHC complex. These findings are useful for interpreting how mechanical force is employed for modulating different functions of T cells.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      Summary:

      This paper describes molecular dynamics simulations (MDS) of the dynamics of two T-cell receptors (TCRs) bound to the same major histocompatibility complex molecule loaded with the same peptide (pMHC). The two TCRs (A6 and B7) bind to the pMHC with similar affinity and kinetics, but employ different residue contacts. The main purpose of the study is to quantify via MDS the differences in the inter- and intra-molecular motions of these complexes, with a specific focus on what the authors describe as catch-bond behavior between the TCRs and pMHC, which could explain how T-cells can discriminate between different peptides in the presence of weak separating force.

      Strengths:

      The authors present extensive simulation data that indicates that, in both complexes, the number of high-occupancy interdomain contacts initially increases with applied load, which is generally consistent with the authors’ conclusion that both complexes exhibit catch-bond behavior, although to different extents. In this way, the paper somewhat expands our understanding of peptide discrimination by T-cells.

      a. The reviewer makes a thoughtful assessment of our manuscript. While our manuscript is meant to be a “short” contribution, our significant new finding is that even for TCRs targeting the same pMHC, having similar structures, and leading to similar functional outcomes in conventional assays, their response to applied load can be different. This supports our recent experimental work where TCRs targeting the same pMHC differed in their catch bond characteristics, and importantly, in their response to limiting copy numbers of pMHCs on the antigen-presenting cell (Akitsu et al., Sci. Adv., 2024).

      Weaknesses:

      While generally well supported by data, the conclusions would nevertheless benefit from a more concise presentation of information in the figures, as well as from suggesting experimentally testable predictions.

      b. We have updated all figures for clear and streamlined presentation. We have also created four figure supplements to cover more details.

      Regarding testable predictions, an important prediction is that B7 TCR would exhibit a weaker catch bond behavior than A6 (line 297–298). This is a nontrivial prediction because the two TCRs targeting the same pMHC have similar structures and are functionally similar in conventional assays. This prediction can be tested by single-molecule optical tweezers experiments. Based on our recent experiments (Akitsu et al., Sci. Adv., 2024), we also predict that A6 and B7 TCRs will differ in their ability to respond to cases when the number of pMHC molecules presented is limited. Details of how they would differ require further investigation, which is beyond the scope of the present work (line 314–319).

      Another testable prediction for the conservation of the basic allostery mechanism is to test the Cβ FG-loop deletion mutant located at the hinge region of the β chain, where the deletion severely impairs the catch bond formation (line 261–264).

      Reviewer 2:

      In this work, Chang-Gonzalez and coworkers follow up on an earlier study on the force-dependence of peptide recognition by a T-cell receptor using all-atom molecular dynamics simulations. In this study, they compare the results of pulling on a TCR-pMHC complex between two different TCRs with the same peptide. A goal of the paper is to determine whether the newly studied B7 TCR has the same load-dependent behavior mechanism shown in the earlier study for A6 TCR. The primary result is that while the unloaded interaction strength is similar, A6 exhibits more force stabilization.

      This is a detailed study, and establishing the difference between these two systems with and without applied force may establish them as a good reference setup for others who want to study mechanobiological processes if the data were made available, and could give additional molecular details for T-Cell-specialists. As written, the paper contains an overwhelming amount of details and it is difficult (for me) to ascertain which parts to focus on and which results point to the overall take-away messages they wish to convey.

      R2-a. As mentioned above and as the reviewer correctly pointed out, the condensed appearance of this manuscript arose largely because we intended it to be a Research Advances article as a short follow up study of our previous paper on A6 TCR published in eLife. Most of the analysis scripts for the A6 TCR study are already available on Github. For the present manuscript, we have created a separate Github repository containing sample simulation systems and scripts for the B7 TCR.

      Regarding the focus issue, it is in part due to the complex nature of the problem, which required simulations under different conditions and multi-faceted analyses. We believe the extensive updates to the figures and text make for a clearer presentation. But we note that even in the earlier version, the reviewer pointed out the main take-away message well: “The primary result is that while the unloaded interaction strength is similar, A6 exhibits more force stabilization.”

      Detailed comments:

      (1) In Table 1 - are the values of the extension column the deviation from the average length at zero force (that is what I would term extension) or is it the distance between anchor points (which is what I would assume based on the large values)? If the latter, I suggest changing the heading, and then also reporting the average extension with an asterisk indicating no extensional restraints were applied for B7-0, or just listing 0 load in the load column. Standard deviation in this value can also be reported. If it is an extension as I would define it, then I think B7-0 should indicate extension = 0+/- something. The distance between anchor points could also be labeled in Figure 1A.

      R2-b. “Extension” is the distance between anchor points that the reviewer is referring to (blue spheres at the ends of the added strands in Figure 1A). While its meaning should be clear in the section “Laddered extensions” in “MD simulation protocol” (line 357–390), in a strict sense, we agree that using it for the end-to-end distance can be confusing. However, since we have already used it in our previous two papers (Hwang et al., PNAS 2020 and Chang-Gonzalez et al., eLife, 2024), we prefer to keep it for consistency. Instead, in the caption of Table 1, we explained its meaning, and also explicitly labeled it in Figure 1A, as the reviewer suggested.

      Please also note that the no-load case B7<sup>0</sup> was performed by separately building a TCR-pMHC complex without added linkers (line 352), and holding the distal part of pMHC (the α3 domain) with weak harmonic restraints (line 406–408). Thus, no extension can be assigned to B7<sup>0</sup>. We added a brief explanation about holding the MHC α3 domain for B7<sup>0</sup> in line 83–85.

      (2) As in the previous paper, the authors apply “constant force” by scanning to find a particular bond distance at which a desired force is selected, rather than simply applying a constant force. I find this approach less desirable unless there is experimental evidence suggesting the pMHC and TCR were forced to be a particular distance apart when forces are applied. It is relatively trivial to apply constant forces, so in general, I would suggest this would have been a reasonable comparison. Line 243-245 speculates that there is a difference in catch bonding behavior that could be inferred because lower force occurs at larger extensions, but I do not believe this hypothesis can be fully justified and could be due to other differences in the complex.

      R2-c. There is indeed experimental evidence that the TCR-pMHC complex operates under constant separation. The spacing between a T cell and an antigen-presenting cell is maintained by adhesion molecules such as the CD2-CD58 pair, as explained in our paper on the A6 TCR (Chang-Gonzalez et al., eLife, 2024) and also in our previous review paper (Reinherz et al., PNAS, 2023). In in vitro single-molecule experiments, pulling to a fixed separation and holding is also commonly done. We added an explanation about this in line 79–83 of the manuscript. On the other hand, force between a T cell and an antigen-presenting cell is also controlled by the actin cytoskeleton, which makes the applied load not a simple function of the separation between the two cells. An explanation about this was added in line 300–303. Detailed comparison between constant-extension and constant-force simulations is definitely a subject of our future study.

      Regarding line 243–245 of the original submission (line 297–298 of the revised manuscript), we agree with the reviewer that without further tests, lower forces at larger extensions per se cannot be an indicator that B7 forms a weaker catch bond. But with additional information, one can see it does have relevance to the catch bond strength. In addition to fewer TCR-pMHC contacts (Figure 1C of our manuscript), the intra-TCR contacts are also reduced compared to those of A6 (bottom panel of Figure 1D vs. Chang-Gonzalez et al., eLife, 2024, Figure 8A,B, first column). Based on these data, we calculated the average total intra-TCR contact occupancies in the 500–1000-ns interval, which was 30.4±0.49 (average±std) for B7 and 38.7±0.87 for A6. This result shows that the B7 TCR forms a looser complex with pMHC compared to A6. Also, B7<sup>low</sup> and B7<sup>high</sup> differ in extension by 16.3 Å while A6<sup>low</sup> and A6<sup>high</sup> differ by 5.1 Å, for a similar ∼5-pN difference between low- and high-load cases. With the higher compliance of B7, it would be more difficult to achieve load-induced stabilization of the TCR-pMHC interface, hence a weaker catch bond. We explained this in line 129–132 and line 292–297.
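
      As an illustration of the compliance argument above, the short sketch below divides the quoted extension differences by the load difference. It is not taken from the manuscript: the function name is hypothetical, and the ∼5-pN load difference is treated as exactly 5 pN for both TCRs, which is an assumption.

```python
# Illustrative arithmetic only. The extension differences (16.3 Å and 5.1 Å)
# are quoted from the response above; the load difference is the approximate
# ~5 pN stated there and is assumed here to be exactly 5 pN for both systems.

def compliance(delta_extension_angstrom: float, delta_force_pn: float) -> float:
    """Effective compliance (Å per pN) between the low- and high-load states."""
    return delta_extension_angstrom / delta_force_pn


DELTA_FORCE_PN = 5.0  # assumed common low-to-high load difference

b7 = compliance(16.3, DELTA_FORCE_PN)
a6 = compliance(5.1, DELTA_FORCE_PN)
ratio = b7 / a6

print(f"B7: {b7:.2f} Å/pN, A6: {a6:.2f} Å/pN, ratio: {ratio:.1f}")
```

      Under these assumptions, B7 comes out roughly three times more compliant than A6, which is the sense in which it is described as the looser complex.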

      (3) On a related note, the authors do not refer to or consider other works using MD to study force-stabilized interactions (e.g. for catch bonding systems), e.g. these cases where constant force is applied and enhanced sampling techniques are used to assess the impact of that applied force: https://www.cell.com/biophysj/fulltext/S0006-3495(23)00341-7, https://www.biorxiv.org/content/10.1101/2024.10.10.617580v1. I was also surprised not to see this paper on catch bonding in pMHC-TCR referred to, which also includes some MD simulations: https://www.nature.com/articles/s41467-023-38267-1

      R2-d. We thank the reviewer for bringing the three papers to our attention, which are:

      (1) Languin-Cattoën, Sterpone, and Stirnemann, Biophys. J. 122:2744 (2023): About bacterial adhesion protein FimH.

      (2) Peña Ccoa, et al., bioRxiv (2024): About actin binding protein vinculin.

      (3) Choi et al., Nat. Comm. 14:2616 (2023): About a mathematical model of the TCR catch bond.

      Catch bond mechanisms of FimH and vinculin are different from that of TCR in that FimH and vinculin have relatively well-defined weak- and strong-binding states where there are corresponding crystal structures. Availability of the end-state structures permits simulation approaches such as enhanced sampling of individual states and studying the transition between the two states. In contrast, TCR does not have any structurally well-defined weak- or strong-binding states, which requires a different approach. As demonstrated in our current manuscript as well as in our previous two papers (Hwang et al., PNAS 2020 and Chang-Gonzalez et al., eLife, 2024), our microsecond-long simulations of the complex under realistic pN-level loads and a combination of analysis methods are effective for elucidating the catch bond mechanism of TCR. These are explained in line 227–238 of the manuscript.

      The third paper (Choi, et al., 2023) proposes a mathematical model to analyze extensive sets of data, and also performs new experiments and additional simulations. Of note, their model assumptions are based mainly on the steered MD (SMD) simulation in their previous paper (Wu, et al., Mol. Cell. 73:1015, 2019). In their model, formation of a catch bond (called catch-slip bond in Choi’s paper) requires partial unfolding of MHC and tilting of the TCR-pMHC interface. Our mechanism does not conflict with their assumptions since the complex in the fully folded state should first bear load in a ligand-dependent manner in order to allow any larger-scale changes. This is explained in line 239–243.

      For the revised text mentioned above (line 227–243), in addition to the 3 papers that the reviewer pointed out, we cited the following papers:

      • Thomas, et al., Annu. Rev. Biophys. 2008: Catch bond mechanisms in general.

      • Bakolitsa et al., Cell 1999, Le Trong et al., Cell 2010, Sauer et al., Nat. Comm. 2016, Mei et al., eLife 2020: Crystal structures of FimH and vinculin in different states.

      • Wu, et al., Mol. Cell. 73:1015, 2019: The SMD simulation paper mentioned above.

      (4) The authors should make at least the input files for their system available in a public place (github, zenodo) so that the systems are a more useful reference system as mentioned above. The authors do not have a data availability statement, which I believe is required.

      R2-d. As mentioned in R2-a above, we have added a Github repository containing sample simulation systems and scripts for the B7 TCR.

      Reviewer 3:

      Summary:

      The paper by Chang-Gonzalez et al. is a molecular dynamics (MD) simulation study of the dynamic recognition (load-induced catch bond) by the T cell receptor (TCR) of the complex of peptide antigen (p) and the major histocompatibility complex (pMHC) protein. The methods and simulation protocols are essentially identical to those employed in a previous study by the same group (Chang-Gonzalez et al., eLife 2024). In the current manuscript, the authors compare the binding of the same pMHC to two different TCRs, B7 and A6 which was investigated in the previous paper. While the binding is more stable for both TCRs under load (of about 10-15 pN) than in the absence of load, the main difference is that, with the current MD sampling, B7 shows a smaller amount of stable contacts with the pMHC than A6.

      Strengths:

      The topic is interesting because of the (potential) relevance of mechanosensing in biological processes including cellular immunology.

      Weaknesses:

      The study is incomplete because the claims are based on a single 1000-ns simulation at each value of the load and thus some of the results might be marred by insufficient sampling, i.e., statistical error. After the first 600 ns, the higher load of B7<sup>high</sup> than B7<sup>low</sup> is due mainly to the simulation segment from about 900 ns to 1000 ns (Figure 1D). Thus, the difference in the average value of the load is within their standard deviation (9 +/- 4 pN for B7<sup>low</sup> and 14.5 +/- 7.2 for B7<sup>high</sup>, Table 1). Even more strikingly, Figure 3E shows a lack of convergence in the time series of the distance between the V-module and pMHC, particularly for B7<sup>0</sup> (left panel, yellow) and B7<sup>low</sup> (right panel, orange). More and longer simulations are required to obtain a statistically relevant sampling of the relative position and orientation of the V-module and pMHC.

      R3-a. The reviewer uses data points during the last 100 ns to raise an issue with sampling. But since we are using realistic pN range forces, force fluctuates more slowly. In fact, in our simulation of B7<sup>high</sup>, while the force peaks near 35 pN at 500 ns (Figure 1D of our manuscript), the interfacial contacts show no noticeable changes around 500 ns (Figure 2B and Figure 2–figure supplement 1C of our manuscript). Similarly slow fluctuation of force was also observed for A6 TCR (Figure 8 of Chang-Gonzalez et al., eLife (2024)). Thus, a wider time window must be considered rather than focusing on forces in the last 100-ns interval.

      To compare fluctuation in forces, we added Figure 1–figure supplement 2, which is based on Appendix 3–Figure 1 of our A6 paper. It shows the standard deviation in force versus the average force during the 500–1000-ns interval for various simulations in both A6 (open black circles) and B7 (red squares) systems. Except for Y8A<sup>low</sup> and dFG<sup>low</sup> of A6 (explained below), the data points lie on nearly a straight line.

      Thermodynamically, the force and position of the restraint (blue spheres in Figure 1A of our manuscript) form a pair of generalized force and the corresponding spatial variable in equilibrium at temperature 300 K, which is akin to the pressure P and volume V of an ideal gas. If V is fixed, P fluctuates. Denoting the average and std of pressure as ⟨P⟩ and ∆P, respectively, Burgess showed that ∆P/⟨P⟩ is a constant (Eq. 5 of Burgess, Phys. Lett. A, 44:37; 1973). In the case of the TCRαβ-pMHC system, although individual atoms are not ideal gases, since their motion leads to the fluctuation in force on the restraints, the situation is analogous to the case where pressure arises from individual ideal gas molecules hitting the confining wall as the restraint. Thus, the near-linear behavior in the figure above is a consequence of the system being many-bodied and at constant temperature. The linearity is also an indicator that sampling of force was reasonable in the 500–1000-ns interval. The fact that A6 and B7 data show a common linear profile further demonstrates the consistency in our force measurement. About the two outliers of A6, Y8A<sup>low</sup> is for an antagonist peptide and dFG<sup>low</sup> is the Cβ FG-loop deletion mutant. Both cases had reduced numbers of contacts with pMHC, which likely caused a wider conformational motion, hence greater fluctuation in force.
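
      The linearity check described above can be sketched as follows: compute the mean and standard deviation of force in a fixed time window for each run, then fit std = c·mean through the origin, so that c plays the role of ∆P/⟨P⟩. This is a minimal sketch and not the authors' analysis script; the function names and the force traces are hypothetical, and only the 500–1000-ns window is taken from the text.

```python
import statistics


def window_stats(times_ns, forces_pn, t_start=500.0, t_end=1000.0):
    """Mean and (population) standard deviation of force within [t_start, t_end] ns."""
    window = [f for t, f in zip(times_ns, forces_pn) if t_start <= t <= t_end]
    return statistics.fmean(window), statistics.pstdev(window)


def slope_through_origin(means, stds):
    """Least-squares slope c of std = c * mean, i.e. a common relative fluctuation."""
    return sum(m * s for m, s in zip(means, stds)) / sum(m * m for m in means)


# Hypothetical force traces (pN) sampled every 100 ns; NOT simulation data.
times = [500.0, 600.0, 700.0, 800.0, 900.0, 1000.0]
traces = {
    "run_a": [8.0, 12.0, 8.0, 12.0, 8.0, 12.0],     # mean 10 pN, std 2 pN
    "run_b": [16.0, 24.0, 16.0, 24.0, 16.0, 24.0],  # mean 20 pN, std 4 pN
}

stats = {name: window_stats(times, forces) for name, forces in traces.items()}
means = [mean for mean, _ in stats.values()]
stds = [std for _, std in stats.values()]
c = slope_through_origin(means, stds)

print(f"relative fluctuation c = {c:.2f}")
```

      With real trajectories, runs that fall well off the fitted line (as the response reports for Y8A<sup>low</sup> and dFG<sup>low</sup>) would show up as large residuals from std = c·mean.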

      Upon suggestion by the reviewer, we extended the simulations of B7<sup>0</sup>, B7<sup>low</sup> and B7<sup>high</sup> to about 1500 ns (Table 1). While B7<sup>0</sup> and B7<sup>low</sup> behaved similarly, B7<sup>high</sup> started to lose contacts at around 1300 ns (top panel of Figure 1D and Figure 2B). A closer inspection revealed that destabilization occurred when the complex reached low-force states. Even before 1300 ns, at about 750 ns, the force on B7<sup>high</sup> drops below 5 pN, and another drop in force occurred at around 1250 ns, though to a lesser extent (Figure 1D). These changes are followed by an increase in the Hamming distance (Figure 2B). Thus, in B7<sup>high</sup>, destabilization is caused not by a high force, but by a lack of force, which is consistent with the overarching theme of our work, the load-induced stabilization of the TCRαβ-pMHC complex.

      The destabilization of B7<sup>high</sup> during our simulation is a combined effect of its overall weaker interface compared to A6 (despite having a comparable number of contacts in crystal structures; line 265–269), and its high compliance (explained in the second paragraph of our response R2-c above). Under a fixed extension, the higher compliance of the complex can reach a low-force state where breakage of contacts can happen. In reality, with an approximately constant spacing between a T cell and an antigen-presenting cell, force is also regulated by the actin cytoskeleton (explained in the first paragraph of R2-c above). While detailed comparison between constant-extension and constant-force simulation is the subject of a future study, for this manuscript, we used the 500–1000-ns interval for calculating time-averaged quantities, for consistency across different simulations. For time-dependent behaviors, we showed the full simulation trajectories, which are Figure 1D, Figure 2B, Figure 2–figure supplement 1 (except for panel E), and Figure 4–figure supplement 1B.

      Thus, rather than performing replicate simulations, we perform multiple simulations under different conditions and analyze them from different angles to obtain a consistent picture. If one were interested in quantitative details under a given condition, e.g., dynamics of contacts for a given extension or the time when destabilization occurs at a given force, replicate simulations would be necessary. However, our main conclusions such as load-induced stabilization of the interface through the asymmetric motion, and B7 forming a weaker complex compared to A6, can be drawn from our extensive analysis across multiple simulations. Please also note that reviewer 1 mentioned that our conclusions are “generally well supported by data.”

      A similar argument applies to Figure 2–figure supplement 1F (old Figure 3B that the reviewer pointed out). If precise values of the V-module to pMHC distance were needed, replicate simulations would be necessary. However, the figure demonstrates that B7<sup>high</sup> maintains a more stable interface before the disruption at 1300 ns compared to B7<sup>low</sup>, which is consistent with all other measures of interfacial stability we used. The above points are explained throughout our updated manuscript, including

      • Line 106–110, 125–132, 156–158, 298–303.

      • Figures showing time-dependent behaviors have been updated and Figure 1–figure supplement 2 has been added, as explained above.

      It is not clear why “a 10 Å distance restraint between alphaT218 and betaA259 was applied” (section MD simulation protocol, page 9).

      R3-b. αT218 and βA259 are the residues attached to a leucine-zipper handle in in vitro optical trap experiments (Das, et al., PNAS 2015). In T cells, those residues also connect to transmembrane helices. Our newly added Figure 1–figure supplement 1 shows a model of N15 TCR used in experiments in Das’ paper, constructed based on PDB 1NFD. Blue spheres represent C<sub>α</sub> atoms corresponding to αT218 and βA259 of B7 TCR. Their distance is 6.7 Å. The 10-Å distance restraint in simulation was applied to mimic the presence of the leucine zipper that prevents excessive separation of the added strands. The distance restraint is a flat-bottom harmonic potential which is activated only when the distance between the two atoms exceeds 10 Å, which we did not clarify in our original manuscript. It is now explained in line 371–373. The same restraint was used in our previous studies on JM22 and A6 TCRs.
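
      For readers unfamiliar with the term, a one-sided flat-bottom harmonic restraint of the kind described above can be sketched as below. This is a generic illustration, not the authors' simulation input: the function names, the force constant, and the 1/2 prefactor convention are assumptions, and only the 10-Å threshold and the 6.7-Å reference distance come from the text.

```python
# Generic one-sided flat-bottom harmonic restraint (illustration only).
# The 10 Å threshold is from the text; the force constant k and the 1/2
# prefactor are hypothetical choices, since conventions differ between
# simulation packages.

def flat_bottom_energy(distance: float, threshold: float = 10.0, k: float = 1.0) -> float:
    """Restraint energy: zero up to `threshold` (Å), harmonic beyond it."""
    excess = max(0.0, distance - threshold)
    return 0.5 * k * excess ** 2


def flat_bottom_force(distance: float, threshold: float = 10.0, k: float = 1.0) -> float:
    """Force along the interatomic distance; negative values pull the atoms together."""
    excess = max(0.0, distance - threshold)
    return -k * excess


# At the 6.7 Å separation quoted for the model, the restraint is inactive.
print(flat_bottom_energy(6.7), flat_bottom_force(6.7))
# Beyond 10 Å it acts like a spring anchored at the threshold.
print(flat_bottom_energy(12.0), flat_bottom_force(12.0))
```

      The design point is that the restraint adds no energy or force while the two atoms stay within the threshold, so it only limits excessive separation rather than fixing the distance.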

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Clarify the reason for including arguably non-physiological simulations, in which the C domain is missing. Is the overall point that it is essential for proper peptide discrimination?

      R1-c. This is somewhat a philosophical question. Rather than recapitulating experiment, we believe the goal of simulation is to gain insight. Hence, a model should be justified by its utility rather than its direct physiological relevance. The system lacking the C-module is useful since it informs about the allosteric role of the C-module by comparing its behavior with that of the full TCRαβ-pMHC complex. The increased interfacial stability of Vαβ-pMHC is also consistent with our discovery that the C-module likely undergoes a partial unfolding to an extended state, where the bond lifetime increases (Das, et al., PNAS 2015; Akitsu et al., Sci. Adv., 2024). In this sense, Vαβ-pMHC has a more direct physiological relevance. Furthermore, considering single-chain versions of an antibody lacking the C-module (scFv) are in widespread use (Ahmad et al., J. Immunol. Res., 2012) including CAR T cells, a better understanding of a TCR lacking the C-module may help with developing a novel TCR-based immunotherapy. These explanations have been added in line 253–261.

      (2) Suggest changing Vαβ-pMHC to B7<sup>0</sup>∆C to emphasize that the constant domain is deleted.

      R1-d. While we appreciate the reviewer’s suggestion, the notation Vαβ-pMHC was used in our previous two papers (Hwang, PNAS 2020, Chang-Gonzalez, eLife 2024). We thus prefer to keep the existing notation.

      (3) Suggest adding A6 data to table 1 for comparison, making it clear if it is from a previous paper.

      R1-e. Table 1 of the present manuscript and Table 1 of the A6 paper differ in items displayed. Instead of merging, we added the extension and force for A6 corresponding to B7<sup>low</sup> and B7<sup>high</sup> in the caption of Table 1.

      (4) Suggest discussing the catch-bond behavior in terms of departure from equilibrium, e.g. is it possible to distinguish between different (catch vs slip) bond behaviors on the basis of work of separation histograms? If the difference does not show up in equilibrium work, the exponential work averages would be similar, but work histograms could be very different.

      R1-f. Although energetics of the catch versus slip bond will provide additional insight, it is beyond the scope of the present simulations that do not involve dissociation events nor simulations of slip-bond receptors. We instead briefly mention the energetic aspect in terms of T-cell activation in line 316–319.

      (5) Have the simulations in Figure 1 reached steady state? The force and occupancy increase almost linearly up until 500ns, then seem to decrease rather dramatically by 750ns. It might be worthwhile to extend one simulation to check.

      R1-g. We did extend the simulation to about 1500 ns. The large and slow fluctuation in force is an inherent property of the system, as explained in R3-a above.

      (6) Is the loss of contacts for B7<sup>0</sup> due to thermalization and relaxation away from the X-ray structure?

      R1-h. The initial thermalization at 300 K is not responsible for the loss of contacts for B7<sup>0</sup> since we applied distance restraints to the initial contacts to keep them from breaking during the preparatory runs (line 358–370). While ‘relaxation away from the X-ray structure’ gives an impression that the complex approaches an equilibrium conformation in the absence of the crystallographic confinement, our simulation indicates that the stability of the complex depends on the applied load. We made the distinction between relaxation and the load-dependent stability clearer in line 233–238.

      (7) Figure 4 contains a very large amount of data. Could it be simplified and partly moved to SI? For example, panel G is somewhat hard to read at this scale, and seems non-essential to the general reader.

      R1-i. Upon the reviewer’s suggestion, we simplified Figure 4 by moving some of the panels to Figure 4–figure supplement 1. Panels have also been made larger for better readability.

      (8) If the coupling between C and V domains is necessary for catch-bond behavior, can one propose mutations that would disrupt the interface to test by experiment? This would be interesting in light of the authors’ own comment on p. 8 that ’a logical evolutionary pressure would be for the C domains to maximize discriminatory power by adding instability to the TCR chassis,’ which might lead to a verifiable hypothesis.

      R1-j. This has already been computationally and experimentally tested for other TCRs by the Cβ FG-loop deletion mutants that diminish the catch bond (Das, et al., PNAS 2015; Hwang et al., PNAS 2020; Chang-Gonzalez et al., eLife, 2024). Furthermore, the Vγδ-Cαβ chimera, in which the C-module of TCRγδ is replaced by that of TCRαβ to strengthen the V-C coupling, achieved a gain-of-function catch bond character, while the wild-type TCRγδ is a slip-bond receptor (Mallis, et al., PNAS 2021; Bettencourt et al., Biophys. J. 2024). We added our prediction that the FG-loop deletion mutants of B7 TCR will behave similarly in line 261–264.

      (9) Regarding extending TCR and MHC termini using native sequences, as described in the methods, what would be the disadvantage of using the same sequence, which could be made much more rigid, e.g. a poly-Pro sequence? After all, the point seems to be applying a roughly constant force, but flexible/disordered linkers seem likely to increase force fluctuation.

      R1-k. The purpose of adding linkers was to allow a certain degree of longitudinal and transverse motion as would occur in vivo. While it will be worthwhile to explore the effects of linker flexibility on the conformational dynamics of the complex, for the present study, we used the actual sequence for the linkers for those proteins (line 341–344).

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2 is almost illegible, especially Figure 2A-D. I do not think that these contacts vs time would be useful to anyone except for someone interested in this particular pMHC interaction, so I would suggest moving it to a supporting figure and making it much larger.

      R2-e. Thanks for the suggestion. We created Figure 2–figure supplement 1 and made panels larger for clearer presentation.

      (2) Figure 4 is overwhelming, and does not convey any particular message.

      R2-f. This is the same comment as reviewer 1’s comment (7) above. Please see our response R1-i.

      Reviewer #3 (Recommendations for the authors):

      (1) The label “beta2m” in Figure 1A should be moved closer to the beta2 microglobulin domain. A label TCR should be added to Figure 1A.

      R3-c. Thanks for pointing out the β2m label; we have corrected it. Regarding the ‘TCR’ label, to avoid clutter we explain in the caption of Figure 1A that Vα, Vβ, Cα, and Cβ are the 4 subdomains of the TCR.

      (2) Hydrogen atoms should be removed from the peptide in Figure 1B.

      R3-d. We have removed the hydrogen atoms.

      (3) The authors should consider moving Figures 1 A-D to the SI and show a simpler description of the contact occupancy than the heat maps. The legend of Figure 2A-D is too small.

      R3-e. By ‘Figures 1 A-D’ we believe the reviewer meant Figure 2A–D. This is the same comment as reviewer 2’s comment (1). Please see our response R2-e above.

      (4) Vertical (dashed) lines should be added to Figure 3E at 500 ns to emphasize the segment of the time series used for the histograms.

      R3-f. We added vertical lines in figures showing time-dependent behaviors, which are Figure 1D, Figure 2B, Figure 2–figure supplement 1F, and Figure 4–figure supplement 1B.

    1. eLife Assessment

      This important study shows a surprising scale-invariance of the covariance spectrum of large-scale recordings in the zebrafish brain in vivo. A convincing analysis demonstrates that a Euclidean random matrix model of the covariance matrix recapitulates these properties. The results provide several new and insightful approaches for probing large-scale neural recordings.

    2. Joint public review

      Summary:

      The authors examine the eigenvalue spectrum of the covariance matrix of neural recordings in the whole-brain larval zebrafish during hunting and spontaneous behavior. They find that the spectrum is approximately power law, and, more importantly, exhibits scale-invariance under random subsampling of neurons. This property is not exhibited by conventional models of covariance spectra, motivating the introduction of the Euclidean random matrix model. The authors show that this tractable model captures the scale invariance they observe. They also examine the effects of subsampling based on anatomical location or functional relationships. Finally, they discuss the benefit of neural codes which can be subsampled without significant loss of information.

      Strengths:

      With large-scale neural recordings becoming increasingly common, neuroscientists are faced with the question: how should we analyze them? To address that question, this paper proposes the Euclidean random matrix model, which embeds neurons randomly in an abstract feature space. This model is analytically tractable and matches two nontrivial features of the covariance matrix: approximate power law scaling, and invariance under subsampling. It thus introduces an important conceptual and technical advance for understanding large-scale simultaneously recorded neural activity.

      Comment:

      Are there quantitative comparisons of the collapse indices for the null models in Figure 2 and the data covariance in 2F? If so, this could be potentially useful to report.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Summary:

      The authors examine the eigenvalue spectrum of the covariance matrix of neural recordings in the whole-brain larval zebrafish during hunting and spontaneous behavior. They find that the spectrum is approximately power law, and, more importantly, exhibits scale-invariance under random subsampling of neurons. This property is not exhibited by conventional models of covariance spectra, motivating the introduction of the Euclidean random matrix model. The authors show that this tractable model captures the scale invariance they observe. They also examine the effects of subsampling based on anatomical location or functional relationships. Finally, they briefly discuss the benefit of neural codes which can be subsampled without significant loss of information.

      Strengths:

      With large-scale neural recordings becoming increasingly common, neuroscientists are faced with the question: how should we analyze them? To address that question, this paper proposes the Euclidean random matrix model, which embeds neurons randomly in an abstract feature space. This model is analytically tractable and matches two nontrivial features of the covariance matrix: approximate power law scaling, and invariance under subsampling. It thus introduces an important conceptual and technical advance for understanding large-scale simultaneously recorded neural activity.

      Weaknesses:

      The downside of using summary statistics is that they can be hard to interpret. Often the finding of scale invariance, and approximate power law behavior, points to something interesting. But here caution is in order: for instance, most critical phenomena in neural activity have been explained by relatively simple models that have very little to do with computation (Aitchison et al., PLoS CB 12:e1005110, 2016; Morrell et al., eLife 12, RP89337, 2024). Whether the same holds for the properties found here remains an open question.

      We are grateful for the thorough and constructive feedback provided on our manuscript. We have addressed each point raised by you.

      Regarding the main concern about power law behavior and scale invariance, we would like to clarify that our study does not aim to establish criticality. Instead, we focus on describing and understanding a specific scale-invariant property in terms of collapsed eigenspectra in neural activity. We tested Morrell et al.’s latent-variable model (eLife 12, RP89337, 2024, [1]), where a slowly varying latent factor drives population activity. Although it produces a seemingly power-law-like spectrum, random sampling does not replicate the strict spectral collapse observed in our data (second row in Fig. S23). This highlights that simply adding latent factors does not fully recapitulate the scale invariance we measure, suggesting richer or more intricate processes may be involved in real neural recordings.

      Specifically, we have incorporated five key revisions.

      • As mentioned, we evaluated the latent variable model proposed by Morrell et al. and found that it fails to reproduce the scale-invariant eigenspectra observed in our data; these results are now presented in the Discussion section and supported by a new Supplementary Figure (Fig. S23).

      • We included a comparison with the findings of Manley et al. (2024 [2]) regarding the issue of saturating dimension in the Discussion section, highlighting the methodological differences and their implications.

      • We added a new mathematical derivation in the Methods section, elucidating the bounded dimensionality using the spectral properties of our model.

      • We have added a sentence in the Discussion section to further emphasize the robustness of our findings by demonstrating their consistency across diverse datasets and experimental techniques.

      • We have incorporated a brief discussion on the implications for neural coding (lines 330-332). In particular, Fisher information can become unbounded when the slope of the power-law rank plot is less than one, as highlighted in the recent work by Moosavi et al. (bioRxiv 2024.08.23.608710, Aug, 2024 [3]).

      We believe these revisions address the concerns raised during the review process and collectively strengthen our manuscript, providing a more comprehensive and robust understanding of the geometry and dimensionality of brain-wide activity. We appreciate your consideration of our revised manuscript and look forward to your feedback.

      Recommendations for the authors:

      In particular, in our experience replies to the reviewers are getting longer than the paper, and we (and I’m sure you!) want to avoid that. Maybe just reply explicitly to the ones you disagree with? We’re pretty flexible on our end.

      (1) The main weakness, from our point of view, is whether the finding of scale invariance means something interesting, or should be expected from a null model. We can suggest such a model; if it is inconsistent with the data, that would make the results far more interesting.

      Morrell et al. (eLife 12, RP89337, 2024 [1]) suggest a very simple model in which the whole population is driven by a slowly time-varying quantity. It would be nice to determine whether it matched this data. If it couldn’t, that would add some evidence that there is something interesting going on.

      We appreciate your insightful suggestion to consider the model proposed by Morrell et al. (eLife 12, RP89337, 2024 [1]), where a slowly time-varying quantity drives the entire neural population. We conducted simulations using parameters from Morrell et al. [4, 1], as detailed below.

      Our simulations show that Morrell’s model can replicate a degree of scale invariance when using functional sampling, or RG as it is referred to in Morrell et al., PRL 2021 [4] (FSap, Fig. S23A-D, Author response image 1). However, it fails to fully capture the scale invariance of collapsing spectra that we observed in the data under random sampling (RSap, Fig. S23E-H). This discrepancy suggests that additional dynamics or structures in the neural activity are not captured by this simple model, indicating the presence of potentially novel and interesting features in the data that merit further investigation.

      Unlike random sampling, the collapse of eigenspectra under functional sampling does not require a stringent condition on the kernel function f(x) in our ERM theory (see Discussion line 269-275), potentially explaining the differing results between Fig.S23A-D and Fig.S23E-H.

      We have incorporated these findings into Results section 2.1 (lines 100-101) and the Discussion section (lines 277-282, quoted below):

      “Morrell et al. [4, 1] suggested a simple model in which a slow time-varying factor influences the entire neural population. To explore the effects of latent variables, we assessed if this model explains the scale invariance in our data. The model posits that neural activity is primarily driven by a few shared latent factors. Simulations showed that the resulting eigenspectra differed considerably from our findings (Fig. S23). Although the Morrell model demonstrated a degree of scale invariance under functional sampling, it did not align with the scale-invariant features under random sampling observed in our data, suggesting that this simple model might not capture all crucial features in our observations.”

      Author response image 1:

      Morrell’s latent model. A: We reproduce the results as presented in Morrell et al., PRL 126(11), 118302 (2021) [4]. Parameters are the same as in Fig. S23A. Sampled 16 to 256 neurons. Unlike in our study, the mean eigenvalues are not normalized to one. Dashed line: eigenvalues fitted to a power law. See also Morrell et al. [4] Fig. 1C. Parameters are the same as in Author response image 1. µ is the power law exponent (black) of the fit, which is different from the µ parameter used to characterize the slow decay of the spatial correlation function, but corresponds to the parameter α in our study.

      (2) The quantification of the degree of scale invariance is done using a “collapse index” (CI), which could be better explained/motivated. The fact that the measure is computed only for the non-leading eigenvalues makes sense but it is not clear when originally introduced. How does this measure compare to other measures of the distance between distributions?

      We thank you for raising this important point regarding the explanation and motivation for our Collapse Index (CI). We defined the CI instead of using other measures of distance between distributions for two main reasons. First, the CI provides an intuitive quantification of the shift of the eigenspectrum, motivated by our high-density theory for the ERM model (Eq. 3, Fig. 4A). This high-density theory is only valid for large eigenvalues excluding the leading ones, and hence we compute the CI with a similar restriction on the range of area integration. Second, assessing the collapse with a distribution-based measure (e.g., using a kernel density method to estimate the distribution of eigenvalues and then calculating the KL divergence between the two distributions) requires first estimating the distributions. This estimation step introduces errors, such as inaccuracies in estimating the probability of large eigenvalues.

      We agree that a clearer explanation would enhance the manuscript and thus have made modifications accordingly. The CI is now introduced more clearly in the Results section (lines 145-148) and further detailed in the Methods section (lines 630-636). We have also revised the CI diagram in Fig. 4A to better illustrate the shift concept using a more intuitive cartoon representation.

      (3) The paper focuses on the case in which the dimensionality saturates to a finite value as the number of recorded neurons is increased. It would be useful to contrast with a case in which this does not occur. The paper would be strengthened by a comparison with Manley et al. 2024, which argued that, unlike this study, dimensionality of activity in spontaneously behaving head-fixed mice did not saturate.

      Thank you for highlighting this comparison. We have included a discussion (lines 303-309) comparing our approach with Manley et al. (2024) [2]. While Manley et al. [2] primarily used shared variance component analysis (SVCA) to estimate neural dimensionality, they observed that using PCA led to dimensionality saturation (see Figure S4D, Manley et al. [2]), consistent with our findings (Fig. 2D). We acknowledge the value of SVCA as an alternative approach and agree that it is an interesting avenue for future research. In our study, we chose to use PCA for several reasons. PCA is a well-established and widely trusted method in the neuroscience community, with a proven track record of revealing meaningful patterns in neural data. Its mathematical properties are well understood, making it particularly suitable for our theoretical analysis. While we appreciate the insights that newer methods like SVCA can provide, we believe PCA remains the most appropriate tool for addressing our specific research questions.

      (4) More importantly, we don’t understand why dimensionality saturates. For the rank plot given in Eq. 3,

      where k is rank. Using this, one can estimate sums over eigenvalues by integrals. Focusing on the N-dependence, we have

      This gives

      We don’t think you ever told us what mu/d was (see point 13 below), but in the discussion you implied that it was around 1/2 (line 249). In that case, D<sub>PR</sub> should be approximately linear in N. Could you explain why it isn’t?

      Thank you for your careful derivation. Following the line of calculation you suggested, we have now added a derivation in the Methods (section 4.14.4) that uses the ERM spectrum to estimate the upper bound of the dimension. To deduce D<sub>PR</sub> from the spectrum, we focus on the high-density region, where an analytical expression for large eigenvalues λ is given by:

      Here, d is the dimension of the functional space, L is the linear size of the functional space, ρ is the neuron density, and γ is the coefficient in Eq. (3), which depends only on d, µ and E(σ<sup>2</sup>). The primary difference between your derivation and ours is that the eigenvalue λ<sub>r</sub> decays rapidly after the threshold r = β(N), which significantly affects the summations ∑<sub>r</sub> λ<sub>r</sub> and ∑<sub>r</sub> λ<sub>r</sub><sup>2</sup>. Since we did not discuss the small eigenvalues in the article, we represent them here as an unknown function η(r,N,L).

      The sum ∑<sub>r</sub> λ<sub>r</sub> is the trace of the covariance matrix C. As emphasized in the Methods section, without changing the properties of the covariance spectrum, we always consider a normalized covariance matrix such that the mean neural activity variance E(σ<sup>2</sup>) = 1. Thus

      rather than

      The issue stems from overlooking that Eq. (3) is valid only for large eigenvalues (λ > 1).

      Using the Cauchy–Schwarz inequality, we have an upper bound of

      Conversely, provides a lower bound of :

      As a result, we must have

      In random sampling (RSap), L is fixed. We thus must have a bounded dimensionality that is independent of N for our ERM model. In functional sampling (FSap), L varies while the neuronal density ρ is fixed, leading to a different scaling relationship of the upper bound, see Methods (section 4.14.4) for further discussion.
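
      For reference, the scaling raised in comment (4) can be checked numerically for a pure power-law rank plot with no cutoff. The sketch below assumes λ<sub>k</sub> = (k/N)<sup>−1+µ/d</sup> for every rank k, which is exactly the assumption the response above disputes for small eigenvalues; the function names and the choice µ/d = 0.5 are illustrative only.

```python
import numpy as np

def participation_ratio(lam):
    """D_PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.asarray(lam, dtype=float)
    return lam.sum() ** 2 / np.sum(lam**2)

def power_law_spectrum(n, a):
    """Hypothetical rank plot lambda_k = (k/n)^(-1 + a), with a standing in for mu/d."""
    k = np.arange(1, n + 1)
    return (k / n) ** (-1.0 + a)

# Under this pure power law, D_PR keeps growing with N (roughly N / log N for a = 0.5).
for n in (10**3, 10**4, 10**5):
    print(n, participation_ratio(power_law_spectrum(n, a=0.5)))
```

      The growth seen here disappears once the spectrum is cut off beyond some rank and the trace is fixed by normalization, which is the point of the bound discussed above.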

      (5) The authors work directly with ROIs rather than attempting to separate the signals from each neuron in an ROI. It would be worth discussing whether this has a significant effect on the results.

      We appreciate your thoughtful question on the potential impact of using ROIs. The use of ROIs likely does not impact our key findings since they are validated across multiple datasets with various recording techniques and animal models, from zebrafish calcium imaging to mouse brain multi-electrode recordings (see Figure S2, S24). The consistency of the scale-invariant covariance spectrum in diverse datasets suggests that ROIs in zebrafish data do not significantly alter the conclusions, and they together enhance the generalizability of our results. We highlight this in the Discussion section (lines 319-323).

      (6) Does the Euclidean random matrix model allow the authors to infer the value of D or µ? Since the measured observables only depend on µ/D it seems that one cannot infer the latent dimension where distances between neurons are computed. Are there any experiments that one could, in principle, perform to measure D or mu? Currently the conclusion from the model and data is that D/µ is a large number so that the spectrum is independent of neuron density rho. What about the heterogeneity of the scales σ<sub>i</sub>, can this be constrained by data?

      Measuring d and µ in the ERM Model

      We agree with you that the individual values of d and µ cannot be determined separately from our analysis. In our analysis using the Euclidean Random Matrix (ERM) model, we fit the ratio µ/d, rather than the individual values of d (dimension of the functional space) or µ (exponent of the distance-dependent kernel function). This limitation is inherent because the model’s predictions for observable quantities, such as the distribution of pairwise correlation, are dependent solely on this ratio.
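
      For orientation, the ERM construction referred to here can be sketched in a few lines. This is an illustrative sketch and not the fitting code used in the study: the kernel form f(x) = (ε²/(ε² + x²))<sup>µ/2</sup>, the box size `L`, the value of `eps`, and the log-normal width `sigma_sd` are all assumptions chosen for the example.

```python
import numpy as np

def erm_covariance(n, d=2, mu=0.5, L=10.0, eps=0.03, sigma_sd=0.5, seed=0):
    """Build a covariance matrix from a Euclidean random matrix (ERM) model.

    Neurons are points placed uniformly at random in a d-dimensional box of
    side L; their correlation decays with distance as a slow power law.
    All default parameter values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, L, size=(n, d))
    diff = x[:, None, :] - x[None, :, :]
    dist2 = np.sum(diff**2, axis=-1)
    # Assumed kernel: f(0) = 1 and f(x) ~ |x|^(-mu) for |x| >> eps.
    kernel = (eps**2 / (eps**2 + dist2)) ** (mu / 2)
    # Heterogeneous variances drawn from a log-normal, normalized to mean 1.
    sigma2 = rng.lognormal(mean=0.0, sigma=sigma_sd, size=n)
    sigma2 /= sigma2.mean()
    sigma = np.sqrt(sigma2)
    return sigma[:, None] * kernel * sigma[None, :]

def rank_spectrum(C):
    """Eigenvalues sorted from largest to smallest (the rank plot)."""
    return np.sort(np.linalg.eigvalsh(C))[::-1]

C = erm_covariance(512)
rng = np.random.default_rng(1)
idx = rng.choice(512, size=256, replace=False)   # random sampling of neurons
lam_full = rank_spectrum(C)
lam_half = rank_spectrum(C[np.ix_(idx, idx)])
```

      Plotting `lam_full` and `lam_half` against normalized rank r/N is one way to look at how the spectrum changes under random sampling for a given choice of µ and d.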

      Currently there are no directly targeted experiments to measure d. The dimension of the functional space is largely a theoretical construct: it could serve to represent latent variables encoding cognitive factors that are distributed throughout the brain or specific sensory or motor feature maps within a particular brain region. It may also be viewed as the embedding space to describe functional connectivity between neurons. Thus, a direct experimental measurement of the dimension of the functional space could be challenging. Although there are variations in the biological interpretation of the functional space, the consistent scale invariance observed across various brain regions indicates that the neuronal relationships within the functional space can be described by a uniform slowly decaying kernel function.

      Regarding the Heterogeneity of σ<sub>i</sub>

      The heterogeneity of neuronal activity variances (σ<sub>i</sub>) is a critical factor in our analysis. Our findings indicate that this heterogeneity:

      (1) Enhances scale invariance: The covariance matrix spectrum, which incorporates the heterogeneity of σ<sub>i</sub><sup>2</sup>, exhibits stronger scale invariance compared to the correlation matrix spectrum, which imposes σ<sub>i</sub><sup>2</sup> = 1 for all neurons. This observation is supported by both experimental data and theoretical predictions from the ERM model, particularly in the intermediate density regime.

      (2) Can be constrained by data: We fit a log-normal distribution to the experimentally observed σ<sup>2</sup> values to capture the heterogeneity in our model, which leads to excellent agreement with data (section 4.8.1). Figure S10 provides evidence for this by directly comparing the eigenspectra obtained from experimental data (Fig. S10A-F) with those generated by the fitted ERM model (Fig. S10M-R). These results suggest that the data provide valuable information about the distribution of neuronal activity variances.

      In conclusion, the ERM model and our analysis cannot separately determine d and µ. We also highlight that the neuronal activity variance heterogeneity, constrained by experimental data, plays a crucial role in improving the scale invariance.

      (7) Does the fitting procedure for the positions x in the latent space recover a ground truth in your statistical regime (for the number of recorded neurons)? Suppose you sampled some neurons from a Euclidean random matrix theory. Does the MDS technique the authors use recover the correct distances?

      When sampling neurons from a Euclidean random matrix model, we demonstrated numerically that the MDS technique can accurately recover the true distances, provided that the true f(x) is known. To quantify the precision of recovery, we applied the CCA analysis (Section 4.9) and compared the true coordinates from the original Euclidean random matrix with the fitted coordinates obtained through our MDS procedure. The CCA correlation between the true and fitted coordinates in each spatial dimension is nearly 1 (the difference from 1 is less than 10<sup>−7</sup>). When fitting experimental data, one source of error arises from parameter estimation. To evaluate this, we assess the estimation error of the fitted parameters. When we choose µ = 0.5 in our ERM model and then fit the distribution of the pairwise correlation (Eq. 21), the estimated value of µ is 0.503 ± 0.007 (standard deviation). Then, we use the MDS-recovered distances to fit the coordinates with the fitted kernel function, which is determined by the fitted parameter. The CCA correlation between the true and fitted coordinates in each direction remains nearly 1 (the difference from 1 is less than 10<sup>−5</sup>).
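
      The distance-recovery step can be illustrated with classical (Torgerson) MDS on noiseless distances. This is a generic sketch and an assumption about the flavor of MDS; it is not the authors' fitting pipeline, and it compares pairwise distances directly rather than running CCA, because recovered coordinates are only defined up to rotation, reflection, and translation.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance matrix for points stored as rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def classical_mds(dist, d):
    """Embed points in d dimensions from a distance matrix (classical MDS)."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (dist**2) @ J              # double-centered Gram matrix
    w, v = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:d]             # d largest eigenvalues
    return v[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

rng = np.random.default_rng(0)
x_true = rng.uniform(0.0, 10.0, size=(200, 2))    # hypothetical ground-truth positions
x_fit = classical_mds(pairwise_dist(x_true), d=2)
max_err = np.abs(pairwise_dist(x_fit) - pairwise_dist(x_true)).max()
```

      With exact Euclidean distances as input, `max_err` is limited only by floating-point error; with distances inferred from noisy correlations, the error would also reflect the parameter-estimation step described above.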

      (8) l. 49: “... both the dimensionality and covariance spectrum remain invariant ...”. Just to be clear, if the spectrum is invariant, then the dimensionality automatically is too. Correct?

      Thanks for the question. In fact, there is no direct causal relationship between eigenvalue spectrum invariance and dimensionality invariance, as we elaborate below and in the discussion added in lines 311-317. For eigenvalue spectrum invariance, we focus on the large eigenvalues, whereas dimensionality invariance considers the second-order statistics of all eigenvalues. Consequently, the invariance results for these two concepts may differ. Dimensional and spectral invariance also have different requirements:

      (1) The condition for dimensional saturation is finite mean square covariance

      The participation ratio D<sub>PR</sub> for random sampling (RSap) is given by Eq. 5:

      This expression becomes invariant as N → ∞ if the mean square covariance is finite. In contrast, neural dynamics models, such as the balanced excitatory-inhibitory (E-I) neural network [5], exhibit a different behavior, where , leading to unbounded dimensionality (see discussion lines 291-295, section 6.9 in SI).

      (2) The requirements for spectral invariance involving the kernel function

      In our Euclidean Random Matrix (ERM) model, the eigenvalue distribution follows:

      For spectral invariance to emerge, the eigenvalue distribution must remain unchanged after sampling. Since sampling reduces the neuronal density ρ, the ratio µ/d must approach 0 to maintain invariance.

      We can also demonstrate that D<sub>PR</sub> is independent of density ρ in the large N limit (see the answer to question 4).

      In conclusion, there is no causal relationship between spectral invariance and dimensionality invariance. This is also the reason why we need to consider both properties separately in our analysis.

      (9) In Eq. 1, the exact expression, which includes i=j, isn’t a lot harder than the one with i=j excluded. So why i≠j?

      The choice is for illustration purposes. In Eq. 1, we wanted to demonstrate that the dimension saturates to a value independent of N. When dividing the numerator and denominator of this expression by N<sup>2</sup>, the term associated with the off-diagonal entries is independent of the neuron number N, but the term associated with the diagonal entries is of order O(1/N) and can be ignored for large N.
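
      The split into diagonal and off-diagonal contributions can be made concrete with a short numerical check. The identity used below, D<sub>PR</sub> = Tr(C)<sup>2</sup>/Tr(C<sup>2</sup>) = N E(σ<sup>2</sup>)<sup>2</sup> / (E(σ<sup>4</sup>) + (N − 1) E<sub>i≠j</sub>(C<sub>ij</sub><sup>2</sup>)), is exact algebra; whether it matches the precise form of Eq. 1 or Eq. 5 in the manuscript is an assumption, and the test data here are synthetic.

```python
import numpy as np

def dpr_from_eigenvalues(C):
    """Participation ratio from the eigenvalue spectrum."""
    lam = np.linalg.eigvalsh(C)
    return lam.sum() ** 2 / np.sum(lam**2)

def dpr_from_entries(C):
    """Same quantity written with diagonal and off-diagonal statistics."""
    n = C.shape[0]
    diag = np.diag(C)
    off = C[~np.eye(n, dtype=bool)]
    mean_var = diag.mean()           # E(sigma^2)
    mean_var_sq = np.mean(diag**2)   # E(sigma^4), the diagonal contribution
    mean_off_sq = np.mean(off**2)    # E_{i != j}(C_ij^2), the off-diagonal contribution
    # After dividing through by N^2, the diagonal term is suppressed by 1/N
    # relative to the off-diagonal term, so D_PR -> mean_var^2 / mean_off_sq.
    return n * mean_var**2 / (mean_var_sq + (n - 1) * mean_off_sq)

rng = np.random.default_rng(0)
# Synthetic data: 300 time samples of 50 correlated "neurons".
X = rng.standard_normal((300, 50)) @ rng.standard_normal((50, 50))
C = np.cov(X, rowvar=False)
```

      Both functions return the same number for any symmetric covariance matrix, so dropping the i = j terms changes the result only at order 1/N.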

      (10) Fig. 2D: Could you explain where the theory line comes from?

      We first estimate the covariance statistics that enter Eq. 5 from all neurons, and then compute D<sub>PR</sub> for different neuron numbers N using Eq. 5. This is further clarified in lines 511-512.

      (11) l 94-5: “It [scale invariance] is also absent when replacing the neural covariance matrix eigenvectors with random ones, keeping the eigenvalues identical (Fig. 2H).” If eigenvalues are identical, why does the spectrum change?

      The eigenspectra of the full-size covariance matrices are the same by construction, but the eigenspectra of the sampled covariance matrices are different because the eigenvectors affect the sampling results. Please also refer to the construction process described in section 4.3 where this is also discussed: “The composite covariance matrix with substituted eigenvectors in (Fig. 2H) was created as described in the following steps. First, we generated a random orthogonal matrix U<sub>r</sub> (based on the Haar measure) for the new eigenvectors. This was achieved by QR decomposition A=U<sub>r</sub>R of a random matrix A with i.i.d. entries A<sub>ij</sub> ∼ N(0, 1/N). The composite covariance matrix C<sub>r</sub> was then defined as C<sub>r</sub> = U<sub>r</sub>ΛU<sub>r</sub><sup>T</sup>, where Λ is a diagonal matrix that contains the eigenvalues of C. Note that since all the eigenvalues are real and U<sub>r</sub> is orthogonal, the resulting C<sub>r</sub> is a real and symmetric matrix. By construction, C<sub>r</sub> and C have the same eigenvalues, but their sampled eigenspectra can differ.”
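      The construction quoted above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the toy two-group covariance, the sizes N, T and k, the random subset, and the helper name sampled_spectrum are all hypothetical and are not taken from the manuscript. The sign correction after the QR step is a common refinement for obtaining a Haar-distributed matrix and is likewise an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 200, 400, 50  # neurons, time points, sampled subset size (hypothetical)

# Toy activity with structured eigenvectors: two groups of neurons, each
# driven by its own shared latent signal plus private noise.
latents = rng.standard_normal((2, T))
X = rng.standard_normal((N, T))
X[: N // 2] += 2.0 * latents[0]
X[N // 2 :] += 2.0 * latents[1]
C = np.cov(X)
lam = np.linalg.eigvalsh(C)  # eigenvalues of C, ascending order

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix
# with i.i.d. N(0, 1/N) entries.
A = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
U_r, R = np.linalg.qr(A)
U_r = U_r * np.sign(np.diag(R))  # sign fix (assumed refinement for Haar measure)

# Composite matrix: original eigenvalues on random eigenvectors.
C_r = U_r @ np.diag(lam) @ U_r.T
C_r = (C_r + C_r.T) / 2  # remove rounding asymmetry


def sampled_spectrum(cov, idx):
    """Eigenvalues (descending) of the sub-covariance of the sampled neurons."""
    return np.linalg.eigvalsh(cov[np.ix_(idx, idx)])[::-1]


idx = rng.choice(N, size=k, replace=False)  # one random sample of k neurons
print("full spectra equal:", np.allclose(np.linalg.eigvalsh(C_r), lam))
print("top sampled eigenvalues, C  :", sampled_spectrum(C, idx)[:3])
print("top sampled eigenvalues, C_r:", sampled_spectrum(C_r, idx)[:3])
```

      The full spectra agree by construction, whereas the spectra of the sampled sub-matrices generally do not, because the sub-matrix depends on how the eigenvectors distribute their weight over the sampled neurons.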

      (12) Eq 3: There’s no dependence on the distribution of sigma. Is that correct?

      Indeed, this is true in the high-density regime when the neuron density ρ is large. The p(λ) depends only on E(σ<sup>2</sup>) rather than the distribution of σ (see Eq. 8). However, in the intermediate density regime, p(λ) depends on the distribution of σ (see Eq.9 and Eq.10). In our analysis, we consider E(σ<sup>4</sup>) as a measure of heterogeneity.

      (13) Please tell us the best fit values of µ/d.

      This information is now added in the figure caption of Fig S10: µ/d = [0.456, 0.258, 0.205, 0.262, 0.302, 0.308] in fish 1-6.

      (14) l 133: ”The eigenspectrum is rho-independent whenever µ/d ≈ 0.”

      It looks to me like rho sets the scale but not the shape. Correct? If so, why do we care about the overall scale – isn’t it the shape that’s important?

      Yes, our study focuses on the overall scale and not only the shape, because many models, such as the ERM with other kernel functions, random RNNs, and Morrell’s latent model [4, 1], can exhibit a power-law spectrum. However, these models do not exhibit scale invariance in the sense of collapsing spectrum curves. Therefore, considering the overall scale reveals an additional non-trivial phenomenon.

      (15) Figs. 3 and 4: Are the grey dots the same as in previous figures? Either way, please specify what they are in the figure caption.

      Yes, they are the same, and thank you for pointing it out. It has been specified in the figure caption now.

      (16) Fig. 4B: Top is correlation matrix, bottom is covariance matrix, correct? If so, that should be explicit. If not, it should be clear what the plots are.

      That is correct. Both matrices (correlation - top, covariance - bottom) are labeled in the figure caption and plot (text in the lower left corner).

      (17) l 158: ”First, the shape of the kernel function f(x) over a small distance ...”. What does ”over a small distance” mean?

      We thank you for seeking clarification on this point. We understand that the phrase ”over a small distance” could be made clearer. We have revised the explanation in lines 164-165. Here, “over a small distance” refers to modifications of the particular kernel function f(x) we use (Eq. 11) near x = 0 in the functional space, while preserving the overall power-law decay at larger distances. The t-distribution based f(x) (Eq. 11) has a natural parameter ϵ that describes the transition to near 0. So we modified f(x) in different ways, all within this interval of |x| ≤ ϵ, and considered different values of ϵ. Table S3 and Figure S7 provide a summary of these modifications. Figure S7 visually compares these modifications to the standard power-law kernel function, highlighting the differences in shape near x = 0.

      Our findings indicate that these alterations to the kernel function at small distances do not significantly affect the distribution of large eigenvalues in the covariance spectrum. This supports our conclusion that the large eigenvalues are primarily determined by the slow decay of the kernel function at larger distances in the functional space, as this characteristic governs the overall correlations in neural activity.

      (18) l390 . This x<sub>i</sub> is, we believe, different from the x<sub>i</sub> which is position in feature space. Given the difficulty of this paper, it doesn’t help to use the same symbol to mean two different things. But maybe we’re wrong?

      Thank you for your careful reading and suggestion. Indeed here x<sub>i</sub> was representing activity rather than feature space position. We have thus revised the notation (Line 390 has been updated to line 439 as well.):

      In this revised notation: a<sub>i</sub>(t) represents the neural activity of neuron i at time t (typically the firing rate we infer from calcium imaging). is simply the mean activity of neuron i across time. Meanwhile, we’ll keep x<sub>i</sub> exclusively for denoting positions in the functional space.

      This change should make it much easier to distinguish between neural activity measurements and spatial coordinates in the functional space.

      (19) Eq. 19: is it correct that g(u) is not normalized to 1? If so, does that matter?

      It is correct that the approximation of g(u) is not normalized to 1, as Eq. 19 provides an approximation suitable only for small pairwise distances (i.e., large correlation). Therefore, we believe this does not pose an issue. We have newly added this note in lines 691-693.

      (20) I get a different answer in Eq. 20:

      Whereas in Eq. 20,

      µ

      Which is correct?

      Thank you for your careful derivation. We believe the difference arises in the calculation of g(u). In our calculations:

      ,

      (Your first equation seems to have missed a 1/µ in R’s exponent.)

      ,

      That is, Eq. 20 is correct. From these, we obtain

      rather than

      We hope this clarifies the question.

      (21) I’m not sure we fully understand the CCA analysis. First, our guess as to what you did: After sampling (either Asap or Fsap), you used ERM to embed the neurons in a 2-D space, and then applied canonical correlation analysis (CCA). Is that correct? If so, it would be nice if that were more clear.

      We first used ERM to embed all the neurons in a 2-D functional space, before any sampling. Once we have the embedding, we can quantify how similar the functional coordinates are with the anatomical coordinates using R<sub>CCA</sub> (section 2.4). We can then use the anatomical and functional coordinates to perform ASap and FSap, respectively. Our theory in section 2.4 predicts the effect on dimension under these samplings given the value of R<sub>CCA</sub> estimated earlier (Fig. 5D). The detailed description of the CCA analysis is in section 4.9, where we explain how CCA is used to find the axes in both anatomical and functional spaces that maximize the correlation between projections of neuron coordinates.

      As to how you sampled under Fsap, I could not figure that out – even after reading supplementary information. A clearer explanation would be very helpful.

      Thank you for your feedback. Functional sampling (FSap) entails the expansion of regions of interest (ROIs) within the functional space, as illustrated in Figure 5A, concurrently with the calculation of the covariance matrix for all neurons contained within the ROI. Technically, we implemented the sampling using the RG approach [6], which is further elaborated in Section 4.12 (lines 852-899), quoted below.

      Stage (i): Iterative Clustering We begin with N<sub>0</sub> neurons, where N<sub>0</sub> is assumed to be a power of 2. In the first iteration, we compute Pearson’s correlation coefficients for all neuron pairs. We then search greedily for the most correlated pairs and group the half pairs with the highest correlation into the first cluster; the remaining neurons form the second cluster. For each pair (a,b), we define a coarse-grained variable according to:

      ,

      Where normalizes the average to ensure unit nonzero activity. This process reduces the number of neurons to N<sub>1</sub> = N<sub>0</sub>/2. In subsequent iterations, we continue grouping the most correlated pairs of the coarse-grained neurons, iteratively reducing the number of neurons by half at each step. This process continues until the desired level of coarse-graining is achieved.

      When applying the RG approach to ERM, instead of combining neural activity, we merge correlation matrices to traverse different scales. During the kth iteration, we compute the coarse-grained covariance as:

      and the variance as:

      Following these calculations, we normalize the coarse-grained covariance matrix to ensure that all variances are equal to one. Note that these coarse-grained covariances are only used in stage (i) and not used to calculate the spectrum.

      Stage (ii): Eigenspectrum Calculation The calculation of eigenspectra at different scales proceeds through three sequential steps. First, for each cluster identified in Stage (i), we compute the covariance matrix using the original firing rates of neurons within that cluster (not the coarse-grained activities). Second, we calculate the eigenspectrum for each cluster. Finally, we average these eigenspectra across all clusters at a given iteration level to obtain the representative eigenspectrum for that scale.

      In stage (ii), we calculate the eigenspectra of the sub-covariance matrices across different cluster sizes as described in [6]. Let N<sub>0</sub> = 2<sup>n</sup> be the original number of neurons. To reduce it to size N = N<sub>0</sub>/2<sup>k</sup> = 2<sup>n-k</sup>, where k is the kth reduction step, consider the coarse-grained neurons in step n-k in stage (i). Each coarse-grained neuron is a cluster of 2<sup>n-k</sup> neurons. We then calculate the spectrum of the block of the original covariance matrix corresponding to the neurons of each cluster (there are 2<sup>k</sup> such blocks). Lastly, an average of these 2<sup>k</sup> spectra is computed.

      For example, when reducing from N<sub>0</sub> = 2<sup>3</sup> = 8 to N = 2<sup>3−1</sup> = 4 neurons (k = 1), we would have two clusters of 4 neurons each. We calculate the eigenspectrum for each 4x4 block of the original covariance matrix, then average these two spectra together. To better understand this process through a concrete example, consider a hypothetical scenario where a set of eight neurons, labeled 1,2,3,...,7,8, are subjected to a two-step clustering procedure. In the first step, neurons are grouped based on their maximum correlation pairs, for example, resulting in the formation of four pairs: {1,2},{3,4},{5,6}, and {7,8} (see Fig. S22). Subsequently, the neurons are further grouped into two clusters based on the results of the RG step mentioned above. Specifically, if the correlation between the coarse-grained variables of the pair {1,2} and the pair {3,4} is found to be the largest among all other pairs of coarse-grained variables, the first group consists of neurons {1,2,3,4}, while the second group contains neurons {5,6,7,8}. Next, take the size of the cluster N = 4 for example. The eigenspectra of the covariance matrices of the four neurons within each cluster are computed. This results in two eigenspectra, one for each cluster. The correlation matrices used to compute the eigenspectra of different sizes do not involve coarse-grained neurons. It is the real neurons 1,2,3,...,7,8, but with expanding cluster sizes. Finally, the average of the eigenspectra of the two clusters is calculated.
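      The eight-neuron example can be sketched in code. This is a hedged illustration rather than the authors' implementation: the function names (greedy_pairs, rg_clusters, mean_cluster_spectrum), the synthetic data, and the random seed are hypothetical. Two simplifications are assumed: paired activities are simply summed without the normalization factor (Pearson correlations are unaffected by rescaling, so the pairing is unchanged), and neurons are indexed from 0 rather than 1.

```python
import numpy as np


def greedy_pairs(corr):
    """Greedily pair variables, most correlated pair first."""
    n = corr.shape[0]
    iu = np.triu_indices(n, k=1)
    order = np.argsort(corr[iu])[::-1]  # pairs sorted by decreasing correlation
    used = np.zeros(n, dtype=bool)
    pairs = []
    for a, b in zip(iu[0][order], iu[1][order]):
        if not used[a] and not used[b]:
            pairs.append((a, b))
            used[a] = used[b] = True
    return pairs


def rg_clusters(X, n_steps):
    """Stage (i): return clusters of original neuron indices after n_steps pairings."""
    clusters = [[i] for i in range(X.shape[0])]
    Y = X.copy()
    for _ in range(n_steps):
        pairs = greedy_pairs(np.corrcoef(Y))
        Y = np.stack([Y[a] + Y[b] for a, b in pairs])  # coarse-grained variables
        clusters = [clusters[a] + clusters[b] for a, b in pairs]
    return clusters


def mean_cluster_spectrum(X, clusters):
    """Stage (ii): average the spectra of the original covariance blocks."""
    C = np.cov(X)  # covariance of the original (not coarse-grained) activity
    spectra = [np.linalg.eigvalsh(C[np.ix_(c, c)])[::-1] for c in clusters]
    return np.mean(spectra, axis=0)


# Synthetic activity for 8 neurons: pairs share a strong latent signal,
# groups of four share a weaker one (all values hypothetical).
rng = np.random.default_rng(1)
T = 2000
group = np.repeat(rng.standard_normal((2, T)), 4, axis=0)
pair = np.repeat(rng.standard_normal((4, T)), 2, axis=0)
X = group + 2.0 * pair + rng.standard_normal((8, T))

clusters = rg_clusters(X, n_steps=2)  # two clusters of 4 neurons each
print(sorted(sorted(c) for c in clusters))
print(mean_cluster_spectrum(X, clusters))
```

      The coarse-grained variables are used only to decide which neurons belong together; the spectra themselves are computed from blocks of the original covariance matrix, as the response emphasizes.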

      (22) Line 37: ”even if two cell assemblies have the same D<sub>PR</sub>, they can have different shapes.” What is meant by shape here isn’t clear.

      Thank you for pointing out this potential ambiguity. The “shape” here refers to the geometric configuration of the neural activity space characterized as a high-dimensional ellipsoid by the covariance. Specifically, if we denote the eigenvalues of the covariance matrix as λ<sub>1</sub>,λ<sub>2</sub>,...,λ<sub>N</sub>, then √λ<sub>i</sub> corresponds to the length of the i-th semi-axis of this ellipsoid (Figure 1B). As shown in Figure 1C, two neural populations with the same dimensionality (D<sub>PR</sub> = 25/11 ≈ 2.27) exhibit different eigenvalue spectra, leading to differently shaped ellipsoids. This clarification is now included in lines 39-40.
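      A short numerical check makes the point concrete. The sketch below assumes the standard participation-ratio definition D<sub>PR</sub> = (Σλ<sub>i</sub>)<sup>2</sup>/Σλ<sub>i</sub><sup>2</sup>; the two spectra are hypothetical examples chosen to give 25/11 and are not necessarily those plotted in Figure 1C.

```python
import numpy as np


def participation_ratio(eigenvalues):
    """D_PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / np.sum(lam ** 2)


# First hypothetical spectrum: sum = 5, sum of squares = 11.
spec_a = np.array([3.0, 1.0, 1.0])

# Second hypothetical spectrum with the same sum and sum of squares:
# fix the smallest eigenvalue c and solve for the other two (a, b).
c = 0.5
s, q = 5.0 - c, 11.0 - c ** 2   # a + b and a^2 + b^2
p = (s ** 2 - q) / 2            # a * b
disc = np.sqrt(s ** 2 - 4 * p)
spec_b = np.array([(s + disc) / 2, (s - disc) / 2, c])

for spec in (spec_a, spec_b):
    # Semi-axis lengths of the ellipsoid are the square roots of the eigenvalues.
    print(participation_ratio(spec), np.sqrt(spec))
```

      Both spectra give the same D<sub>PR</sub> while their semi-axis lengths differ, which is the sense in which two populations of equal dimensionality can have different shapes.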

      (23) Please discuss if any information about the latent dimension or kernel function can be inferred from the measurements.

      Same as comment (6): we would like to clarify that in our analysis using the Euclidean Random Matrix (ERM) model, we fit the ratio µ/d, rather than the individual values of d (dimension of the functional space) or µ (exponent of the distance-dependent kernel function). This limitation is inherent because the model’s predictions for observable quantities, such as the eigenvalue spectrum of the covariance matrix, depend solely on this ratio.

      For the kernel function, once d is chosen, we can infer the general shape of the kernel function from data (Figs S12 and S13), up to a certain extent (see also lines 164-166). In particular, we can compare the eigenspectrum of the simulation results for different kernel functions with the eigenspectrum of our data. This allows us to qualitatively exclude certain kernel functions, such as the exponential and Gaussian kernels (Fig. S4), which show clear differences from our data.

      References

      (1) M. C. Morrell, I. Nemenman, A. Sederberg, Neural criticality from effective latent variables. eLife 12, RP89337 (2024).

      (2) J. Manley, S. Lu, K. Barber, J. Demas, H. Kim, D. Meyer, F. M. Traub, A. Vaziri, Simultaneous, cortex-wide dynamics of up to 1 million neurons reveal unbounded scaling of dimensionality with neuron number. Neuron (2024).

      (3) S. A. Moosavi, S. S. R. Hindupur, H. Shimazaki, Population coding under the scale-invariance of high-dimensional noise (2024).

      (4) M. C. Morrell, A. J. Sederberg, I. Nemenman, Latent dynamical variables produce signatures of spatiotemporal criticality in large biological systems. Physical Review Letters 126, 118302 (2021).

      (5) A. Renart, J. De La Rocha, P. Bartho, L. Hollender, N. Parga, A. Reyes, K. D. Harris, The asynchronous state in cortical circuits. Science 327, 587–590 (2010).

      (6) L. Meshulam, J. L. Gauthier, C. D. Brody, D. W. Tank, W. Bialek, Coarse graining, fixed points, and scaling in a large population of neurons. Physical Review Letters 123, 178103 (2019).

    1. eLife Assessment

      This study presents valuable insights into the involvement of miR-26b in the progression of metabolic dysfunction-associated steatohepatitis (MASH). The delivery of microRNA-containing nanoparticles to reduce MASH severity has practical implications as a therapeutic strategy. The authors used two sets of transgenic mouse models, conducted kinase activity profiling of mouse liver samples, and supplemented their findings with additional experiments on human liver and plasma, providing solid support for their findings.

    2. Reviewer #1 (Public review):

      Based on previous publications suggesting a potential role for miR-26b in the pathogenesis of metabolic dysfunction-associated steatohepatitis (MASH), the researchers aim to clarify its function in hepatic health and explore the therapeutic potential of lipid nanoparticles (LNPs) to treat this condition. First, they employed both whole-body and myeloid cell-specific miR-26b KO mice and observed elevated hepatic steatosis features in these mice compared to WT controls when subjected to a Western-type diet (WTD). Moreover, livers from whole-body miR-26b KO mice also displayed increased levels of inflammation and fibrosis markers. Kinase activity profiling analyses revealed distinct alterations, particularly in kinases associated with inflammatory pathways, in these samples. Treatment with LNPs containing miR-26b mimics restored lipid metabolism and kinase activity in these animals. Finally, similar anti-inflammatory effects were observed in the livers of individuals with cirrhosis, whereas elevated miR-26b levels were found in the plasma of these patients in comparison with healthy controls. Overall, the authors conclude that miR-26b plays a protective role in MASH and that its delivery via LNPs efficiently mitigates MASH development.

      The study has some strengths, most notably its use of a combination of animal models, analyses of potential underlying mechanisms, and innovative treatment delivery methods with significant promise. However, it also presents certain weaknesses that could be improved. The precise role of miR-26b in a human context remains elusive, hindering direct translation to clinical practice.

      Comments on revised version:

      Some of the recommendations provided by this Reviewer in the first version of the manuscript have been successfully addressed in the revision. However, others, particularly those related to human translation, remain unresolved due to the lack of additional samples for analysis. Since the revised title now indicates that the mechanisms described were primarily observed in mice, it seems reasonable to defer addressing this issue to future studies.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript by Peters, Rakateli et al. aims to characterize the contribution of miR-26b in a mouse model of metabolic dysfunction-associated steatohepatitis (MASH) generated by a Western-type diet on an Apoe knock-out background. In addition, the authors provide a rescue of miR-26b using lipid nanoparticles (LNPs), with potential therapeutic implications. They also provide useful insights into the role of macrophages and some validation of the effect of miR-26b LNPs on human liver samples.

      Strengths:

      The authors provide a well-designed mouse model that aims to characterize the role of miR-26b in metabolic dysfunction-associated steatohepatitis (MASH), generated by a Western-type diet on an Apoe knock-out background. The rescue of the model's phenotypes by miR-26b delivered in lipid nanoparticles (LNPs) points to novel potential therapeutic avenues.

      Weaknesses:

      Although the authors provide a new and interesting avenue to understand the role of miR-26b in MASH, the study needs some additional validations and mechanistic insights in order to strengthen the authors' conclusions.

      (1) Analysis of the expression of miRNAs based on miRNA-seq of human samples (see https://ccb-compute.cs.uni-saarland.de/isomirdb/mirnas) suggests that miR-26b-5p is highly abundant in both liver and blood. It seems hard to reconcile that, despite miRNA abundance being similar in both tissues, the physiological effects claimed by the authors in Figure 2 come exclusively from the myeloid compartment (macrophages).

      - Thanks for the clarification provided on your revised version of the manuscript

      (2) Similarly, the miRNA-seq expression from isomirdb also suggests that expression of miR-26a-5p is in fact 4-fold higher than that of miR-26b-5p in both liver and blood. Since both miRNAs share the same seed sequence and most of the supplemental regions (only 2 nt difference), their endogenous targets must overlap substantially. It would be interesting to know whether deletion of miR-26b is somehow compensated by increased expression of the miR-26a-5p loci. That would suggest that the model is rather a depletion of miR-26.

      UUCAAGUAAUUCAGGAUAGGU mmu-miR-26b-5p mature miRNA<br /> UUCAAGUAAUCCAGGAUAGGCU mmu-miR-26a-5p mature miRNA
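      The sequence comparison behind this comment can be checked with a few lines of code. This is a minimal sketch: the two sequences are copied from the lines above, while the seed definition (nucleotides 2-8) and the simple ungapped, position-by-position comparison are assumptions made for illustration.

```python
mir26b = "UUCAAGUAAUUCAGGAUAGGU"   # mmu-miR-26b-5p, as listed above
mir26a = "UUCAAGUAAUCCAGGAUAGGCU"  # mmu-miR-26a-5p, as listed above

# Seed region, assumed here to be nucleotides 2-8 (1-based).
seed_b, seed_a = mir26b[1:8], mir26a[1:8]

# Ungapped comparison over the shared length; positions are 1-based.
mismatches = [
    (i + 1, x, y)
    for i, (x, y) in enumerate(zip(mir26b, mir26a))
    if x != y
]

print("seeds identical:", seed_a == seed_b, seed_b)
print("mismatched positions (pos, 26b, 26a):", mismatches)
print("length difference:", len(mir26a) - len(mir26b))
```

      Under these assumptions the seeds are identical and the mature sequences differ at two positions, with miR-26a-5p one nucleotide longer at the 3' end.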

      - Thanks for the clarification provided. Nevertheless, I would note that measurements of the host transcript can be difficult to interpret. The processing of the hairpin by Drosha results in rapid decay of the remaining non-hairpin part, usually yielding very low expression levels. The mature levels of miR-26a-5p could be more accurate.

      (3) Similarly, the miRNA-seq expression from isomirdb also suggests that expression of miR-26b-5p is in fact 50-fold higher than that of miR-26b-3p in liver and blood. Such a difference in abundance between the two strands is usually taken to indicate that one of them is the guide strand (in this case the 5p) and the other the passenger (in this case the 3p). In some cases, passenger strands can be a byproduct of miRNA biogenesis, thus the rescue experiments using LNPs with both strands in equimolar amounts would not reflect the physiological abundance of miR-26b-3p. The non-physiological overabundance of miR-26b-3p would constitute a source of undesired off-targets.

      - I agree with the authors that the functional data do not show evidence of undesired off-targets. Nevertheless, I would consider this for future studies. miRNA phenotypes can be subtle in normal conditions and become more obvious under stressed conditions; the same might apply to off-target effects.

      (4) It would also be valuable to check the miRNA levels on the liver upon LNP treatment, or at least the signatures of miR-26b-3p and miR-26b-5p activity using RNA-seq on the RNA samples already collected.

      - Thanks for providing the miRNA quantification on the revised version of the manuscript.

      (5) Some of the phenotypes described, such as the increase in cholesterol, overlap with the previous publication van der Vorst et al. BMC Genom Data (2021), even though in this case the authors use an Apoe knock-out background and a Western-type diet. I would encourage the authors to investigate further or discuss why the initial phenotypes don't become more obvious despite the stressors added in the current manuscript.

      - Thanks for the clarification provided on your revised version of the manuscript.

      (6) The authors have focused part of their analysis on a few gene markers that show relatively modest changes. Deeper characterization using RNA-seq might reveal other genes that are more profoundly impacted by miR-26 depletion. It would strengthen the proposed conclusions if the authors validated that changes in mRNA abundance (Sra, Cd36) do impact protein abundance. These relatively small changes or trends in mRNA expression might not translate into changes in protein abundance.

      - Thanks for addressing this concern raised by R1 and R2.

      (7) In figures 5 and 7, the authors run a phosphorylation array (STK) to analyze the changes in the activity of the kinome. It seems that a relatively large number of signaling pathways are altered; I think this should be strengthened by further validation by Western blot on the collected tissue samples. For quite a few of the kinases there might be antibodies that recognise phosphorylation. The two figures lack a mechanistic connection to the rest of the manuscript.

      - I appreciate the clarification provided by the authors regarding the difference between the activity assay and a Western blot for phosphorylated proteins. Is any orthogonal technique available to validate the PamGene activity assay?

      Comments on revised version:

      The authors have addressed most of the changes suggested by R1 and R2.

    1. eLife Assessment

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studied in isolation. A combination of analytical calculations and direct numerical simulations provides convincing evidence that the interplay of these inhibitory circuits can separately control network gain and stability.

    2. Reviewer #1 (Public review):

      Summary:

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studied in isolation. A combination of analytical calculations and direct numerical simulations provides convincing evidence that the interplay of these inhibitory circuits can separately control network gain and stability.

      Strengths

      The paper has improved in revision, and the intuitive summary statements added to the end of each results section are quite helpful. The addition of numerical simulations to extend the conclusions beyond the linear range of network behavior is also quite helpful.

      Weaknesses

      None

    3. Reviewer #2 (Public review):

      Summary:

      Bos and colleagues address the important question of how two major inhibitory interneuron classes in the neocortex differentially affect cortical dynamics. They address this question by studying Wilson-Cowan-type mathematical models. Using a linearized fixed point approach, and subsequent simulations of neural circuits operating in the dynamic stochastically-driven regime, they provide compelling evidence that the existence of multiple interneuron classes can explain the counterintuitive finding that inhibitory modulation can increase the gain of the excitatory cell population while also increasing the stability of the circuit's state to minor perturbations. This effect depends on the connection strengths within their circuit model, providing important guidance as to when and why it arises.

      Overall, I find this study to have substantial merit. The authors have also done a commendable job of revising the paper in light of the critiques raised by myself and the other reviewers.

      Strengths:

      (1) The thorough investigation of how changes in the connectivity structure affect the gain-stability relationship is a major strength of this work. It provides an opportunity to understand when and why gain and stability will or will not both increase together. It also provides a nice bridge to the experimental literature, where different gain-stability relationships are reported from different studies.

      (2) The simplified and abstracted mathematical model has the benefit of facilitating our understanding of this puzzling phenomenon. It is not easy to find the right balance between biologically-detailed models vs simple but mathematically tractable ones, and I think the authors struck an excellent balance in this study.

      (3) While the fixed-point analysis has potentially substantial limitations for understanding cortical computations away from the steady-state, the authors used simulations to verify that their main findings hold in the stochastically-driven regime that more closely reflects the dynamics observed in in vivo neuroscience experiments.

      Weaknesses:

      (1) As the authors note in their Discussion, it would be worthwhile to study this effect in chaotic and/or oscillatory regimes, in addition to the ones they included here. I agree with their assessment that those investigations should be left for a future study.

      (2) The analysis is limited to paths within this simple E,PV,SOM circuit. This misses more extended paths (like thalamocortical loops) that involve interactions between multiple brain areas. Including those paths in the expansion in Eqs. 11-14 (Fig. 1C) may be an important direction for future work.

    4. Reviewer #3 (Public review):

      Summary:

      Bos et al study a computational model of cortical circuits with excitatory (E) and two subtypes of inhibition - parvalbumin (PV) and somatostatin (SOM) expressing interneurons. They perform stability and gain analysis of simplified models with nonlinear transfer functions when SOM neurons are perturbed. Their analysis suggests that in a specific setup of connectivity, instability and gain can be untangled, such that SOM modulation leads to increases in both stability and gain, in contrast to the typical direction in neuronal networks where increased gain results in decreased stability.

      Strengths:

      - Analysis of the canonical circuit in response to SOM perturbations. Through numerical simulations and mathematical analysis, the authors have provided a rather comprehensive picture of how SOM modulation may affect response changes.<br /> - Shedding light on two opposing circuit motifs involved in the canonical E-PV-SOM circuitry - namely, direct inhibition (SOM -> E) vs disinhibition (SOM -> PV -> E). These two pathways can lead to opposing effects, and it is often difficult to predict which one results from modulating SOM neurons. In simplified circuits, the authors show how these two motifs can emerge and depend on parameters like connection weights.<br /> - Suggesting potentially interesting consequences for cortical computation. The authors suggest that certain regimes of connectivity may lead to untangling of stability and gain, such that increases in network gain are not compromised by decreasing stability. They also link SOM modulation in different connectivity regimes to versatile computations in visual processing in simple models.

      Weaknesses:

      - Computationally, the analysis is solid, but it's very similar to previous studies (del Molino et al, 2017). Many studies in the past few years have done the perturbation analysis of a similar circuitry with or without nonlinear transfer functions (some of them listed in the references). This study applies the same framework to SOM perturbations, which is a useful computational analysis, in view of the complexity of the high-dimensional parameter space.<br /> - A general weakness of the paper is a lack of direct comparison to biological parameters or experiments. How can different experiments be reconciled by the results obtained here, and what new circuit mechanisms can be revealed? In its current form, the paper reads as a general suggestion that different combinations of gain modulation and stability can be achieved in a circuit model equipped with many parameters (12 parameters). This is potentially interesting but not surprising, given the high dimensional space of possible dynamical properties. A more interesting result would have been to relate this to biology, by providing reasoning why it might be relevant to certain circuits (and not others), or to provide some predictions or postdictions, which are currently not very strong in the manuscript.<br /> - Tuning curves are simulated for an individual orientation (same for all neurons), not considering the heterogeneity of neuronal networks with multiple orientation selectivity (and other visual features) - making the model too simplistic.

    5. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studied in isolation. Support for the main conclusions is generally solid, but could be strengthened by additional analyses.

      Strengths

      The paper has improved in revision, and the new intuitive summary statements added to the end of each results section are quite helpful.

      Weaknesses

      The concern about whether the results hold outside of the range in which neural responses are linear remains. This is particularly true given the discontinuity observed in the stability measure. I appreciate the concern (provided in the response to the first round of reviews) that studying nonlinear networks requires a lot of work. A more limited undertaking would be to test the behavior of a spiking network at a few key points identified by your linearization approach. Such tests could use relatively simple (and perhaps imperfect) measures of gain and stability. This could substantially enhance the paper, regardless of the outcome.

      We appreciate the reviewer’s concern. In our resubmission we explore whether network dynamics that operate outside the regime where linearization is possible continue to show our main result on the (dis)entanglement of stability and gain; the short answer is yes. To this end we have added a new section and figure to our main text.

      “Gain and stability in stochastically forced E – PV – SOM circuits

      To confirm that our results do not depend on our approach of a linearization around a fixed point, we numerically simulate similar networks as shown above (Figure 2) in which the E and PV populations receive slowly varying, large amplitude noise (Figure 6A). This leads to noisy rate dynamics sampling a large subspace of the full firing rate grid (r<sub>E</sub>,r<sub>P</sub>) and thus any linearization would fail to describe the network response. In this stochastically forced network we explore how adding an SOM modulation or a stimulus affects this subspace (Figure 6B). To quantify stability without linearization, we assume that a network is more stable the lower the mean and variance of E rates. This is because very stable networks can better quench input fluctuations [Kanashiro et al., 2017; Hennequin et al., 2018]. To quantify gain, we calculate the change in E rates when adding the stimulus, while using identical noise realizations for stimulated and non-stimulated networks (Methods).

      For the disinhibitory network without feedback, a positive SOM modulation decreases stability due to increases of the mean and variance of E rates (Figure 6Ci) while the network gain increases (Figure 6Cii). As seen before (Figure 2A,B), stability and gain change in opposite directions in a disinhibitory circuit without feedback. Adding feedback PV → SOM and applying a negative SOM modulation increases both stability and gain and therefore disentangles the inverse relation, also in a noisy circuit (Figure 6D-F). This gives numerical support that our results do not depend on the assumption of linearization.”

      “Methods: Noisy input and numerical measurement of stability and gain

      We consider a temporally smoothed input process ξ<sub>X</sub> with white noise ζ (zero mean, standard deviation one): for populations X ∈ {E,P} with timescale τ<sub>ξ</sub> = 50 ms, σ<sub>X</sub> = 6 and fixed mean input I<sub>X</sub>. To quantify the stability of the network without linearization, we assume that a network is more stable if the mean and variance of excitatory rates are low. To quantify network gain, we freeze the white noise process ζ for the cases with and without stimulus presentation and calculate the difference of E rates at each time point, leading to a distribution of network gains (Figure 6Cii,Fii). Total simulation time is 1000 seconds.”
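
      The frozen-noise procedure described in this Methods passage can be illustrated with a minimal sketch. It is not the authors' code. The noise equation itself is not reproduced in this response, so the Ornstein-Uhlenbeck form below is an assumption, and the one-population rate model (`simulate_rate`, its quadratic transfer function, `w_self`, `tau_r`, the mean input and the stimulus size) is a hypothetical stand-in for the E – PV – SOM circuit. Only τ<sub>ξ</sub> = 50 ms and σ = 6 are taken from the text; the simulated duration is shortened to 20 s.

```python
import math
import random
import statistics


def ou_noise(n_steps, dt, tau, sigma, seed):
    """Temporally smoothed noise driven by white noise zeta (zero mean, unit SD).

    ASSUMED form: tau * dxi/dt = -xi + sigma * sqrt(2 * tau) * zeta(t),
    which gives xi a stationary standard deviation of about sigma.
    """
    rng = random.Random(seed)  # a fixed seed "freezes" the white-noise realization
    xi, trace = 0.0, []
    for _ in range(n_steps):
        zeta = rng.gauss(0.0, 1.0)
        xi += -xi * dt / tau + sigma * math.sqrt(2.0 * dt / tau) * zeta
        trace.append(xi)
    return trace


def simulate_rate(noise, dt, mean_input, stimulus, tau_r=0.01, w_self=0.5):
    """HYPOTHETICAL one-population rate model with a rectified-quadratic transfer."""
    r, rates = 0.0, []
    for xi in noise:
        drive = max(mean_input + stimulus + xi - w_self * r, 0.0)
        r += dt / tau_r * (-r + 0.04 * drive ** 2)  # Euler step
        rates.append(r)
    return rates


dt = 0.001  # seconds
noise = ou_noise(n_steps=20_000, dt=dt, tau=0.05, sigma=6.0, seed=0)

# Identical noise realization with and without the stimulus.
rates_base = simulate_rate(noise, dt, mean_input=10.0, stimulus=0.0)
rates_stim = simulate_rate(noise, dt, mean_input=10.0, stimulus=2.0)

# Gain: pointwise rate difference, giving a distribution over time points.
gains = [s - b for s, b in zip(rates_stim, rates_base)]

# Stability proxy: lower mean and variance of rates are read as "more stable".
stability_proxy = (statistics.fmean(rates_base), statistics.variance(rates_base))

print("mean gain:", statistics.fmean(gains))
print("mean and variance of unstimulated rates:", stability_proxy)
```

      Because both runs share one noise realization, the pointwise difference isolates the effect of the stimulus from the effect of the fluctuations, which is the purpose of freezing ζ.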

      We decided against using a spiking network because sufficiently asynchronous spiking network dynamics can still obey a linearized mean field theory (if the fluctuations in population firing rates are small). In our new analysis the firing rate deviations from the time averaged firing rate are sizable, making a linearization ineffective.

      In summary, based on our additional analysis of recurrent circuits with noisy inputs, we conclude that our results also hold in fluctuating networks, without the need to assume linearization around a stable fixed point.

      Reviewer #2 (Public Review):

      Summary:

      Bos and colleagues address the important question of how two major inhibitory interneuron classes in the neocortex differentially affect cortical dynamics. They address this question by studying Wilson-Cowan-type mathematical models. Using a linearized fixed point approach, they provide convincing evidence that the existence of multiple interneuron classes can explain the counterintuitive finding that inhibitory modulation can increase the gain of the excitatory cell population while also increasing the stability of the circuit’s state to minor perturbations. This effect depends on the connection strengths within their circuit model, providing valuable guidance as to when and why it arises.

      Overall, I find this study to have substantial merit. I have some suggestions on how to improve the clarity and completeness of the paper.

      Strengths:

      (1) The thorough investigation of how changes in the connectivity structure affect the gain-stability relationship is a major strength of this work. It provides an opportunity to understand when and why gain and stability will or will not both increase together. It also provides a nice bridge to the experimental literature, where different gain-stability relationships are reported from different studies.

      (2) The simplified and abstracted mathematical model has the benefit of facilitating our understanding of this puzzling phenomenon. (I have some suggestions for how the authors could push this understanding further.) It is not easy to find the right balance between biologically-detailed models vs simple but mathematically tractable ones, and I think the authors struck an excellent balance in this study.

      We thank the reviewer for their support of our work.

      Weaknesses:

      (1) The fixed-point analysis has potentially substantial limitations for understanding cortical computations away from the steady-state. I think the authors should have emphasized this limitation more strongly and possibly included some additional analyses to show that their conclusions extend to the chaotic dynamical regimes in which cortical circuits often live.

      In the response to reviewer 1 we have included model analyses that address the limitations of linearization. Rather than use a chaotic model, which would require significant effort, we opted for a stochastically forced network, where the sizable fluctuations in rate dynamics preclude linearization.

      (2) The authors could have discussed – even somewhat speculatively – how VIP interneurons fit into this picture. Their absence from this modelling framework stands out as a missed opportunity.

      We agree that including VIP neurons into the framework would be an obvious and potentially interesting next step. At this point we only include them as potential modulators of SOM neurons. Modeling their dynamics without them receiving inputs from E, PV, or SOM neurons would be uninteresting. However, including them properly into the circuit would be outside the scope of the paper.

      (3) The analysis is limited to paths within this simple E, PV, SOM circuit. This misses more extended paths (like thalamocortical loops) that involve interactions between multiple brain areas. Including those paths in the expansion in Eqs. 11-14 (Fig. 1C) may be an important consideration.

      We agree that our pathway expansion can be used to study more than just the E – PV – SOM circuit. However, properly investigating full thalamocortical loops should be done in a subsequent study.

      Comments on revisions:

      I think the authors have done a reasonable job of responding to my critiques, and the paper is in pretty good shape. (Also, thanks for correctly inferring that I meant VIP interneurons when I had written SST in my review! I have updated the public review accordingly.)

      I still think this line of research would benefit substantially from considering dynamic regimes including chaotic ones. I strongly encourage the authors to consider such an extension in future work.

      Please see our response above to Reviewer 1.

      Reviewer #3 (Public Review):

      Summary:

      Bos et al study a computational model of cortical circuits with excitatory (E) and two subtypes of inhibition: parvalbumin (PV) and somatostatin (SOM) expressing interneurons. They perform stability and gain analysis of simplified models with nonlinear transfer functions when SOM neurons are perturbed. Their analysis suggests that in a specific setup of connectivity, instability and gain can be untangled, such that SOM modulation leads to both increases in stability and gain, in contrast to the typical direction in neuronal networks where increased gain results in decreased stability.

      Strengths:

      - Analysis of the canonical circuit in response to SOM perturbations. Through numerical simulations and mathematical analysis, the authors have provided a rather comprehensive picture of how SOM modulation may affect response changes.

      - Shedding light on two opposing circuit motifs involved in the canonical E-PV-SOM circuitry - namely, direct inhibition (SOM → E) vs disinhibition (SOM → PV → E). These two pathways can lead to opposing effects, and it is often difficult to predict which one results from modulating SOM neurons. In simplified circuits, the authors show how these two motifs can emerge and depend on parameters like connection weights.

      - Suggesting potentially interesting consequences for cortical computation. The authors suggest that certain regimes of connectivity may lead to untangling of stability and gain, such that increases in network gain are not compromised by decreasing stability. They also link SOM modulation in different connectivity regimes to versatile computations in visual processing in simple models.

      We thank the reviewer for their support of our work.

      Weaknesses

      Computationally, the analysis is solid, but it’s very similar to previous studies (del Molino et al, 2017). Many studies in the past few years have done the perturbation analysis of a similar circuitry with or without nonlinear transfer functions (some of them listed in the references). This study applies the same framework to SOM perturbations, which is a useful computational analysis, in view of the complexity of the high-dimensional parameter space.

      Link to biology: the most interesting result of the paper with regard to biology is the suggestion of a regime in which gain and stability can be modulated in an unconventional way - however, it is difficult to link the results to biological networks:

      - A general weakness of the paper is a lack of direct comparison to biological parameters or experiments. How different experiments can be reconciled by the results obtained here, and what new circuit mechanisms can be revealed? In its current form, the paper reads as a general suggestion that different combinations of gain modulation and stability can be achieved in a circuit model equipped with many parameters (12 parameters). This is potentially interesting but not surprising, given the high dimensional space of possible dynamical properties. A more interesting result would have been to relate this to biology, by providing reasoning why it might be relevant to certain circuits (and not others), or to provide some predictions or postdictions, which are currently missing in the manuscript.

      - For instance, a nice motivation for the paper at the beginning of the Results section is the different results of SOM modulation in different experiments - especially between L23 (inhibition) and L4 (disinhibition). But no further explanation is provided for why such a difference should exist, in view of their results and the insights obtained from their suggested circuit mechanisms. How the parameters identified for the two regimes correspond to different properties of different layers?

      Please see our answer to the previous round of revision.

      - One of the key assumptions of the model is nonlinear transfer functions for all neuron types. In terms of modelling and computational analysis, a thorough analysis of how and when this is necessary is missing (an analysis similar to what has been attempted in Figure 6 for synaptic weights, but for cellular gains). A discussion of this, along with the former analysis to know which nonlinearities would be necessary for the results, is needed, but currently missing from the study. The nonlinearity is assumed for all subtypes because it seems to be needed to obtain the results, but it’s not clear how the model would behave in the presence or absence of them, and whether they are relevant to biological networks with inhibitory transfer functions.

      Please see our answer to the previous round of revision.

      - Tuning curves are simulated for an individual orientation (same for all), not considering the heterogeneity of neuronal networks with multiple orientation selectivity (and other visual features) - making the model too simplistic.

      Please see our answer to the previous round of revision.

      Reviewer #1 (Recommendations For The Authors):

      Introduction, first paragraph, last sentence: suggest “sense,” → “sense” (no comma)

      Introduction, second paragraph, first sentence: suggest “is been” → “has been”

      Introduction, very end of next to last paragraph: clarify “modulate the circuit”

      Figure 1 legend: can you make the “Change ...” in the legend for 1D clearer - e.g. “strengthen SOM → E connections and eliminate SOM → P connections”.

      Paragraph immediately below Figure 1: In sentence starting “Specifically ...” can you relate the cases described here back to the equation in Figure 1C?

      Sentence right below equation 2: This sentence does not separate the network gain from the cellular gain as clearly as it could.

      Page 7, second full paragraph: sentence starting “Therefore, with ...” could be split into two or otherwise made clearer.

      Sentence starting “Furthermore” right below Figure 5 has an extra comma.

      We thank the reviewer for their additional comments; we made the corresponding changes in the manuscript.

      Reviewer #3 (Recommendations For The Authors):

      There is a long part in the reply letter discussing the link to biology - but the revised manuscript doesn’t seem to reflect that.

      The information in the reply letter discussing the link to biology has been added at multiple points in the discussion. In the section ‘division of labor between PV and SOM neurons’ we mention Ferguson and Cardin 2020, in the section ‘impact of SOM neuron modulation on tuning curves’ we discuss Phillips and Hasenstaub 2016, and in the section ‘limitations and future directions’ we mention Tobin et al., 2023.

      The writing can be improved - for example, see below instances:

      P. 7: Intuitively, the inverse relationship follows for inhibitory and disinhibitory pathways (and their mixture) because the firing rate grid (heatmap) does not depend on how the SOM neurons inhibit the E - PV circuit.

      P.8: We first remark that by adding feedback E connections onto SOM neurons, changes in SOM rates can now affect the underlying heatmaps in the (rE, rP) grid.

      Not clear how “rates can affect the heatmaps”. It’s too colloquial and not scientifically rigorous or sound.

      We added further explanations at the respective places in the manuscript to improve the writing.

    1. eLife Assessment

      This is an important study showing that people who are hungry (vs. sated) put more weight on taste (vs. health) in their food choices. The experiment is well-designed and includes choice behavior, eye-tracking, and state-of-the-art computational modeling, resulting in compelling evidence supporting the conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      In this article, the authors set out to understand how people's food decisions change when they are hungry vs. sated. To do so, they used an eye-tracking experiment where participants chose between two food options, each presented as a picture of the food plus its "Nutri-Score". In both conditions, participants fasted overnight, but in the sated condition, participants received a protein shake before making their decisions. The authors find that participants in the hungry condition were more likely to choose the tastier option. Using variants of the attentional drift diffusion model, they further find that the best fitting model has different attentional discounts on the taste and health attributes, and that the attentional discount on the health information was larger for the hungry participants.

      Strengths:

      The article has many strengths. It uses a food-choice paradigm that is established in neuroeconomics. The experiment uses real foods, with accurate nutrition information, and incentivized choices. The experimental manipulation is elegant in its simplicity - administering a high-calorie protein shake. It is also commendable that the study was within-participant. The experiment also includes hunger and mood ratings to confirm the effectiveness of the manipulation. The modeling work is impressive in its rigor - the authors test 8 different variants of the DDM, including recent models like the maaDDM, as well as some completely new variants (maaDDM2phi and 2phisp). The model fits decisively favor the maaDDM2phi.

      Weaknesses:

      While I do appreciate the within-participant design, it does raise a small concern about potential demand effects. The authors' results would have been more compelling if they had replicated when only analyzing the first session from each participant. However, the authors did demonstrate that there was no effect of order on the results, which helps to alleviate this concern.

    3. Reviewer #2 (Public review):

      Summary:

      This study investigates the effect of fed vs hungry state on food decision making.

      70 participants performed a computerized food choice task with eye tracking. Food images came from a validated set with variability in food attributes. Foods ranged from low caloric density unprocessed (fruits) to high caloric density processed foods (chips and cookies).

      Prior to the choice task participants rated images for taste, health, wanting, and calories. In the choice task participants simply selected one of two foods. They were told to pick the one they preferred. Screens consisted of two food pictures along with their "Nutri-Score". They were told that one preferred food would be available for consumption at the end.

      A drift-diffusion model (DDM) was fit to the reaction time values. Eye tracking was used to measure dwell time on each part of the monitor.

      Findings: participants tended to select the item they had rated as "tastier", however, health also contributed to decisions.

      Strengths:

      The most interesting and innovative aspect of the paper is the use of the DDM models to infer from reaction time and choice the relative weight of the attributes.

      Were the ratings re-done at each session? E.g. were all tastiness ratings for the sated session made while sated? This is relevant as one would expect the ratings of tastiness and wanting to be affected by current fed state.

      Weaknesses:

      My main criticism, which doesn't affect the underlying results, is that the labeling of food choices as being taste- or health-driven is misleading. Participants were not cued to select health vs taste. Studies in which people were cued to select for taste vs health exist (and are cited here). Also, the label "healthy" is misleading, as here it seems to be strongly related to caloric density. A high-calorie food is not intrinsically unhealthy (even if people rate it as such). The suggestion that hunger impairs making healthy decisions is not quite the correct interpretation of the results here (even though everyone knows it to be true). Another interpretation is that hungry people in negative calorie balance simply prefer more calories.

      Comments on revisions: No further comments - all my questions addressed.

    4. Reviewer #3 (Public review):

      Summary:

      This well-powered study tested the effects of hunger on value-based dietary decision-making. The main hypothesis was that attentional mechanisms guide choices toward unhealthier and tastier options when participants are hungry (fasted) compared to when they are satiated. Participants were tested twice - in a fasted state and in a satiated state after consuming a protein shake. Attentional mechanisms were measured during dietary decision-making by linking food choices and reaction times to eye-tracking data and mathematical drift-diffusion models. The results showed that hunger makes high-conflict food choices more taste-driven and less health-driven. This effect was formally mediated by relative dwell time, which approximates attention drawn to chosen relative to unchosen options. Computational modeling showed that a drift-diffusion model, which assumed that food choices result from a noisy accumulation of evidence from multiple attributes (i.e., taste and health) and discounted non-looked attributes and options, best explained observed choices and reaction times.

      Strengths:

      This study's findings are valuable for understanding how energy states affect decision-making and provide an answer to how hunger can lead to unhealthy choices. These insights are relevant to psychology, behavioral economics, and behavioral change intervention designs.

      The study has a well-powered sample size and hypotheses were pre-registered. The analyses comprised classical linear models and non-linear computational modeling to offer insight into putative cognitive mechanisms.

      In summary, the study advances the understanding of the links between energy states and value-based decision-making by showing that energy depletion powerfully shapes the formation of food preferences. Moreover, the computational analysis part offers a plausible mechanistic explanation at the algorithmic level of observed effects.

      Weaknesses:

      Some parts of the positioning of the hunger state manipulation and the interpretation of its effects could be improved.

      On the positioning side, it does not seem like a 'bad' decision to replenish energy states when hungry by preferring tastier, more often caloric options. In this sense, it is unclear whether the observed behavior in the fasted state is a fallacy or a response to signals from the body. The introduction does mention these two aspects of preferring more caloric food when hungry. However, some ambiguity remains about whether the study results indeed reflect suboptimal choice behavior or a healthy adaptive behavior to restore energy stores.

      On the interpretation side, previous work has shown that beliefs about the nourishing and hunger-killing effectiveness of drinks or substances influence subjective and objective markers of hunger, including value-based dietary decision-making, and attentional mechanisms approximated by computational models and the activation of cognitive control regions in the brain. The present study shows differences between the protein shake and a natural history condition (fasted state). This experimental design, however, cannot distinguish between alternative interpretations of observed effects. Notably, effects could be due to (a) the drink's active, nourishing ingredients, (b) consuming a drink versus nothing, or (c) both.

      Comments on revisions:

      The authors addressed all my comments appropriately and I have no further requests. Thank you for the added discussion of findings and extra analyses.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank the editors and the reviewers for their time and constructive comments, which helped us to improve our manuscript “The Hungry Lens: Hunger Shifts Attention and Attribute Weighting in Dietary Choice” substantially. In the following we address the comments in depth:

      R1.1: First, in examining some of the model fits in the supplements, e.g. Figures S9, S10, S12, S13, it looks like the "taste weight" parameter is being constrained below 1. Theoretically, I understand why the authors imposed this constraint, but it might be unfairly penalizing these models. In theory, the taste weight could go above 1 if participants had a negative weight on health. This might occur if there is a negative correlation between attractiveness and health and the taste ratings do not completely account for attractiveness. I would recommend eliminating this constraint on the taste weight.

      We appreciate the reviewer’s suggestion to test a multi-attribute attentional drift-diffusion model (maaDDM) that does not constrain the taste and health weights to the range of 0 and 1. We tested two versions of such a model. First, we removed the phi-transformation, allowing the weight to take on any value (see Author response image 1). The results closely matched those found in the original model. Partially consistent with the reviewer’s comment, the health weight became slightly negative in some individuals in the hungry condition. However, this model had convergence issues, with a maximal Rhat of 4.302. Therefore, we decided to run a second model in which we constrained the weights to be between -1 and 2. Again, we obtained effects that matched the ones found in the original model (see Author response image 2), but again we had convergence issues. These convergence issues could arise from the fact that the models become almost unidentifiable when both attention parameters (theta and phi) as well as the weight parameters are unconstrained.
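
      For readers unfamiliar with the Rhat diagnostic cited here, the following is a minimal sketch of the classic Gelman-Rubin statistic. It is an illustration only: the estimator actually used by the authors' sampling software is not stated in this response and may differ (for example a split or rank-normalized variant), and the chains below are synthetic draws, not model output.

```python
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length MCMC chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)

# Synthetic example 1: four chains sampling the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Synthetic example 2: chains stuck around different values.
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)]

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print("well-mixed chains:", rhat_mixed)
print("chains in different regions:", rhat_stuck)
```

      Values close to 1 indicate that the chains agree with one another. Values such as the 4.302 reported above mean that between-chain differences dominate the within-chain variance, which is consistent with the authors' reading that the unconstrained models are nearly unidentifiable.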

      Author response image 1.

      Author response image 2.

      R1.2: Second, I'm not sure about the mediation model. Why should hunger change the dwell time on the chosen item? Shouldn't this model instead focus on the dwell time on the tasty option?

      We thank the reviewer for spotting this inconsistency. In our GLMMs and the mediation model, we indeed used the proportion of dwell time on the tasty option as predictors and mediator, respectively. The naming and description of this variable was inconsistent in our manuscript and the supplements. We have now rephrased both consistently.

      R1.3: Third, while I do appreciate the within-participant design, it does raise a small concern about potential demand effects. I think the authors' results would be more compelling if they replicated when only analyzing the first session from each participant. Along similar lines, it would be useful to know whether there was any effect of order.

      R3.2: On the interpretation side, previous work has shown that beliefs about the nourishing and hunger-killing effectiveness of drinks or substances influence subjective and objective markers of hunger, including value-based dietary decision-making, and attentional mechanisms approximated by computational models and the activation of cognitive control regions in the brain. The present study shows differences between the protein shake and a natural history condition (fasted state). This experimental design, however, cannot distinguish between alternative interpretations of observed effects. Notably, effects could be due to (a) the drink's active, nourishing ingredients, (b) consuming a drink versus nothing, or (c) both. […]

      R3 Recommendation 1:

      Therefore, I recommend discussing potential confounds due to expectancy or placebo effects on hunger ratings, dietary decision-making, and attention. […] What were verbatim instructions given to the participants about the protein shake and the fasted, hungry condition? Did participants have full knowledge about the study goals (e.g. testing hunger versus satiation)? Adding the instructions to the supplement is insightful for fully harnessing the experimental design and frame.

      Both reviewer 1 and reviewer 3 raise potential demand/expectancy effects, which we addressed in several ways. First, we have translated and added participants’ instructions to the supplements SOM 6, in which we transparently communicate the two conditions to the participants. Second, we have added a paragraph in the discussion section addressing potential expectancy/demand effects in our design:

      “The present results and supplementary analyses clearly support the two-fold effect of hunger state on the cognitive mechanisms underlying choice. However, we acknowledge potential demand effects arising from the within-subject protein-shake manipulation. A recent study (Khalid et al., 2024) showed that labeling water to decrease or increase hunger affected participants’ subsequent hunger ratings and food valuations. For instance, participants expecting the water to decrease hunger showed less wanting for food items. DDM modeling suggested that this placebo manipulation affected both drift rate and starting point. The absence of a starting point effect in our data speaks against any prior bias in participants due to any demand effects. Yet, we cannot rule out that such effects affected the decision-making process, for example by increasing the taste weight (and thus the drift rate) in the hungry condition.”

      Third, we followed Reviewer 1’s suggestion and tested whether the order of testing affected the results. We did so by adding “order” to the main choice and response time (RT) GLMMs. We found an effect of order neither on choice (β<sub>order</sub> = -0.001, SE = 0.163, p<.995) nor on RT (β<sub>order</sub> = 0.106, SE = 0.205, p<.603), and the original effects remain stable (see Author response table 1a and Author response table 2a below). Further, we used two ANOVAs to compare models with and without the predictor “order”. The ANOVAs indicated that GLMMs without “order” better explained choice and RT (see Author response table 1b and Author response table 2b). Taken together, these results suggest that demand effects played a negligible role in our study.
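
      As a rough illustration of what such a nested-model comparison computes, the sketch below implements a one-degree-of-freedom likelihood-ratio test and AIC. It is not the authors' analysis: the helper names are hypothetical, the log-likelihoods and parameter counts are placeholders chosen only to show the arithmetic (they are not values from the response tables), and the exact output of the authors' ANOVA routine is not specified here.

```python
import math


def likelihood_ratio_test_df1(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns the chi-square statistic and its p-value (df = 1), using
    P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate the preferred model."""
    return 2.0 * n_params - 2.0 * loglik


# PLACEHOLDER values for illustration only (not results from the study).
loglik_without_order, k_without = -1000.00, 6
loglik_with_order, k_with = -999.99, 7

stat, p_value = likelihood_ratio_test_df1(loglik_without_order, loglik_with_order)
print("LR statistic:", stat, "p-value:", p_value)
print("AIC without order:", aic(loglik_without_order, k_without))
print("AIC with order:", aic(loglik_with_order, k_with))
```

      When the added predictor barely changes the log-likelihood, the test is non-significant and the penalty term makes the simpler model's AIC lower, which is the sense in which a model without “order” can be said to explain the data better.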

      Author response table 1.

      a) GLMM: Results of Tasty vs Healthy Choice Given Condition, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximations. Model equation: choice ~ condition + scale(rel_taste_DT) + order + (1+condition|subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference.

      b) Model Comparison

      Author response table 2.

      a) GLMM: Response Time Given Condition, Choice, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximations. Model equation: RT ~ choice + condition + scale(rel_taste_DT) + order + choice * scale(rel_taste_DT) + (1+condition|subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference.

      b) Model Comparison

      R1.4: Fourth, the authors report that tasty choices are faster. Is this a systematic effect, or simply due to the fact that tasty options were generally more attractive? To put this in the context of the DDM, was there a constant in the drift rate, and did this constant favor the tasty option?

      We thank the reviewer for their observant remark about faster tasty choices and potential links to the drift rate. While our starting point models show that there might be a small starting point bias towards the taste boundary, which would result in faster tasty decisions, we took a closer look at the simulated value differences as obtained in our posterior predictive checks to see if the drift rate was systematically more extreme for tasty choices (Author response image 3). In line with the reviewer’s suggestion that tasty options were generally more attractive, tasty decisions were associated with higher value differences (i.e., further away from 0) and consequently with faster decisions. This indicates that the main reason for faster tasty choices was a higher drift rate in those trials (as a consequence of the combination of attribute weights and attribute values rather than “a constant in the drift rate”), whereas a strong starting point bias played only a minor role.
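
      The link between larger value differences and faster decisions can be illustrated with a minimal simulation of a basic drift-diffusion process. This is a generic sketch, not the maaDDM2𝜙 fitted by the authors: the symmetric boundaries, unit noise, time step, and the two drift values are assumptions chosen for illustration, and attention and attribute weights are not modeled.

```python
import math
import random
import statistics


def simulate_ddm(drift, rng, boundary=1.0, start=0.0, noise_sd=1.0,
                 dt=0.001, max_t=10.0):
    """Simulate one trial of a basic DDM with boundaries at +/- boundary.

    Returns (hit_upper, decision_time). All parameter values are illustrative.
    """
    x, t = start, 0.0
    while abs(x) < boundary and t < max_t:
        x += drift * dt + noise_sd * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return x >= boundary, t


def summarize(drift, n_trials, seed):
    """Mean decision time and proportion of upper-boundary choices."""
    rng = random.Random(seed)
    trials = [simulate_ddm(drift, rng) for _ in range(n_trials)]
    mean_rt = statistics.fmean([t for _, t in trials])
    p_upper = sum(hit for hit, _ in trials) / n_trials
    return mean_rt, p_upper


# In this sketch the drift is assumed to scale with the value difference.
mean_rt_low, p_upper_low = summarize(drift=0.5, n_trials=400, seed=0)
mean_rt_high, p_upper_high = summarize(drift=2.0, n_trials=400, seed=0)

print("small value difference: mean RT", mean_rt_low, "P(upper)", p_upper_low)
print("large value difference: mean RT", mean_rt_high, "P(upper)", p_upper_high)
```

      With the starting point held at zero in both conditions, any difference in mean decision time here comes from the drift alone, mirroring the argument that faster tasty choices need not imply a strong starting-point bias.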

      Author response image 3.

      Note. Value Difference as obtained from Posterior Predictive Checks of the maaDDM2𝜙 in hungry and sated condition for healthy (green) and tasty (orange) choices.

      R1.5: Fifth, I wonder about the mtDDM. What are the units on the "starting time" parameters? Seconds? These seem like minuscule effects. Do they align with the eye-tracking data? In other words, which attributes did participants look at first? Was there a correlation between the first fixations and the relative starting times? If not, does that cast doubt on the mtDDM fits? Did the authors do any parameter recovery exercises on the mtDDM?

      We thank Reviewer 1 for their observant remarks about the mtDDM. In line with their suggestion, we have performed a parameter recovery, which led to a good recovery of all parameters except relative starting time (rst). In addition, we had convergence issues for rst, as revealed by R-hat values around 20. Together, these results indicate potential limitations of the mtDDM when applied to tasks with substantially different visual representations of attributes leading to differences in dwell time for each attribute (see Figure 3b and Figure S6b). We have therefore decided not to report the mtDDM in the main paper, only leaving a remark about convergence and recovery issues.

      R2: My main criticism, which doesn't affect the underlying results, is that the labeling of food choices as being taste- or health-driven is misleading. Participants were not cued to select health vs taste. Studies in which people were cued to select for taste vs health exist (and are cited here). Also, the label "healthy" is misleading, as here it seems to be strongly related to caloric density. A high-calorie food is not intrinsically unhealthy (even if people rate it as such). The suggestion that hunger impairs making healthy decisions is not quite the correct interpretation of the results here (even though everyone knows it to be true). Another interpretation is that hungry people in negative calorie balance simply prefer more calories.

      First, we agree with the reviewer that it should be tested to what extent participants’ choice behavior can be reduced to contrasting taste vs. health aspects of their dietary decisions (but note that prior to making decisions, they were asked to rate these aspects and thus likely primed to consider them in the choice task). Having this question in mind, we performed several analyses to demonstrate the suitability of framing decisions as contrasting taste vs. health aspects (including the PCA reported in the Supplemental Material).

      Second, we agree with the reviewer that, despite a negative correlation between caloric density and health (Author response image 4), high-caloric items are not intrinsically unhealthy. In our study, this may apply to only two stimuli (nuts and dried fruit), which our participants also recognized as such.

      Finally, Reviewer 2’s alternative explanation, that hungry individuals prefer more calories, is tested in SOM5. In line with the reviewer’s interpretation, we show that hungry individuals are indeed more likely to select higher caloric options. This effect is even stronger than the effect of hunger state on tasty vs healthy choice. However, in this paper we were interested in the effect of hunger state on tasty vs healthy decisions, a contrast that is often used in modeling studies (e.g., Barakchian et al., 2021; Maier et al., 2020; Rramani et al., 2020; Sullivan & Huettel, 2021). In sum, we agree with Reviewer 2 in all aspects and have tested and provided evidence for their interpretation, which we do not see as conflicting with ours.

      Author response image 4.

      Note. Strong negative correlation between health ratings and objective caloric content in both the hungry (r = -.732, t(64) = -8.589, p < .001) and sated conditions (r = -.731, t(64) = -8.569, p < .001).
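
      As a quick consistency check on the statistics in this note, the t value of a Pearson correlation follows from r and the degrees of freedom as t = r·sqrt(df) / sqrt(1 − r²). The sketch below assumes that the correlation was computed across the 66 food items (which would give df = 64); the function name is hypothetical, and small deviations from the reported values are expected because r is rounded to three decimals.

```python
import math


def t_from_r(r, n):
    """t statistic and degrees of freedom for a Pearson correlation r
    computed from n paired observations (df = n - 2)."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df


# n = 66 is an assumption inferred from the reported df of 64.
for label, r in [("hungry", -0.732), ("sated", -0.731)]:
    t, df = t_from_r(r, n=66)
    print(f"{label}: r = {r}, t({df}) = {t:.2f}")
```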

      R3.1: On the positioning side, it does not seem like a 'bad' decision to replenish energy states when hungry by preferring tastier, more often caloric options. In this sense, it is unclear whether the observed behavior in the fasted state is a fallacy or a response to signals from the body. The introduction does mention these two aspects of preferring more caloric food when hungry. However, some ambiguity remains about whether the study results indeed reflect suboptimal choice behavior or a healthy adaptive behavior to restore energy stores.

      We thank Reviewer 3 for this remark, which encouraged us to interpret the results also from a slightly different perspective. We agree that choosing tasty over healthy options under hunger may be evolutionarily adaptive. We have now extended a paragraph in our discussion linking the cognitive mechanisms to neurobiological mechanisms:

      “From a neurobiological perspective, both homeostatic and hedonic mechanisms drive eating behaviour. While homeostatic mechanisms regulate eating behaviour based on energy needs, hedonic mechanisms operate independent of caloric deficit (Alonso-Alonso et al., 2015; Lowe & Butryn, 2007; Saper et al., 2002). Participants’ preference for tasty high caloric food options in the hungry condition aligns with a drive for energy restoration and could thus be taken as an adaptive response to signals from the body. On the other hand, our data shows that participants preferred less healthy options also in the sated condition. Here, hedonic drivers could predominate indicating potentially maladaptive decision-making that could lead to adverse health outcomes if sustained. Notably, our modeling analyses indicated that participants in the sated condition showed reduced attentional discounting of health information, which poses potential for attention-based intervention strategies to counter hedonic hunger. This has been investigated for example in behavioral (Barakchian et al., 2021; Bucher et al., 2016; Cheung et al., 2017; Sullivan & Huettel, 2021), eye-tracking (Schomaker et al., 2022; Vriens et al., 2020) and neuroimaging studies (Hare et al., 2011; Hutcherson & Tusche, 2022) showing that focusing attention on health aspects increased healthy choice. For example, Hutcherson and Tusche (2022) compellingly demonstrated that the mechanism through which health cues enhance healthy choice is shaped by increased value computations in the dorsolateral prefrontal cortex (dlPFC) when cue and choice are conflicting (i.e., health cue, tasty choice). In the context of hunger, these findings together with our analyses suggest that drawing people’s attention towards health information will promote healthy choice by mitigating the increased attentional discounting of such information in the presence of tempting food stimuli.”

      Recommendations for the authors:

      R1: The Results section needs to start with a brief description of the task. Otherwise, the subsequent text is difficult to understand.

      We included a paragraph at the beginning of the results section briefly describing the experimental design.

      R1/R2: In Figure 1a it might help the reader to have a translation of the rating scales in the figure legend.

      We have implemented an English rating scale in Figure 1a.

      R2: Were the ratings redone at each session? E.g. were all tastiness ratings for the sated session made while sated? This is relevant as one would expect the ratings of tastiness and wanting to be affected by the current fed state.

      The ratings were done at the respective sessions. As shown in S3a there is a high correlation of taste ratings across conditions. We decided to take the ratings of the respective sessions (rather than mean ratings across sessions) to define choice and taste/health value in the modeling analyses, for several reasons. First, by using mean ratings we might underestimate the impact of particularly high or low ratings that drove choice in the specific session (regression to the mean). Second, for the modeling analysis in particular, we want to model a decision-making process at a particular moment in time. Consequently, the subjective preferences in that moment are more accurate than mean preferences.

      R2: It would be helpful to have a diagram of the DDM showing the drifting information to the boundary, and the key parameters of the model (i.e. showing the nDT, drift rate, boundary, and other parameters). (Although it might be tricky to depict all 9 models).

      We thank the reviewer for their recommendation and have created Figure 6, which illustrates the decision-making process as depicted by the maaDDM2phi.

      R3.1: Past work has shown that prior preferences can bias/determine choices. This effect might have played a role during the choice task, which followed wanting, taste, health, and calorie ratings during which participants might have already formed their preferences. What are the authors' positions on such potential confound? How were the food images paired for the choice task in more detail?

      The data reported here were part of a larger experiment. Next to the food rating and choice task, participants also completed a social preference rating and choice task, as well as rating and choice tasks for intertemporal discounting. These tasks were counterbalanced such that first the three rating tasks were completed in counterbalanced order and second the three choice tasks were completed in the same order (e.g. food rating, social rating, intertemporal rating; food choice, social choice, intertemporal choice). This means that there were always two other tasks between the food rating and food choice task. In addition to the temporal delay between rating and choice tasks, our modeling analyses revealed that models including a starting point bias performed worse than those without the bias. Although we cannot rule out that participants might occasionally have tried to make their decision before the actual task (e.g., by keeping their most/least preferred option in mind and then automatically choosing/rejecting it in the choice task), we think that both our design and our modeling analyses speak against any systematic bias of preference in our choice task. The options were paired such that approximately half of the trials were random, while for the other half one option was rated healthier and the other option was rated tastier (e.g., Sullivan & Huettel, 2021).

      R3.2: In line with this thought, theoretically, the DDMs could also be fitted to reaction times and wanting ratings (binarized). This could be an excellent addition to corroborate the findings for choice behavior.

      We have implemented several alternative modeling analyses, including taste vs health as defined by Nutri-Score (Table S12 and Figures S22-S30) and higher wanted choice vs healthy choice (Table S13; Figure S30-34). Indeed, these models corroborate those reported in the main text demonstrating the robustness of our findings.

      R3.3: The principal component analysis was a good strategy for reducing the attribute space (taste, health, wanting, calories, Nutriscore, objective calories) into two components. Still, somehow, this part of the results added confusion to harnessing in which of the analyses the health attribute corresponded only to the healthiness ratings and taste to the tastiness ratings and if and when the components were used as attributes. This source of confusion could be mitigated by more clearly stating what health and taste corresponded to in each of the analyses.

      We thank the reviewer for this recommendation and have now reported the PCA before reporting the behavioural results to clarify that choices are binarized based on participants’ taste and health ratings, rather than the composite scores. We have chosen this approach, as it is closer to our hypotheses and improves interpretability.

      R3.4: From the methods, it seems that 66 food images were used, and 39 fell into A, B, C, and D Nutriscores. How were the remaining 27 images selected, and how healthy and tasty were the food stimuli overall?

      The selection of food stimuli was done in three steps. First, from Charbonnier and colleagues' (2016) standardized food image database (available at osf.io/cx7tp/), we excluded food items that were not familiar in Germany or unavailable in regular German supermarkets. Second, we excluded products that we would not be able to incentivize easily (i.e., fast food, pastries and items that required cooking, baking or other types of preparation). Third, we added the Nutri-Scores to the remaining products, aiming to have an equal number of items for each Nutri-Score, of which approximately half of the items were sweet and the other half savory. This resulted in a final stimulus set of 66 food images (13 items = A; 13 items = B; 12 items = C; 14 items = D; 14 items = E). The experiment, including the set of food stimuli used in our study, is also uploaded here: osf.io/pef9t/. With respect to the second question, we would like to point out that preference for food stimuli is very individual; therefore, we obtained the ratings (taste, health, wanting and estimated caloric density) from each participant individually. However, we also added the objective total calories, which are positively correlated with subjective caloric density and negatively correlated with Nutri-Score (coded as A=5; B=4; C=3; D=2; E=1) and health ratings (see Figure S7).

      R3.5: It seems that the degrees of freedom for the paired t-test comparing the effects of the condition hungry versus satiated on hunger ratings were 63, although the participant sample counted 70. Please verify.

      This is correct and explained in the methods section under data analysis: “Due to missing values for one timepoint in six participants (these participants did not fill in the VAS and PANAS before the administration of the Protein Shake in the sated condition) the analyses of the hunger state manipulation had a sample size of 64.”

      R3.5: Please add the range of BMI and age of participants. Did all participants fall within a healthy BMI range?

      The BMI ranged from 17.306 to 48.684 (see Author response image 5), with the majority of participants falling within a normal BMI (i.e., between 18.5 and 24.9). In our sample, 3 participants had a BMI larger than 30. By using subject as a random intercept in our GLMMs, we accounted for potential deviations in their responses.

      Author response image 5.

      R3.5: Defining the inference criterion used for the significance of the posterior parameter chains in more detail can be pedagogical for those new to or unfamiliar with inferences drawn from hierarchical Bayesian model estimations and Bayesian statistics.

      We have added an explanation of the highest density intervals and what they mean with respect to our data in the respective result section.
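
      For readers unfamiliar with highest density intervals, the following minimal sketch shows one common way to compute an HDI from posterior samples: take the narrowest interval that contains the requested share of the sorted samples. The function names, the 95% mass and the "interval excludes zero" criterion are assumptions used for illustration; they are not necessarily the exact procedure used in the paper.

```python
import math


def hdi(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples.

    Assumes a unimodal posterior; returns (lower, upper).
    """
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]


def excludes_zero(interval):
    """Hypothetical decision rule: call an effect credible if 0 lies outside the HDI."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Toy posterior samples of a condition difference (made-up numbers).
posterior_diff = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 4.0]
interval = hdi(posterior_diff, mass=0.9)
print(interval, excludes_zero(interval))
```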

    1. eLife assessment

      Using intracellular in vitro and in vivo recordings and a deep learning approach, this study shows that mouse dentate gyrus mossy cells (MCs) and CA3 pyramidal cells process information from an important electrophysiological hallmark of the hippocampus, sharp wave-ripples (SWRs). The innovative use of deep learning to predict SWR waveforms from MC membrane potentials represents an interesting methodological advance. While the key findings are potentially fundamental, some of the evidence is currently incomplete and should be revised to better support the findings.

    2. Reviewer #1 (Public Review):

      The authors recorded from multiple mossy cells (MCs) of the dentate gyrus in slices or in vivo using anesthesia. They recorded MC spontaneous activity during spontaneous sharp waves (SWs) detected in area CA3 (in vitro) or in CA1 (in vivo). They find variability of the depolarization of MCs in response to a SW. They then used deep learning to parse out more information. They conclude that CA3 sends different "information" to different MCs. However, this is not surprising because different CA3 neurons project to different MCs and it was not determined if every SW reflected the same or different subsets of CA3 activity.

      The strengths include recording up to 5 MCs at a time. The major concerns are in the finding that there is variability. This seems logical, not surprising. Also it is not clear how deep learning could lead to the conclusion that CA3 sends different "information" to different MCs. It seems already known from the anatomy because CA3 neurons have diverse axons so they do not converge on only one or a few MCs. Instead they project to different MCs. Even if they would, there are different numbers of boutons and different placement of boutons on the MC dendrites, leading to different effects on MCs. There also is a complex circuitry that is not taken into account in the discussion or in the model used for deep learning. CA3 does not only project to MCs. It also projects to hilar and other dentate gyrus GABAergic neurons which have complex connections to each other, MCs, and CA3. Furthermore, MCs project to MCs, the GABAergic neurons, and CA3. Therefore at any one time that a SW occurs, a very complex circuitry is affected and this could have very different effects on MCs so they would vary in response to the SW. This is further complicated by use of slices where different parts of the circuit are transected from slice to slice.

      It is also not discussed if SWs have a uniform frequency during the recording session. If they cluster, or if MC action potentials occur just before a SW, or other neurons discharge before, it will affect the response of the MC to the SW. If MC membrane potential varies, this will also affect the depolarization in response to the SW.

      In vivo, the SWs may be quite different than in vitro, but this is not discussed. The circuitry is quite different from in vitro. The effects of urethane could have many confounding influences.

      Furthermore, how much the in vitro and in vivo SWs tell us about SWs in awake behaving mice is unclear.

      Also, methods and figures are hard to understand.

    3. Reviewer #2 (Public Review):

      • A summary of what the authors were trying to achieve

      Drawing from theoretical insights on the pivotal role of mossy cells (MCs) in pattern separation - a key process in distinguishing between similar memories or inputs - the authors investigated how MCs in the dentate gyrus of the hippocampus encode and process complex neural information. By recording from up to five MCs simultaneously, they focused on membrane potential dynamics linked to sharp wave-ripple complexes (SWRs) originating from the CA3 area. Indeed, using a machine learning approach, they were able to demonstrate that even a single MC's synaptic input can predict a significant portion (approximately 9%) of SWRs, and extrapolation suggested that synaptic input obtained from 27 MCs could account for 90% of the SWR patterns observed. The study further illuminates how individual MCs contribute to a distributed but highly specific encoding system. It demonstrates that SWR clusters associated with one MC seldom overlap with those of another, illustrating a precise and distributed encoding strategy across the MC network.

      • An account of the major strengths and weaknesses of the methods and results

      Strengths:

      (1) This study is remarkable because it establishes a critical link between the subthreshold activities of individual neurons and the collective dynamics of neuronal populations.

      (2) The authors utilize machine learning to bridge these levels of neuronal activity. They skillfully demonstrate the predictive power of membrane potential fluctuations for neuronal events at the population level and offer new insights into neuronal information processing.

      (3) To investigate sharp wave/ripple-related synaptic activity in mossy cells (MCs), the authors performed challenging experiments using whole-cell current-clamp recordings. These recordings were obtained from up to five neurons in vitro and from single mossy cells in live mice. The latter recordings are particularly valuable as they add to the limited published data on synaptic input to MCs during in vivo ripples.

      Weaknesses:

      (1) The model description could significantly benefit from additional details regarding its architecture, training, and evaluation processes. Providing these details would enhance the paper's transparency, facilitate replication, and strengthen the overall scientific contribution. For further details, please see below.

      (2) The study recognizes the concept of pattern separation, a central process in hippocampal physiology for discriminating between similar inputs to form distinct memories. The authors refer to a theoretical paper by Myers and Scharfman (2011) that links pattern separation with activity backpropagating from CA3 to mossy cells. Despite this initial citation, the concept is not discussed again in the context of the new findings. Given the significant role of MCs in the dentate gyrus, where pattern separation is thought to occur, it would be valuable to understand the authors' perspective on how their findings might relate to or contribute to existing theories of pattern separation. Could the observed functions of MCs elucidated in this study provide new insights into their contribution to processes underlying pattern separation?

      (3) Previous work concluded that sharp waves are associated with mossy cell inhibition, as evidenced by a consistent ripple function-related hyperpolarization of the membrane potential in these neurons when recorded at resting membrane potential (Henze & Buzsáki, 2007). In contrast, the present study reveals an SWR-induced depolarization of the membrane potential. Can the authors explain the observed modulation of the membrane potential during CA1 ripples in more detail? What was the proportion of cases of depolarization or hyperpolarization? What were the respective amplitude distributions? Were there cases of activation of the MCs, i.e., spiking associated with the ripple? This more comprehensive information would add significance to the study as it is not currently available in the literature.

      (4) In the study, the observation that mossy cells (MCs) in the lower (infrapyramidal) blade of the dentate gyrus (DG) show higher predictability in SWR patterns is both intriguing and notable. This finding, however, appears to be mentioned without subsequent in-depth exploration or discussion. One wonders if this observed predictability might be influenced by potential disruptions or severed connections inherent to the brain slice preparation method used. Furthermore, it prompts the question of whether similar observations or trends have been noted in MCs recorded in vivo, which could either corroborate or challenge this intriguing in vitro finding.

      (5) The study's comparison of SWR predictability by mossy cells (MCs) is complicated by using different recording sites: CA3 for in vitro and CA1 for in vivo experiments, as shown in Fig. 2. Since CA1-SWRs can also arise from regions other than CA3 (see e.g. Oliva et al., 2016, Yamamoto and Tonegawa, 2017), it is difficult to reconcile in vitro and in vivo results. Addressing this difference and its implications for MC predictability in the results discussion would strengthen the study.

      • An appraisal of whether the authors achieved their aims, and whether the results support their conclusions

      As outlined in the abstract and introduction, the primary aim is to investigate the role of MCs in encoding neuronal information during sharp wave ripple complexes, a crucial neuronal process involved in memory consolidation and information transmission in the hippocampus. It is clear from the comprehensive details in this study that the authors have meticulously pursued their goals by providing extensive experimental evidence and utilizing innovative machine learning techniques to investigate the encoding of information in the hippocampus by mossy cells (MCs). Together, this study provides a compelling account supported by rigorous experimental and analytical methods. Linking subthreshold membrane potentials and population activity by machine learning provides a comprehensive new analytic approach and sheds new light on the role of MCs in information processing in the hippocampus. The study not only achieves the stated goals, but also provides novel methodology, and valuable insights into the dynamics of neural coding and information flow in the hippocampus.

      • A discussion of the likely impact of the work on the field, and the utility of the methods and data to the community

      Impact: Both the novel methodology and the provided biological insights will be of great interest to the community.

      Utility of methods/data: The applied deep learning approach will be of particular interest if the authors provide more details to improve its reproducibility (see related suggestions below).

    4. Reviewer #3 (Public Review):

      Compared to the pyramidal cells of the CA1 and CA3 regions of the hippocampus, and the granule cells of the dentate gyrus (DG), the computational role(s) of mossy cells of the DG have received much less attention over the years and are consequently not well understood. Mossy cells receive feedforward input from granule cells and feedback from CA3 cells. One significant factor is the compression of the large number of CA3 cells that input onto a much smaller population of mossy cells, which then send feedback connections to the granule cell layer. The present paper seeks to understand this compression in terms of neural coding, and asks whether the subthreshold activity of a small number of mossy cells can predict above chance levels the shapes of individual SWs produced by the CA3 cells. Using elegant multielectrode intracellular recordings of mossy cells, the authors use deep learning networks to show that they can train the network to "predict" the shape of a SW that preceded the intracellular activity of the mossy cells. Putatively, a single mossy cell can predict the shape of SWs above chance. These results are interesting, but there are some conceptual issues and questions about the statistical tests that must be addressed before the results can be considered convincing.

      Strengths

      (1) The paper uses technically challenging techniques to record from multiple mossy cells at the same time, while also recording SWs from the LFP of the CA3 layer. The data appear to be collected carefully and analyzed thoughtfully.

      (2) The question of how mossy cells process feedback input from CA3 is important to understand the role of this feedback pathway in hippocampal processing.

      (3) Given the concerns expressed below about proper statistical testing are resolved, the data appear supportive of the main conclusions of the authors and suggest that, to some degree, the much smaller population of mossy cells can conserve the information present in the larger population of CA3 cells, presumably by using a more compressed, dense population code.

      Weaknesses

      (4) Some of the statistical tests appear inappropriate because they treat each CA3 SW and associated Vm from a mossy cell as independent samples. This violates the assumptions of statistical tests such as the Kolmogorov-Smirnov tests of Figure 3C and Fig 3E. Although there is large variability among the SWs recorded and among the Vm's, they cannot be considered independent measurements if they derive from the same cell and same recording site of an individual animal. This becomes especially problematic when the number of dependent samples adds up to the tens of thousands, providing highly inflated numbers of samples that artificially reduce the p values. Techniques such as mixed-effects models are being increasingly used to factor out the effects of within cell and within animal correlations in the data. The authors need to do something similar to factor out these contributions in order to perform statistical tests, throughout the manuscript when this problem occurs.

      (5) A separate statistical problem occurs when comparing real data against a shuffled, surrogate data set. From the methods, I gather that Figure 3C combined data from 100 surrogate shuffles to compare to the real data. It is inappropriate to do a classic statistical test of data against such shuffles, because the number of points in the pooled surrogate data sets are not true samples from a population. It is a mathematical certainty that one can eventually drive a p value to < 0.05 just by increasing the number of shuffles sufficiently. Thus, the p value is determined by the number of computer shuffles allowed by the time and processing power of a computer, rather than by sampling real data from the population. Figures such as 4C and 5A are examples that test data against shuffle appropriately, as a single value is determined to be within or outside the 95% confidence interval of the shuffle, and this determination is not directly affected by the number of shuffles performed.

      (6) The last line of the Discussion states that this study provides "important insights into the information processing of neural circuits at the bottleneck layer," but it is not clear what these insights are. If the statistical problems are addressed appropriately, then the results do demonstrate that the information that is reflected in SWs can be reconstructed by cells in the MC bottleneck, but it is not certain what conceptual insights the authors have in mind. They should discuss more how these results further our understanding of the function of the feedback connection from CA3 to the mossy cells, discuss any limitations on their interpretation from recording LFPs rather than the single-unit ensemble activity (where the information is really encoded).

      (7) In Figure 1C, the maximum of the MC response on the first inset precedes the SW, and the onset of the Vm response may be simultaneous with SW. This would suggest that the SW did not drive the mossy cell, but this was a coincident event. How many SW-mossy cell recordings are like this? Do the authors have a technical reason to believe that these are events in which the mossy cell is driven by the CA3 cells active during the SW?
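
      The shuffle-based comparison described in point (5) can be sketched as follows: a single observed statistic is compared with the distribution of the same statistic across shuffles, so that adding shuffles refines the estimate of the p value rather than inflating the sample size of a pooled test. This is a generic illustration with made-up data and hypothetical function names, not the analysis used in the manuscript; the statistic (a Pearson correlation) is likewise an assumption.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shuffle_p_value(x, y, n_shuffles=1000, seed=0):
    """One-sided empirical p value of the observed correlation against shuffles.

    Each shuffle breaks the pairing between x and y. The p value is the share
    of shuffled statistics at least as large as the observed one, with the
    usual +1 correction so that it can never be exactly zero.
    """
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_perm)
        if pearson_r(x, y_perm) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_shuffles + 1)


# Made-up example: y depends strongly on x.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
observed, p = shuffle_p_value(x, y)
print(f"observed r = {observed:.3f}, empirical p = {p:.4f}")
```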

    5. Author response:

      Reviewer #1 (Public Review):

      We are grateful to this reviewer for her/his constructive comments, which have greatly improved our work. Individual responses are provided below.

      The authors recorded from multiple mossy cells (MCs) of the dentate gyrus in slices or in vivo using anesthesia. They recorded MC spontaneous activity during spontaneous sharp waves (SWs) detected in area CA3 (in vitro) or in CA1 (in vivo). They find variability of the depolarization of MCs in response to a SW. They then used deep learning to parse out more information. They conclude that CA3 sends different "information" to different MCs. However, this is not surprising because different CA3 neurons project to different MCs and it was not determined if every SW reflected the same or different subsets of CA3 activity.

      Thank you for your valuable comments. We agree that our finding that different MCs receive different information is unsurprising. These data are, in fact, to be expected from the anatomical knowledge of the circuit structure. However, as a physiological finding, there is a certain value in proving this fact; please note that it was not clear whether the neural activity of individual MCs received heterogeneous/variable information at the physiological level. It was therefore necessary to investigate this by recording neural activity. We believe this study is important because it quantitatively demonstrates this fact.

      The strengths include recording up to 5 MCs at a time. The major concerns are in the finding that there is variability. This seems logical, not surprising. Also it is not clear how deep learning could lead to the conclusion that CA3 sends different "information" to different MCs. It seems already known from the anatomy because CA3 neurons have diverse axons so they do not converge on only one or a few MCs. Instead they project to different MCs. Even if they would, there are different numbers of boutons and different placement of boutons on the MC dendrites, leading to different effects on MCs. There also is a complex circuitry that is not taken into account in the discussion or in the model used for deep learning. CA3 does not only project to MCs. It also projects to hilar and other dentate gyrus GABAergic neurons which have complex connections to each other, MCs, and CA3. Furthermore, MCs project to MCs, the GABAergic neurons, and CA3. Therefore at any one time that a SW occurs, a very complex circuitry is affected and this could have very different effects on MCs so they would vary in response to the SW. This is further complicated by use of slices where different parts of the circuit are transected from slice to slice.

      The first half of this paragraph is closely related to the previous paragraph. We propose that the variation in membrane potential of the simultaneously recorded MCs allows for the expression of diverse information. We also believe that this is highly novel in that no previous work has described the extent to which SWR is encoded in MCs. Our study proposes a new quantitative method that relates two variables (LFP and membrane potential) that are inherently incomparable. Specifically, we used machine learning (please note that it is a neural network, but not "deep learning") to achieve this quantification, and we believe this innovation is noteworthy.

      In the latter part of this paragraph, you raise another important point. First, we would like to point out that this comment contains a slight misunderstanding. Our goal is not to reproduce the circuit structure of the hippocampus in silico but to propose a "function (or mapping/transformation)" that connects the two different modalities, i.e., LFP and Vm. This function should be as simple as possible, which is desirable from an explanatory point of view. In this respect, our machine learning model is a 'perceptron'-like 3-layer neural network. That one of the simplest classical neural network models can predict the LFP waveform from Vm is quite surprising, and it is an achievement we had not anticipated. The fact that our model does not consider dendrites or inhibitory neurons is not a drawback but an important advantage. On the other hand, the fact that the data we used for our predictions were primarily obtained using slice experiments may be a drawback of this study, and we agree with your comments. However, we can argue that the new quantitative method we propose here is versatile since we showed that the same machine learning can be used to predict in vivo single-cell data.
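
      To make the idea of a "function" linking the two modalities concrete, the following is a minimal sketch of a perceptron-like 3-layer network (input, one hidden layer, output) that maps a short Vm trace to an LFP waveform. It is not the authors' implementation: the layer sizes, the tanh activation, the learning rate, the training schedule, and the synthetic Vm/LFP pairs are all assumptions chosen only for illustration.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (initialization scheme is an assumption).
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(x, params):
    # Input -> tanh hidden layer -> linear output (the predicted waveform).
    (w1, b1), (w2, b2) = params
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    y = [sum(wi * hi for wi, hi in zip(row, h)) + b for row, b in zip(w2, b2)]
    return h, y

def train_step(x, target, params, lr):
    # One stochastic gradient step on the squared error 0.5 * sum((y - target)^2).
    (w1, b1), (w2, b2) = params
    h, y = forward(x, params)
    err = [yi - ti for yi, ti in zip(y, target)]
    # Gradient at the hidden layer, computed before the output weights are updated.
    grad_h = [sum(err[k] * w2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
              for j in range(len(h))]
    for k, e in enumerate(err):
        for j, hj in enumerate(h):
            w2[k][j] -= lr * e * hj
        b2[k] -= lr * e
    for j, g in enumerate(grad_h):
        for i, xi in enumerate(x):
            w1[j][i] -= lr * g * xi
        b1[j] -= lr * g

def rmse(data, params):
    # Root-mean-square error over all output samples of all (Vm, LFP) pairs.
    sq = [(yi - ti) ** 2 for x, t in data for yi, ti in zip(forward(x, params)[1], t)]
    return math.sqrt(sum(sq) / len(sq))

def make_synthetic(n, n_in, n_out, rng):
    # Hypothetical stand-in for real recordings: each "LFP" is a fixed nonlinear
    # mixture of the "Vm" samples plus a little noise.
    mix = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    data = []
    for _ in range(n):
        vm = [rng.gauss(0, 1) for _ in range(n_in)]
        lfp = [math.tanh(sum(m * v for m, v in zip(row, vm))) + rng.gauss(0, 0.05)
               for row in mix]
        data.append((vm, lfp))
    return data

rng = random.Random(0)
data = make_synthetic(200, 6, 4, rng)                      # 6 Vm samples -> 4 LFP samples
params = [init_layer(6, 8, rng), init_layer(8, 4, rng)]    # 8 hidden units (assumed)
rmse_before = rmse(data, params)
for _ in range(50):
    for x, t in data:
        train_step(x, t, params, lr=0.02)
rmse_after = rmse(data, params)
```

      In a real analysis the error would be evaluated on held-out events and compared with surrogate (shuffled) pairings rather than on the training set, as the exchanges on statistics later in this response discuss.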

      It is also not discussed if SWs have a uniform frequency during the recording session. If they cluster, or if MC action potentials occur just before a SW, or other neurons discharge before, it will affect the response of the MC to the SW. If MC membrane potential varies, this will also affect the depolarization in response to the SW.

      Thank you for raising an important point. We have done some additional analyses in response to your comment. First, we plotted how the SWR parameters fluctuated during our recording time (especially for data recorded for long periods of more than 5 minutes). As shown in the new Figure 1 - figure supplement 4, the frequency of SWRs remained uniform during the recording time. These data support the rationale for pooling data over time.

      We also calculated the average membrane potentials of MCs before and after SWRs and found that MCs did not show depolarization or hyperpolarization before SWs, unlike Vm of CA1 neurons. These data indicate that the surrounding circuitry was not particularly active before SW, eliminating any concern that such unexpected preceding activity might affect our analysis. These data are shown in Figure 1 - figure supplement 2.

      In vivo, the SWs may be quite different from those in vitro, but this is not discussed. The circuitry is quite different from in vitro. The effects of urethane could have many confounding influences. Furthermore, how much the in vitro and in vivo SWs tell us about SWs in awake behaving mice is unclear.

      We agree with this point. Ideally, recording in vitro and in vivo under conditions as similar as possible would be optimal. However, as you know, patch-clamp recording from mossy cells in vivo is technically challenging, and currently, there is no alternative to conducting experiments under anesthesia. We believe that science advances not merely through theoretical discourse, but by contributing empirical data collected under existing conditions. However, as we mentioned in the paper, we believe that in vivo and in vitro SWR share some properties and a common principle of occurrence. We also observed that there are similar characteristics in the membrane potential response of MC to SWR. However, as you have pointed out, data derived from these limitations require careful interpretation, and we have explicitly stated in the paper that not only are there such problems, but that there are also common properties in the data obtained in vivo and in vitro (Page 12, Line 357).

      Also, methods and figures are hard to understand as described below.

      Thank you for all your comments. We have carefully considered the reviewers' comments and improved the text and legend. We hope you will take the time to review them.

      Reviewer #2 (Public Review):

      Thank you for the positive evaluations, which have encouraged us to resubmit this manuscript. We have revised our manuscript in accordance with your comments. Our point-by-point responses are as follows:

      • A summary of what the authors were trying to achieve

      Drawing from theoretical insights on the pivotal role of mossy cells (MCs) in pattern separation - a key process in distinguishing between similar memories or inputs - the authors investigated how MCs in the dentate gyrus of the hippocampus encode and process complex neural information. By recording from up to five MCs simultaneously, they focused on membrane potential dynamics linked to sharp wave-ripple complexes (SWRs) originating from the CA3 area. Indeed, using a machine learning approach, they were able to demonstrate that even a single MC's synaptic input can predict a significant portion (approximately 9%) of SWRs, and extrapolation suggested that synaptic input obtained from 27 MCs could account for 90% of the SWR patterns observed. The study further illuminates how individual MCs contribute to a distributed but highly specific encoding system. It demonstrates that SWR clusters associated with one MC seldom overlap with those of another, illustrating a precise and distributed encoding strategy across the MC network.

      We appreciate that this reviewer found scientific value in our manuscript. Thanks to the comments, we were pleased to be able to revise and improve the manuscript. Individual responses are listed below:

      • An account of the major strengths and weaknesses of the methods and results

      Strengths:

      (1) This study is remarkable because it establishes a critical link between the subthreshold activities of individual neurons and the collective dynamics of neuronal populations.

      (2) The authors utilize machine learning to bridge these levels of neuronal activity. They skillfully demonstrate the predictive power of membrane potential fluctuations for neuronal events at the population level and offer new insights into neuronal information processing.

      (3) To investigate sharp wave/ripple-related synaptic activity in mossy cells (MCs), the authors performed challenging experiments using whole-cell current-clamp recordings. These recordings were obtained from up to five neurons in vitro and from single mossy cells in live mice. The latter recordings are particularly valuable as they add to the limited published data on synaptic input to MCs during in vivo ripples.

      We appreciate the reviewer’s critical evaluations, which have encouraged us to revise and resubmit this manuscript. We have revised our manuscript in line with the reviewer’s comments. Our point-by-point responses are provided below:

      Weaknesses:

      (1) The model description could significantly benefit from additional details regarding its architecture, training, and evaluation processes. Providing these details would enhance the paper's transparency, facilitate replication, and strengthen the overall scientific contribution. For further details, please see below.

      Thank you for the suggestions. We have responded with model details based on the following comments.

      (2) The study recognizes the concept of pattern separation, a central process in hippocampal physiology for discriminating between similar inputs to form distinct memories. The authors refer to a theoretical paper by Myers and Scharfman (2011) that links pattern separation with activity backpropagating from CA3 to mossy cells. Despite this initial citation, the concept is not discussed again in the context of the new findings. Given the significant role of MCs in the dentate gyrus, where pattern separation is thought to occur, it would be valuable to understand the authors' perspective on how their findings might relate to or contribute to existing theories of pattern separation. Could the observed functions of MCs elucidated in this study provide new insights into their contribution to processes underlying pattern separation?

      Thank you for your valuable comment. The role of MCs in pattern separation is described in the discussion as follows:

      “It has been shown through theoretical models that MCs are a contributor to pattern separation (Myers and Scharfman, 2011). In general, the pathway of neural information is diverged from the entorhinal cortex through the larger granule cell layer and then compressed into the smaller CA3 cell layer. In this case, there is a high possibility of information loss during the transmission process. Thus, a backprojection mechanism via MCs has been proposed as a device to prevent information loss. Indeed, in theoretical models, such backprojection improves pattern separation and memory capacity, and the results are closer to experimental data than models without built-in backprojection. However, it was unclear what information individual MCs receive during backprojection. Our results show that CA3 SWR is distributed and encoded in the MC population, and that even though the number of MCs is smaller than in other regions, it is possible to reproduce about 30% of the SWR in CA3 from the membrane potential of only five MCs. Based on these results, it is believed that MCs not only play a role in preventing information loss, but also play a role in receiving some kind of newly encoded memory information in the CA3 region, and it is highly likely that the information contained in the backprojections is different from the neural information transmitted through conventional transmission pathways. Indeed, the fact that the information replayed in CA3 is reflected as SWR and propagated to each brain region suggests that the newly encoded memory information in CA3 is propagated to MC. If  backprojection simply returned the information transmitted from DG to CA3, and to MC, this would be unrealistic and extremely inefficient. However, it is still unclear what kind of memory information is actually backprojected and distributed to the MC, and how it differs from the memory information transmitted in the forward direction. 
These are open questions that need to be addressed in future experiments in awake animals.” (Page 11, Line 333)

      (3) Previous work concluded that sharp waves are associated with mossy cell inhibition, as evidenced by a consistent ripple function-related hyperpolarization of the membrane potential in these neurons when recorded at resting membrane potential (Henze & Buzsáki, 2007). In contrast, the present study reveals an SWR-induced depolarization of the membrane potential. Can the authors explain the observed modulation of the membrane potential during CA1 ripples in more detail? What was the proportion of cases of depolarization or hyperpolarization? What were the respective amplitude distributions? Were there cases of activation of the MCs, i.e., spiking associated with the ripple? This more comprehensive information would add significance to the study as it is not currently available in the literature.

      We apologize for the confusion. First, we did not state in the paper that MCs depolarized during SWRs in vivo. The following sentences have been added to the Results:

      “Previous research has shown that the hyperpolarization of MC membrane potential associated with SWR indicates that SWR is related to the inhibition of mossy cells (Henze and Buzsáki, 2007). However, our data showed that the proportion of cases of depolarization or hyperpolarization was about the same, with a slight excess of depolarization. However, it should be noted that MCs are highly active and fluctuating cells, and the determination of whether they are depolarized or hyperpolarized is highly dependent on the method of analysis. Moreover, the firing rate of MCs that we recorded was 1.07 ± 0.93 Hz (mean ± SD from 6 cells, 6 mice), and 6.68 ± 4.79% (mean ± SD from 6 cells, 6 mice, n = 757 SWR events) of all SWRs recruited MC firing (calculated as firing within 50 ms after the SWR peak). ” (Page 5, Line 143)

      (4) In the study, the observation that mossy cells (MCs) in the lower (infrapyramidal) blade of the dentate gyrus (DG) show higher predictability in SWR patterns is both intriguing and notable. This finding, however, appears to be mentioned without subsequent in-depth exploration or discussion. One wonders if this observed predictability might be influenced by potential disruptions or severed connections inherent to the brain slice preparation method used. Furthermore, it prompts the question of whether similar observations or trends have been noted in MCs recorded in vivo, which could either corroborate or challenge this intriguing in vitro finding.

      As you pointed out, one cannot rule out the possibility that this predictability is influenced by potential disruptions or disconnections inherent in the preparation of acute slices. Regarding the anatomical location of the MCs recorded in vivo, the number of cells is limited to six, because simultaneous SWR and MC patch-clamp recording is very difficult even under anesthesia. Therefore, it is difficult to find statistical significance in the current data. We have added the following text to the Discussion:

      “In addition, the finding that SWR is more predictive when the recorded location of the MC is near the lower blade of the DG is unexpected, so the possibility that this result is influenced by potential disruptions or severed connections during the preparation of the acute slice cannot be ruled out.” (Page 14, Line 405)

      (5) The study's comparison of SWR predictability by mossy cells (MCs) is complicated by using different recording sites: CA3 for in vitro and CA1 for in vivo experiments, as shown in Fig. 2. Since CA1-SWRs can also arise from regions other than CA3 (see e.g. Oliva et al., 2016, Yamamoto and Tonegawa, 2017), it is difficult to reconcile in vitro and in vivo results. Addressing this difference and its implications for MC predictability in the results discussion would strengthen the study.

      Thank you for your comment. We have added the following discussion to your comment:

      “In this study, we performed MC patch-clamp recording both in vivo and in vitro, and clarified that SWR can be predicted from Vm of MC in both cases. However, there are three caveats to the interpretation of these data. First, the in vivo SWR cannot be said to be exactly the same as the in vitro SWR: note that in vitro SWR has some similarities to in vivo SWR, such as spatial and spectral profiles and neural activity patterns (Maier et al., 2009; Hájos et al., 2013; Pangalos et al., 2013). The same concern applies to MC synaptic inputs. The in vivo Vm data may contain more information compared to the in vitro single MC data, because the entire projections that target MCs are intact, resulting in a complete set of synaptic inputs related to SWR activity, as opposed to slices where connections are severed. While we recognize these differences, it is also very likely that there are common ways of expressing information. Second, since the in vivo LFP recordings were obtained from the CA1 region, it is possible that the CA1-SWR receives input from the CA2 region (Oliva et al., 2016) and the entorhinal cortex (Yamamoto and Tonegawa, 2017). In addition, urethane anesthesia has been observed to reduce subthreshold activity, spike synchronization, and SWR (Yagishita et al., 2020), making it difficult to achieve complete agreement with in vitro SWR recorded from the CA3 region. Finally, although we were able to record MC Vm during in vivo SWR in this study, the in vivo data set consisted of recordings from a single MC, in contrast to the in vitro dataset. To perform the same analysis as in the in vitro experiment, it would be desirable to record LFPs from the CA3 region and collect data from multiple MCs simultaneously, but this is technically very difficult.
      In this study, it was difficult to directly clarify the consistency between CA3 network activity and in vivo MC synaptic input, but the fact that the SWR waveform can be predicted from in vivo MC Vm in CA1-SWR may be the result of some CA3 network activity being reflected in CA1-SWR. It is undeniable that more accurate predictions would have been possible if it had been possible to record LFP from the CA3 regions in vivo.” (Page 12, Line 357)

      • An appraisal of whether the authors achieved their aims, and whether the results support their conclusions

      As outlined in the abstract and introduction, the primary aim is to investigate the role of MCs in encoding neuronal information during sharp wave ripple complexes, a crucial neuronal process involved in memory consolidation and information transmission in the hippocampus. It is clear from the comprehensive details in this study that the authors have meticulously pursued their goals by providing extensive experimental evidence and utilizing innovative machine learning techniques to investigate the encoding of information in the hippocampus by mossy cells (MCs). Together, this study provides a compelling account supported by rigorous experimental and analytical methods. Linking subthreshold membrane potentials and population activity by machine learning provides a comprehensive new analytic approach and sheds new light on the role of MCs in information processing in the hippocampus. The study not only achieves the stated goals, but also provides novel methodology, and valuable insights into the dynamics of neural coding and information flow in the hippocampus.

      We appreciate the reviewer’s critical evaluations, which have encouraged us to revise and resubmit this manuscript. We have revised our manuscript in line with the reviewer’s comments.

      • A discussion of the likely impact of the work on the field, and the utility of the methods and data to the community

      Impact: Both the novel methodology and the provided biological insights will be of great interest to the community.

      Utility of methods/data: The applied deep learning approach will be of particular interest if the authors provide more details to improve its reproducibility (see related suggestions below).

      We appreciate that this reviewer found scientific value in our manuscript, and we thank the reviewer for the comments.

      Reviewer #3 (Public Review):

      We appreciate that this reviewer raised several important issues. We are pleased to have been able to revise the paper into a better manuscript based on these comments. Individual responses are listed below:

      Compared to the pyramidal cells of the CA1 and CA3 regions of the hippocampus, and the granule cells of the dentate gyrus (DG), the computational role(s) of mossy cells of the DG have received much less attention over the years and are consequently not well understood. Mossy cells receive feedforward input from granule cells and feedback from CA3 cells. One significant factor is the compression of the large number of CA3 cells that input onto a much smaller population of mossy cells, which then send feedback connections to the granule cell layer. The present paper seeks to understand this compression in terms of neural coding, and asks whether the subthreshold activity of a small number of mossy cells can predict above chance levels the shapes of individual SWs produced by the CA3 cells. Using elegant multielectrode intracellular recordings of mossy cells, the authors use deep learning networks to show that they can train the network to "predict" the shape of a SW that preceded the intracellular activity of the mossy cells. Putatively, a single mossy cell can predict the shape of SWs above chance. These results are interesting, but there are some conceptual issues and questions about the statistical tests that must be addressed before the results can be considered convincing.

      We appreciate that this reviewer found scientific value in our manuscript. Thanks to the comments, we were pleased to be able to revise and improve the manuscript. Individual responses are listed below:

      Strengths

      (1) The paper uses technically challenging techniques to record from multiple mossy cells at the same time, while also recording SWs from the LFP of the CA3 layer. The data appear to be collected carefully and analyzed thoughtfully.

      (2) The question of how mossy cells process feedback input from CA3 is important to understand the role of this feedback pathway in hippocampal processing.

      3) Given the concerns expressed below about proper statistical testing are resolved, the data appear supportive of the main conclusions of the authors and suggest that, to some degree, the much smaller population of mossy cells can conserve the information present in the larger population of CA3 cells, presumably by using a more compressed, dense population code.

      We appreciate the reviewer’s critical evaluations, which have encouraged us to revise and resubmit this manuscript. We have revised our manuscript in line with the reviewer’s comments. Our point-by-point responses are provided below:

      Weaknesses

      4) Some of the statistical tests appear inappropriate because they treat each CA3 SW and associated Vm from a mossy cell as independent samples. This violates the assumptions of statistical tests such as the Kolmogorov-Smirnov tests of Figure 3C and Fig 3E. Although there is large variability among the SWs recorded and among the Vm's, they cannot be considered independent measurements if they derive from the same cell and same recording site of an individual animal. This becomes especially problematic when the number of dependent samples adds up to the tens of thousands, providing highly inflated numbers of samples that artificially reduce the p values. Techniques such as mixed-effects models are being increasingly used to factor out the effects of within cell and within animal correlations in the data. The authors need to do something similar to factor out these contributions in order to perform statistical tests, throughout the manuscript when this problem occurs.

      Thank you for the insightful comment. As for the correlation between the animals, since they were brought in at the same age and kept in the same environment, we do not think it is necessary to account for the differences due to environmental factors. As the reviewer pointed out, we cannot completely rule out the possibility that within-cell or within-animal correlation might influence the results, so we plotted the differences in prediction accuracy between cells, slices, and animals (Figure 3 - figure supplement 7). The results showed that the prediction accuracy of the real data was better than that of the shuffled data in 66 of the 87 MCs (75.9%). In response to the comment that measurements from the same animal do not constitute independent samples, we calculated the average ΔRMSE for each mouse and found that these values were significantly different from 0 (n = 14, *p = 0.0041, Student’s t-test). In other words, even if each animal is considered an independent sample, it is possible to obtain statistically significant differences.
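
      The per-animal analysis described in this response can be sketched as follows: pooled measurements are first collapsed to one mean per animal, and the test is then run on those means, so that the sample size equals the number of animals rather than the number of events. This is an illustrative sketch only; the record layout, the animal labels, and the ΔRMSE values are made up, and the helper names are hypothetical. Only the t statistic and degrees of freedom are computed here; converting them to a p-value requires the t distribution from a statistics library.

```python
import math
import statistics
from collections import defaultdict

def per_animal_means(records):
    """Collapse (animal_id, delta_rmse) pairs to one mean value per animal."""
    by_animal = defaultdict(list)
    for animal_id, value in records:
        by_animal[animal_id].append(value)
    return {animal: statistics.fmean(vals) for animal, vals in by_animal.items()}

def one_sample_t(values, mu0=0.0):
    """One-sample Student's t statistic against mu0, with its degrees of freedom."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)          # sample SD (n - 1 in the denominator)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1

# Made-up example: several measurements per animal, four animals.
records = [
    ("m1", 0.10), ("m1", 0.30), ("m1", 0.20),
    ("m2", 0.05), ("m2", 0.15),
    ("m3", 0.40),
    ("m4", -0.10), ("m4", 0.10),
]
animal_means = per_animal_means(records)
t_stat, dof = one_sample_t(list(animal_means.values()))
```

      Here the eight measurements contribute only four independent values, so the degrees of freedom are 3 rather than 7, which is the point raised by the reviewer about inflated sample sizes.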

      5) A separate statistical problem occurs when comparing real data against a shuffled, surrogate data set. From the methods, I gather that Figure 3C combined data from 100 surrogate shuffles to compare to the real data. It is inappropriate to do a classic statistical test of data against such shuffles, because the number of points in the pooled surrogate data sets are not true samples from a population. It is a mathematical certainty that one can eventually drive a p value to < 0.05 just by increasing the number of shuffles sufficiently. Thus, the p value is determined by the number of computer shuffles allowed by the time and processing power of a computer, rather than by sampling real data from the population. Figures such as 4C and 5A are examples that test data against shuffle appropriately, as a single value is determined to be within or outside the 95% confidence interval of the shuffle, and this determination is not directly affected by the number of shuffles performed.

      Thank you for raising a very good point. We understand the reviewer's comments, but we cannot fully agree with the statement that "It is a mathematical certainty that one can eventually drive a p value to < 0.05 just by increasing the number of shuffles sufficiently". When the data being compared do not differ at all, no amount of shuffling will produce a significant difference. We do agree, however, that increasing the number of shuffles will lower the p-value when the data differ even slightly. Based on the reviewer's comments, we used a paired t-test to test whether the difference between RMSEreal and RMSEsurrogate was significantly different from 0, and found that it was (Figure 3 - figure supplement 5). Even when a paired t-test was used, as in Figure 3E, a significant difference in the prediction error of the real and shuffled data was observed for all MC number inputs and also for the in vivo data.
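
      The alternative the reviewer describes, in which a single real value is located within the distribution of shuffled values, can be sketched as below. In this scheme the number of shuffles sets only the resolution of the null distribution and is never treated as a sample size. This is an illustration under stated assumptions, not the analysis in the manuscript: the function names are hypothetical, the surrogate RMSE values are simulated, and lower RMSE is assumed to mean better prediction.

```python
import random

def empirical_p(real_value, surrogate_values):
    """One-sided empirical p-value: fraction of surrogates whose error is at
    least as low as the real error, with the usual +1 correction."""
    n_as_good = sum(1 for s in surrogate_values if s <= real_value)
    return (n_as_good + 1) / (len(surrogate_values) + 1)

def percentile_interval(values, lower=0.025, upper=0.975):
    """Central 95% interval of the surrogate distribution (nearest-rank style)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]

# Simulated surrogate errors (assumed values) and one real prediction error.
rng = random.Random(1)
surrogate_rmse = [rng.gauss(1.0, 0.05) for _ in range(1000)]
real_rmse = 0.80

p_value = empirical_p(real_rmse, surrogate_rmse)
low, high = percentile_interval(surrogate_rmse)
better_than_shuffle = real_rmse < low   # real error falls below the 95% interval
```

      With this construction the smallest attainable p-value is 1 / (number of shuffles + 1), so adding shuffles refines the estimate but cannot by itself create significance when the real value lies inside the surrogate distribution.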

      6) The last line of the Discussion states that this study provides "important insights into the information processing of neural circuits at the bottleneck layer," but it is not clear what these insights are. If the statistical problems are addressed appropriately, then the results do demonstrate that the information that is reflected in SWs can be reconstructed by cells in the MC bottleneck, but it is not certain what conceptual insights the authors have in mind. They should discuss more how these results further our understanding of the function of the feedback connection from CA3 to the mossy cells, discuss any limitations on their interpretation from recording LFPs rather than the single-unit ensemble activity (where the information is really encoded).

      Thank you for your insightful comment. We have added the following text to the discussion:

      “Given that different SWRs may encode information that correlates with different experiences, it is also possible that the activity of individual MCs may play a role in encoding different experiences via SWRs. Indeed, several in vivo studies have confirmed that MC activity is involved in spatial encoding (Bui et al., 2018; Huang et al., 2024). However, the relationship with SWRs has not been investigated. The significance of the fact that the SWR recorded from CA3 is reflected in the MC as synaptic input is that it not only shows the transmission pathway from CA3 to MC, but also reveals the subthreshold information that leads to firing and, in a broad sense, approaches the mechanism of information processing by neuronal firing. Moreover, the expression of synaptic input to the MC is not uniform, but varies according to the pattern of SWR. Based on previous research showing that diversity is important for information representation (Padmanabhan and Urban, 2010; Tripathy et al., 2013), it is possible that this heterogeneity in membrane potential levels, rather than the all-or-none output of neuronal firing activity, is the key to encoding more precise information. In this respect, our research, which focuses on information encoding at the subthreshold level, may be able to extract even more information than is encoded by firing activity.” (Page 14, Line 419)

      7) In Figure 1C, the maximum of the MC response on the first inset precedes the SW, and the onset of the Vm response may be simultaneous with SW. This would suggest that the SW did not drive the mossy cell, but this was a coincident event. How many SW-mossy cell recordings are like this? Do the authors have a technical reason to believe that these are events in which the mossy cell is driven by the CA3 cells active during the SW?

      Thank you for your insightful comment. Based on your comment, we have aligned all the MC EPSPs for each SWR onset and found that the EPSPs rise after the SWR onset (Figure 1 - figure supplement 2). This leads us to believe that the EPSP of the MC is most likely driven by the SWR.

    1. eLife Assessment

      The authors describe an approach to construct hybrid neuraminidase molecules that express epitopes (loops) of a specific neuraminidase grafted onto another neuraminidase. The loops (epitopes) are from low-expressing neuraminidases and the scaffold is derived from a high-expressing neuraminidase. This paper is an important contribution giving new insights into the structure, function, and immunogenicity of influenza virus neuraminidases. The paper presents convincing evidence supporting the conclusions arrived at by the authors.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript described a structure-guided approach to graft important antigenic loops of the neuraminidase to a homotypic but heterologous NA. This approach allows the generation of well-expressed and thermostable recombinant proteins with antigenic epitopes of choice to some extent. The loop-grafted NA was designated hybrid.

      Strengths:

      The hybrid NA appeared to be more structurally stable than the loop-donor protein while acquiring its antigenicity. This approach is of value when developing a subunit NA vaccine that is difficult to express, because antigenic loops could potentially be grafted onto a stable NA scaffold to transfer strain-specific antigenicity.

      Weaknesses:

      However, major revisions are needed to better organize the text and figures and to clarify a number of points. There are a few cases in which a later figure was described first, data in the figures were not sufficiently described, or there were mismatched references to figures.

      More importantly, the hybrid proteins did not show any of the advantages over the loop-donor protein in the format of VLP vaccine in mouse studies, so it's not clear why such an approach is needed to begin with if the original protein is doing fine.

    3. Reviewer #2 (Public review):

      In their manuscript, Rijal and colleagues describe a 'loop grafting' strategy to enhance expression levels and stability of recombinant neuraminidase. The work is interesting and important, but there are several points that need the authors' attention.

      Major points

      (1) The authors overstress the importance of the epitopes covered by the loops they use and play down the importance of antibodies binding to the side, the edges, or the underside of the NA. A number of papers describing those mAbs are also not included.

      (2) The rationale regarding the PR8 hybrid is not well described and should be described better.

      (3) Figure 3B and 6C: This should be given as numbers (quantified), not as '+'.

      (4) Figure 5A and 7A: Negative controls are missing.

      (5) The authors claim that they generate stable tetramers. Judging from the SDS-PAGE provided in Supplementary Figure 3B (BS3-crosslinked), many different species are present, including monomers, dimers, tetramers, and degradation products of tetramers. In lane 7, for example, there are at least 5 bands.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript described a structure-guided approach to graft important antigenic loops of the neuraminidase to a homotypic but heterologous NA. This approach allows the generation of well-expressed and thermostable recombinant proteins with antigenic epitopes of choice to some extent. The loop-grafted NA was designated hybrid.

      Strengths:

      The hybrid NA appeared to be more structurally stable than the loop-donor protein while acquiring its antigenicity. This approach is of value when developing a subunit NA vaccine that is difficult to express, because antigenic loops could potentially be grafted onto a stable NA scaffold to transfer strain-specific antigenicity.

      Weaknesses:

      However, major revisions are needed to better organize the text and figures and to clarify a number of points. There are a few cases in which a later figure was described first, data in the figures were not sufficiently described, or references to figures were mismatched.

      More importantly, the hybrid proteins did not show any of the advantages over the loop-donor protein in the format of VLP vaccine in mouse studies, so it's not clear why such an approach is needed to begin with if the original protein is doing fine.

      We thank the reviewer for their helpful comments. We have incorporated feedback from the reviewers to improve the manuscript. Please see our point-by-point response.

      The purpose of loop-grafting between H5N1/2021 (a high-expressor) and the PR8 virus was not to improve the expression of PR8, which is already a well-expressing NA. Instead, the loop-grafting and the in vivo experiments were done to show loop-specific protection following a lethal PR8 virus challenge.

      Reviewer #2 (Public review):

      In their manuscript, Rijal and colleagues describe a 'loop grafting' strategy to enhance expression levels and stability of recombinant neuraminidase. The work is interesting and important, but there are several points that need the authors' attention.

      Major points

      (1) The authors overstress the importance of the epitopes covered by the loops they use and play down the importance of antibodies binding to the side, the edges, or the underside of the NA. A number of papers describing those mAbs are also not included.

      We have discussed the distribution of epitopes on the NA molecule in the Discussion section "The distribution of epitopes in neuraminidase" (new line number 350). In Supplementary Figures 1 and 2, we have compiled the epitopes reported for polyclonal sera and mAbs via escape virus selection or crystal structural studies. There are 45 examples of residues identified by escape virus selection, and we found that approximately 90% of the epitopes are located within the top loops (Loops 01 and Loops 23, which include the lateral sides and edges of NA). We have also included the epitopes of underside mAbs NDS.1 and NDS.3 in Supplementary Figure 2. Some of the interactions formed by these mAbs are also within the L01 and L23 loops. All relevant references are cited in Supplementary Figures 1 and 2.

      A new figure has been added [Figure 1b (ii)] to illustrate the surface mapping of epitopes on NA.

      (2) The rationale regarding the PR8 hybrid is not well described and should be described better.

      We described the rationale for the PR8 hybrid (new lines 247-250). For clarity, we have added the following sentence within the section "Loop transfer between two distant N1 NAs:...."

      (new lines 255-258):

      "mSN1 showed sufficient cross-reactivity to N1/09 to protect mice against virus challenge. Therefore, we performed loop transfer between mSN1 and PR8N1, which differ by 18 residues within the L01 and L23 loops and show no or minimal cross-reactivity, to assess the loop-specific protection."

      (3) Figure 3B and 6C: This should be given as numbers (quantified), not as '+'.

      We have included the numerical data in Supplementary Figure 6. The data are presented in a semi-quantitative manner for simplification. To improve clarity, we have now added the following sentence to the Figure 3c legend: "Refer to Supplementary Figure 6 for binding titration data".

      (4) Figure 5A and 7A: Negative controls are missing.

      A pool of empty VLP sera was included as a negative control and showed no inhibition at a 1:40 dilution. In the figure legends, we have stated: "Pooled sera to unconjugated mi3 VLP was negative control and showed no inhibition at 1:40 dilution (not included in the graphs)".

      (5) The authors claim that they generate stable tetramers. Judging from SDS-PAGE provided in Supplementary Figure 3B (BS3-crosslinked), many different species are present including monomers, dimers, tetramers, and degradation products of tetramers. In line 7 for example there are at least 5 bands.

      The tetrameric conformation of the soluble proteins is evidenced by the size-exclusion chromatographs shown in Figures 3a and 6b. The BS3-crosslinked SDS-PAGE data are only suggestive, indicating that the protein is a tetramer if a band appears at ~250 kDa. However, depending on the reaction conditions, lower molecular weight bands may also be observed if crosslinking is incomplete.

    1. eLife Assessment

      This manuscript presents a potentially important strategy for stimulating mammalian Müller glia to proliferate in vivo by manipulating cell cycle components. The results are convincing that a large number of Müller glia can be induced to re-enter the cell cycle without a damage stimulus. These findings are likely to appeal to retinal biologists and neuroscientists in general.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Wu et al. introduce a novel approach to reactivate the Müller glia cell cycle in the mouse retina by simultaneously reducing p27Kip1 and increasing cyclin D1 using a single AAV vector. The approach effectively promotes Müller glia proliferation and reprogramming without disrupting retinal structure or function. Interestingly, reactivation of the Müller glia cell cycle downregulates the IFN pathway, which may contribute to the induced retinal regeneration. The results presented in this manuscript may offer a promising approach for developing Müller glia cell-mediated regenerative therapies for retinal diseases.

      Comments on revisions:

      The authors have revised the manuscript and addressed my concerns.

    3. Reviewer #2 (Public review):

      This manuscript by Wu, Liao et al. reports that simultaneous knockdown of p27Kip1 with overexpression of Cyclin D can stimulate Müller glia to re-enter the cell cycle in the mouse retina. There is intense interest in reprogramming mammalian Müller glia into a source for neurogenic progenitors, in the hopes that these cells could be a source for neuronal replacement in neurodegenerative diseases. Previous work in the field has shown ways in which mouse Müller glia can be neurogenically reprogrammed and these studies have shown cell cycle re-entry prior to neurogenesis. In other work, typically, the extent of glial proliferation is limited, and the authors of this study highlight the importance of stimulating large numbers of Müller glia to re-enter the cell cycle with the hopes they will differentiate into neurons.

      The authors have satisfactorily responded to all my previous reviewer comments. The authors have significantly improved their imaging quality in Figures 1 and 4. The authors have admirably reconsidered their FISH and scRNA-seq data and performed critical control experiments. They now provide a more nuanced interpretation of their data by removing reference to MG-inducing rod genes, which is now interpreted as ambient contamination. Taken together, this manuscript now provides strong evidence of a viral way to induce large numbers of MG to re-enter the cell cycle without a damage stimulus.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, Wu et al. introduce a novel approach to reactivate the Müller glia cell cycle in the mouse retina by simultaneously reducing p27Kip1 and increasing cyclin D1 using a single AAV vector. The approach effectively promotes Müller glia proliferation and reprogramming without disrupting retinal structure or function. Interestingly, reactivation of the Müller glia cell cycle downregulates the IFN pathway, which may contribute to the induced retinal regeneration. The results presented in this manuscript may offer a promising approach for developing Müller glia cell-mediated regenerative therapies for retinal diseases.

      Strengths:

      The data are convincing and supported by appropriate, validated methodology. These results are both technically and scientifically exciting and are likely to appeal to retinal specialists and neuroscientists in general.

      Weaknesses:

      There are some data gaps that need to be addressed.

      (1) Please label the time points of AAV injection, EdU labeling, and harvest in Figure 1B.

      We thank the reviewer for highlighting the lack of clarity in our experimental design. We have labeled all experiment timelines in the figures where appropriate in the revised version.

      (2) What fraction of Müller cells were transduced by AAV under the experimental conditions?

      We apologize for not clearly explaining the AAV transduction efficiency. AAV transduction efficiency was not uniform across the retinas. The retinal region adjacent to the optic nerve exhibits a transduction efficiency of nearly 100%. In contrast, the peripheral retina shows a lower transduction efficiency compared to the central region. Representative retinal sections with a typical infection pattern are shown in Supplementary figure 4. The quantification of EdU+ MG or other markers was conducted in a 250 µm region with the highest efficiency. For the scRNA-seq experiment, retinal regions with high AAV transduction efficiency were dissected with the aid of a control GFP virus.

      (3) It seems unusually rapid for MG proliferation to begin as early as the third day after CCA injection. Can the authors provide evidence for cyclin D1 overexpression and p27 Kip1 knockdown three days after CCA injection?

      We included data showing that GFP expression is evident at 3 days post AAV-GFAP-GFP injection (Supplementary Fig. 1B). Additionally, we performed immunostaining and confirmed cyclin D1 overexpression at 3 days post CCA injection (Fig. 2E), as well as qPCR analysis to confirm cyclin D1 overexpression and p27kip1 knockdown at the same time point (Supplementary Fig. 5).

      (4) The authors reported that MG proliferation largely ceased two weeks after CCA treatment. While this is an interesting finding, the explanation that it might be due to the dilution of AAV episomal genome copies in the dividing cells seems far-fetched.

      We agree with the reviewer that dilution of AAV episomal genomes is unlikely to be the sole reason for the cessation of MG proliferation. By staining cyclin D1 at various days post CCA injection, we found that cyclin D1 is immediately downregulated in the mitotic MG undergoing interkinetic nuclear migration to the outer nuclear layer (Fig. 2G-I). In contrast, the effect of p27kip1 knockdown by CCA lasted longer (Supplementary Figures 9-10). It is possible that other anti-proliferative genes are involved in the immediate downregulation of cyclin D1.

      Reviewer #2 (Public Review):

      This manuscript by Wu, Liao et al. reports that simultaneous knockdown of p27Kip1 with overexpression of Cyclin D can stimulate Müller glia to re-enter the cell cycle in the mouse retina. There is intense interest in reprogramming mammalian Müller glia into a source for neurogenic progenitors, in the hopes that these cells could be a source for neuronal replacement in neurodegenerative diseases. Previous work in the field has shown ways in which mouse Müller glia can be neurogenically reprogrammed and these studies have shown cell cycle re-entry prior to neurogenesis. In other work, typically, the extent of glial proliferation is limited, and the authors of this study highlight the importance of stimulating large numbers of Müller glia to re-enter the cell cycle with the hopes they will differentiate into neurons. While the evidence for stimulating proliferation in this study is convincing, the evidence for neurogenesis in this study is not convincing or robust, suggesting that stimulating cell cycle re-entry may not be associated with increasing regeneration without another proneural stimulus.

      Below are concerns and suggestions.

      Intro:

      (1) The authors cite past studies showing "direct conversion" of MG into neurons. However, these studies (PMID: 34686336; 36417510) show EdU+ MG-derived neurons suggesting cell cycle re-entry does occur in these strategies of proneural TF overexpression.

      We thank the reviewer for pointing this out. We have revised the statement to "MG reprogramming".

      (2) Multiple citations are incorrectly listed, using the authors' first names only (i.e. Yumi, et al; Levi, et al;). Studies are also incompletely referenced in the references.

      We apologize for the mistakes in the references. We have corrected them in the revised version.

      Figure 1:

      (3) When are these experiments ending? On Figure 1B it says "analysis" on the end of the paradigm without an actual day associated with this. This is the case for many later figures too. The authors should update the paradigms to accurately reflect experimental end points.

      We thank the reviewer for highlighting the lack of clarity in our experimental design. We have labeled all experiment timelines in the figures where appropriate in the revised version.

      (4) Are there better representative pictures between P27kd and CyclinD OE, the EdU+ counts say there is a 3 fold increase between Figure 1D&E, however the pictures do not reflect this. In fact, most of the Edu+ cells in Figure 1E don't seem to be Sox9+ MG but rather horizontally oriented nuclei in the OPL that are likely microglia.

      We thank the reviewer for pointing this out. We have replaced the image of the cyclin D1 OE retina with a more representative image.

      (5) Is the infection efficacy of these viruses different between different combinations (i.e. CyclinD OE vs. P27kd vs. control vs. CCA combo)? As the counts are shown in Figure 1G only Sox9+/Edu+ cells are shown not divided by virus efficacy. If these are absolute counts blind to where the virus is and how many cells the virus hits, if the virus efficacy varies in efficiency this could drive absolute differences that aren't actually biological.

      To rule out the possibility that the differences in MG proliferation across groups are due to variations in viral efficacy, we examined the p27kip1 knockdown and cyclin D1 overexpression efficiencies for all four groups by qPCR analysis. The results showed that the cyclin D1 overexpression efficiency of the AAV-GFAP-Cyclin D1 virus alone and the p27kip1 knockdown efficiency of AAV-GFAP-mCherry-p27kip1 shRNA1 are comparable to, if not higher than, those of the CCA virus (Supplementary Fig 5). Therefore, virus efficacy cannot explain the drastic increase in MG proliferation by CCA.

      As the central retina usually had 100% infection efficacy (Supplementary Fig. 4), we quantified the EdU+Sox9+ cell number in the 250 µm regions next to the optic nerve.

      (6) According to the Jax laboratories, mice aren't considered aged until they are over 18 months old. While it is interesting that CCA treatment does not seem to lose efficacy over maturation, I would rephrase the findings, as the experiment does not test this virus in aged retinas.

      We thank the reviewer for bringing this to our attention. We have changed the wording to “older adult mice” in our revised manuscript.

      (7) Supplemental Figure 2c-d. These viruses do not hit 100% of MG, however 100% of the P27Kip staining is gone in the P27sh1 treatment, even the P27+ cell in the GCL that is likely an astrocyte has no staining in the shRNA 1 picture. Why is this?

      We have replaced the images in Supplementary Fig. 2B-D.

      Figure 2

      (8) Would you expect cells to go through two rounds of cell cycle in such a short time? The treatment of giving Edu then BrdU 24 hours later would have to catch a cell going through two rounds of division in a very short amount of time. Again the end point should be added graphically to this figure.

      We thank the reviewer for the comment. We repeated the EdU/BrdU colabelling experiment with extended periods of EdU/BrdU injections. Based on the result of the MG proliferation time course study (Fig. 2A), we administered five EdU injections from D1 to D5 and five BrdU injections from D6 to D10 post-CCA injection, which covered the major phase of MG proliferation (Fig. 2B-C). Consistent with the previous findings, we did not observe any BrdU/EdU double-positive MG cells.

      Additionally, we showed that cyclin D1 overexpression immediately ceased in migrating mitotic MG (Fig. 2G-I), which may explain why CCA-treated MG do not progress to the second round of cell division.

      Figure 3

      (9) I am confused by the mixing of ratios of viruses to indicate infection success. I know mixtures of viruses containing CCA or control GFP or a control LacZ was injected. Was the idea to probe for GFP or LacZ in the single cell data to see which cells were infected but not treated? This is not shown anywhere?

      The virus infection was not uniform across the entire retina (Supplementary Fig. 4). To mark the infection hotspots, we added 10% GFP virus to the mixture. Regions of the retina with low infection efficiency were removed by dissection and excluded from the scRNA-seq analysis. Therefore, we assumed that the vast majority of MG were infected by CCA. We apologize for not clearly explaining this methodological detail in the original text. We have added the experimental design to Fig. 3A and revised the Results section (lines 191-196) accordingly.

      (10) The majority of glia sorted from TdTomato are probably not infected with virus. Can you subset cells that were infected only for analysis? Otherwise it makes it very hard to make population judgements like Figure 3E-H if a large portion are basically WT glia.

      This question is related to the last one. Since the regions with high virus infection efficiency were selectively dissected and isolated for analysis, the CCA-infected MG should constitute the vast majority of MG in the scRNA-seq data.

      (11) Figure 3C you can see Rho is expressed everywhere which is common in studies like this because the ambient RNA is so high. This makes it very hard to talk about "Rod-like" MG as this is probably an artifact from the technique. Most all scRNA-seq studies from MG-reprogramming have shown clusters of "rods" with MG hybrid gene expression and these had in the past just been considered an artifact.

      We agree with the reviewer that the high rod gene expression in the rod-MG cluster is an artifact. We have performed multiple rounds of RNA in situ hybridization on isolated MG nuclei. The counts of Gnat1 and Rho mRNA signals largely overlapped between the samples with and without CCA treatment (Supplementary Fig 14). Some MG in the control retinas without CCA treatment had up to 7 or 8 dots per cell, suggesting contamination by attached rod cell debris during retina dissociation (Supplementary Fig 14). Therefore, the results did not support the conclusion that rod-MG is a reprogrammed MG population with rod gene upregulation.

      (12) It is mentioned the "glial" signature is downregulated in response to CCA treatment. Where is this shown convincingly? Figure H has a feature plot of Glul, which is not clear it is changed between treatments. Otherwise MG genes are shown as a function of cluster not treatment.

      We have added box plots of several MG-specific genes to illustrate the downregulation of the glial signature in the relevant cell cluster in the revised manuscript (Supplementary Fig. 15).

      Figure 4

      (13) The authors should be commended for being very careful in their interpretations. They employ the proper controls (Er-Cre lineage tracing/EdU-pulse chasing/scRNA-seq omics) and were very careful to attempt to see MG-derived rods. This makes the conclusion from the FISH perplexing. The few puncta dots of Rho and GNAT in MG are not convincing to this reviewer, Rho and GNAT dots are dense everywhere throughout the ONL and if you drew any random circle in the ONL it would be full of dots. The rigor of these counts also comes into question because some dots are picked up in MG in the INL even in the control case. This is confusing because baseline healthy MG do not express RNA-transcripts of these Rod genes so what is this picking up? Taken together, the conclusion that there are Rod-like MG are based off scRNA-seq data (which is likely ambient contamination) and these FISH images. I don't think this data warrants the conclusion that MG upregulate Rod genes in response to CCA.

      Given the results of RNA in situ hybridization on isolated MG, we revisited the results of the RNA in situ hybridization on retinal sections as well. We performed RNA in situ hybridization on retinal sections at 1 week post CCA treatment, expecting to see lower Gnat1 and Rho signals in the ONL-localizing MG compared to 3 weeks and 4 months post CCA treatment. However, we observed similar levels across all three time points (data not shown). The lack of dynamic changes in rod gene expression levels also suggests contamination from tightly surrounding neighboring rods. Consequently, we have reinterpreted the scRNA-seq and RNA FISH data and withdrawn the conclusion that MG upregulated rod genes after CCA treatment. We thank the reviewer for pointing out this potential issue and helping us avoid an incorrect conclusion.

      Figure 5

      (14) Similar point to above but this Glul probe seems odd, why is it throughout the ONL but completely dark through the IPL, this should also be in astrocytes can you see it in the GCL? These retinas look cropped at the INL where below is completely black. The whole retinal section should be shown. Antibodies exist to GS that work in mouse along with many other MG genes, IHC or western blots could be done to better serve this point.

      We have replaced the images in Figure 4 in the revised manuscript. Additionally, we have performed Sox9 antibody staining to demonstrate partial MG dedifferentiation following CCA treatment (Figure 5).

      Figure 6

      (15) Figure 6D is not a co-labeled OTX2+/ TdTomato+ cell, Otx2 will fill out the whole nucleus as can be seen with examples from other MG-reprogramming papers in the field (Hoang, et al. 2020; Todd, et al. 2020; Palazzo, et al. 2022). You can clearly see in the example in Figure 6D the nucleus extending way beyond Otx2 expression as it is probably overlapping in space. Other examples should be shown, however, considering less than 1% of cells were putatively Otx2+, the safer interpretation is that these cells are not differentiating into neurons. At least 99.5% are not.

      We have replaced the image of the Otx2+ Tdt+ EdU+ cell with one that shows the whole nucleus filled with strong Otx2 staining.

      (16) Same as above: Figure 6I is not convincingly co-labeled. HuC/D is an RNA-binding protein and unfortunately is not always the clearest stain, but this looks like background haze in the INL overlapping. Other amacrine markers could be tested, but again due to the very low numbers, I think no neurogenesis is occurring.

      Since we didn’t find HuC/D+Tdt+EdU+ cells at 3 weeks post CCA treatment, we believe that the weak HuC/D+ staining in the MG daughter cells at 4 months is not background, but rather reflects an incomplete neurogenic switch. This suggests that the process of neurogenesis may be ongoing but not fully realized within the observed timeframe without additional stimuli.

      (17) In the text the authors are accidentally referring to Figure 6 as Figure 7.

      We thank the reviewer for pointing out the mistake. We will correct the mistake in the revised manuscript.

      Figure 7

      (18) I like this figure and the concept that you can have additional MG proliferating without destroying the retina or compromising vision. This is reminiscent of the chick MG reprogramming studies in which MG proliferate in large numbers and often do not differentiate into neurons yet still persist de-laminated for long time points.

      General:

      (19) The title should be changed, as I don't believe there is any convincing evidence of regeneration of neurons. Understanding the barriers to MG cell-cycle re-entry are important and I believe the authors did a good job in that respect, however it is an oversell to report regeneration of neurons from this data.

      We thank the reviewer for the suggestion. We have changed the title to “Simultaneous cyclin D1 overexpression and p27kip1 knockdown enable robust Müller glia cell cycle reactivation in uninjured mouse retina” in the revised manuscript.

      (20) This paper uses multiple mouse lines and it is often confusing when the text and figures switch between models. I think it would be helpful to readers if the mouse strain was added to graphical paradigms in each figure when a different mouse line is employed.

      We have labeled the mouse lines used in each experiment in the figures where appropriate.

    1. eLife Assessment

      This study provides valuable insight into the role of Meis2 in whisker hair follicle formation and confirms prior work that nerves are dispensable for this process. The solid imaging techniques support the authors' conclusions; however, the data provide limited evidence to support the mechanism of Meis2 in whisker formation.

    2. Reviewer #1 (Public review):

      Summary:

      Mehmet Mahsum Kaplan et al. demonstrate that Meis2 expression in neural crest-derived mesenchymal cells is crucial for whisker follicle (WF) development, as WF fails to develop in wnt1-Cre;Meis2 cKO mice. Advanced imaging techniques effectively support the idea that Meis2 is essential for proper WF development and that nerves, while affected in Meis2 cKO, are dispensable for WF development and not the primary cause of WF developmental failure. The study also reveals that although Meis2 significantly downregulates Foxd1 in the mesenchyme, this is not the main reason for WF development failure. The paper presents valuable data on the role of mesenchymal Meis2 in WF development. However, it is still not known what molecular mechanisms link Meis2 to its impact on the epithelial compartment.

      Strengths:

      (1) The authors describe a novel molecular mechanism involving Mesenchymal Meis2 expression, which plays a crucial role in early WF development.

      (2) They employ multiple advanced imaging techniques to illustrate their findings beautifully.

      (3) The study clearly shows that nerves are not essential for WF development.

      Weaknesses:

      The paper lacks clarity on how Meis2 loss, along with the observed general reduction in proliferation and changes in extracellular matrix and cell adhesion, leads specifically to the loss of whisker follicles. Future studies addressing this gap, perhaps with methods enabling higher cell recovery or epithelial cell inclusion in the sequenced cells, could provide valuable insights into the specific roles of Meis2 in this context.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Kaplan et al. study mesenchymal Meis2 in whisker formation and the links between whisker formation and sensory innervation. To this end, they used conditional deletion of Meis2 using the Wnt1 driver. Whisker development was arrested at the placode induction stage in Meis2 conditional knockouts leading to absence of expression of placodal genes such as Edar, Lef1, and Shh. The authors also show that branching of trigeminal nerves innervating whisker follicles was severely affected but that whiskers did form in the complete absence of trigeminal nerves.

      Strengths:

      The analysis of Meis2 conditional knockouts shows convincingly lack of whisker formation and all epithelial whisker/hair placode markers analyzed. Using Neurog1 knockout mice, the authors show that whiskers and teeth develop in the complete absence of trigeminal nerves.

      Comments on revised version:

      In the revised manuscript, Kaplan et al. have addressed some of my previous concerns, e.g., the methodological section has been updated to include the relevant information, and the Introduction now better considers the previous literature.

      In the revised manuscript, the authors have made limited efforts to address the main criticism of my original review: lack of mechanistic insight as to why loss of mesenchymal Meis2 leads to the absence of whisker placodes. The new data reported indicate that the lack of whisker placodes is not a mere delay. In this context, the authors also show one image of E18.5 snouts that includes developing hair follicles. Interestingly, the image shown seems to indicate that hair follicles do develop normally in the absence of mesenchymal Meis2, although this finding is not reported in any detail or quantified. The authors suggest that this could be due to an early role of Meis2 in the mesenchyme because HFs develop later. Indeed, one plausible possibility is that Meis2 does not have any direct role in whisker (or hair) follicle development but is specifically required for some other function in the whisker pad mesenchyme, a function that remains unidentified in the current study as it mainly focuses on analyzing hair follicle marker expression in whisker follicles. I think this should be better reflected in the Discussion.

      Additional comments:

      The revised manuscript included the quantification of Lef1 intensity in control and Meis2 cKO whisker follicles (lines 251-252 and 255-258). I may have missed it, but I could not find information on how the intensities were quantified, and therefore it was not possible for me to evaluate this part of the data. Nevertheless, I think the main text is not the place for these quantifications; rather, they would better fit e.g. Suppl. Figure 4.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Mehmet Mahsum Kaplan et al. demonstrate that Meis2 expression in neural crest-derived mesenchymal cells is crucial for whisker follicle (WF) development, as WF fails to develop in wnt1-Cre;Meis2 cKO mice. Advanced imaging techniques effectively support the idea that Meis2 is essential for proper WF development and that nerves, while affected in Meis2 cKO, are dispensable for WF development and not the primary cause of WF developmental failure. The study also reveals that although Meis2 significantly downregulates Foxd1 in the mesenchyme, this is not the main reason for WF development failure. The paper presents valuable data on the role of mesenchymal Meis2 in WF development. However, further quantification and analysis of the WF developmental phenotype would be beneficial in strengthening the claim that Meis2 controls early WF development rather than causing a delay or arrest in development. A deeper sequencing data analysis could also help link Meis2 to its downstream targets that directly impact the epithelial compartment.

      Strengths:

      (1) The authors describe a novel molecular mechanism involving Mesenchymal Meis2 expression, which plays a crucial role in early WF development.

      (2) They employ multiple advanced imaging techniques to illustrate their findings beautifully.

      (3) The study clearly shows that nerves are not essential for WF development.

      We thank the reviewer for valuable comments that will help improve our study.

      Weaknesses:

      (1) The authors claim that Meis2 acts very early during development, as evidenced by a significant reduction in EDAR expression, one of the earliest markers of placode development. While EDAR is indeed absent from the lower panel in Figure 3C of the Meis2 cKO, multiple placodes still express EDAR in the upper two panels of the Meis2 cKO. The authors also present subsequent analysis at E13.3, showing one escaped follicle positive for SHH and Sox9 in Figures 1 and 3. Does this suggest that follicles are specified but fail to develop? Alternatively, could there be a delay in follicle formation? The increase in Foxd1 expression between E12.5 and E13.5 might also indicate delayed follicle development, or as the authors suggest, follicles that have escaped the phenotype. The paper would significantly benefit from robust quantification to accompany their visual data, specifically quantifying EDAR, Sox9, and Foxd1 at different developmental stages. Additionally, analyzing later developmental stages could help distinguish between a delay or arrest in WF development and a complete failure to specify placodes.

      The earliest DC (FOXD1) and placodal (EDAR, LEF1) markers tested in this study were observed only in the escaped WFs, whereas these markers were missing at the expected WF sites in mutants. This was also reflected in the loss of typical placodal morphology in the mutant epithelium. On the other hand, escaped WFs developed normally, as shown by the analysis in Supp Fig 1A-B showing their normal size. These data suggest that development of escaped WFs is not delayed, because delayed WFs would appear smaller in size. To strengthen this conclusion, we assessed whisker development at E18.5 in Meis2 cKO mice by EDAR staining, and the results are shown in the newly added Supplementary Figure 2. This experiment revealed that the whisker phenotype persisted until E18.5; therefore, this phenotype cannot be explained by a developmental delay.

      As far as quantification is concerned, we have already quantified the number of whiskers in controls and mutants at E12.5 and E13.5 in all whole mount experiments we did, i.e. Shh ISH and SOX9 or EDAR whole mount IFC. We pooled all these numbers together and calculated the whisker number reduction to 5.7+/-2.0% at E12.5 and 17.1+/-5.9% at E13.5. Lines 132-134.

      (2) The authors show that single-cell sequencing reveals a reduction in the pre-DC population, reduced proliferation, and changes in cell adhesion and ECM. However, these changes appear to affect most mesenchymal cells, not just pre-DCs. Moreover, since E12.5 already contains WFs at different stages of development, as well as pre-DCs and DCs, it becomes challenging to connect these mesenchymal changes directly to WF development. Did the authors attempt to re-cluster only Cluster 2 to determine if a specific subpopulation is missing in Meis2 cKO? Alternatively, focusing on additional secreted molecules whose expression is disrupted across different clusters in Meis2 cKO could provide insights, especially since mesenchymal-epithelial communication is often mediated through secreted molecules. Did the authors include epithelial cells in the single-cell sequencing, can they look for changes in mesenchyme-epithelial cell interactions (Cell Chat) to indicate a possible mechanism?

      We agree with the reviewer that the effects of Meis2 on cell proliferation and on the expression of cell adhesion and ECM markers are more general because they take place in the whole underlying mesenchyme. Our genetic tools did not allow specific targeting of DCs or pre-DCs. Nonetheless, we trust that our data show that mesenchymal Meis2 is required for the initial steps of WF development, including Pc formation. As far as bioinformatics data are concerned, this data set was taken from the large dataset GSE262468 covering the whole craniofacial region, which led to very limited cell numbers in cluster 2 (DC): WT_E12_5: 28, WT_E13_5: 131, MUT_E12_5: 19, MUT_E13_5: 28. Unfortunately, such small cell numbers did not allow further sub-clustering, efficient normalization, integration, or conclusions from their transcriptional profiles. Although a number of interesting differentially expressed genes were identified (see supplementary datasets), none of them convincingly pointed to a plausible secreted-molecule candidate.

      We agree with the reviewer that CellChat analysis could provide a robust indication of mesenchymal-epithelial communication; however, our datasets included only the mesenchymal cell population (Wnt1-Cre2 progeny), and epithelial cells were excluded by FACS prior to scRNA-seq (Hudacova et al. https://doi.org/10.1016/j.bone.2024.117297).

      (3) The authors aim to link Meis2 expression in the mesenchyme with epithelial Wnt signaling by analyzing Lef1, bat-gal, Axin1, and Wnt10b expression. However, the changes described in the figures are unclear, and the phenotype appears highly variable, making it difficult to establish a connection between Meis2 and Wnt signaling. For instance, some follicles and pre-condensates are Lef1 positive in Meis2 cKO. Including quantification or providing a clearer explanation could help clarify the relationship between mesenchymal Meis2 and Wnt signaling in both epidermal and mesenchymal cells. Did the authors include epithelial cells in the sequencing? Could they use single-cell analysis to demonstrate changes in Wnt signaling?

      We have now analyzed changes in LEF1 staining intensity in the epithelium and in the upper dermis. According to these quantifications, we observed a considerable decline in the number of LEF1+ placodes in the epithelium, which corresponds to the lower number of placodes. On the other hand, LEF1 intensity in the ‘escaped’ placodes was similar between controls and mutants. The LEF1 signal in the upper dermis is very strong overall, and its quantification did not reveal any changes in the DC and non-DC regions of the upper dermis. These data corroborate our conclusion that Meis2 in the mesenchyme is not crucial for dermal WNT signaling but is required for induction of LEF1 expression in the epithelium. However, once ‘escaper’ placodes appear, they display normal WNT signaling in Pc, DC and subsequent development. These quantitative data have been added to the revised manuscript. Lines 247-260.

      (4) Existing literature, including studies on Neurog KO and NGF KO, as well as the references cited by the authors, suggest that nerves are unlikely to mediate WF development. While the authors conduct a thorough analysis of WF development in Neurog KO, further supporting this notion, this point may not be central to the current work. Additionally, the claim that Meis2 influences trigeminal nerve patterning requires further analysis and quantification for validation.

      We agree with the reviewer that analysis of the Neurogenin1 knockout mice should not be central to this report. Nonetheless, a thorough analysis of WF development in Neurog1 KO was needed to distinguish between two possible mechanisms: the whisker phenotype in Meis2 cKO results from (1) impaired nerve branching or (2) a function of Meis2 in the mesenchyme. We will modify the text accordingly to make this clearer to readers. We also agree that nerve branching was not extensively analyzed in the current study, but two samples from mutant mice were provided (Fig1 and Supp Videos), reflecting the consistency of the phenotype (see also Machon et al. 2015). This section was not central to this report either but led us to focus fully on the mesenchyme. We think that Meis2 function in cranial nerve development is very interesting and deserves a separate study.

      We have edited the introduction to reflect the literature better. Lines 70-79.

      (5) Meis2 expression seems reduced but has not entirely disappeared from the mesenchyme. Can the authors provide quantification?

      We have attempted to quantify MEIS2 staining in the snout dermis. However, the background fluorescence made it challenging to quantify reliably. Additionally, since the dermal region where MEIS2 expression is relevant for inducing WF formation is not yet known, we were unable to determine which regions to analyze. Instead, we have now added three additional images from multiple regions of the snout sections stained with MEIS2 antibody in Supplementary Figure 1C. We believe the newly added images will make our conclusion that MEIS2 is efficiently deleted in the mutants more convincing.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Kaplan et al. study mesenchymal Meis2 in whisker formation and the links between whisker formation and sensory innervation. To this end, they used conditional deletion of Meis2 using the Wnt1 driver. Whisker development was arrested at the placode induction stage in Meis2 conditional knockouts leading to the absence of expression of placodal genes such as Edar, Lef1, and Shh. The authors also show that branching of trigeminal nerves innervating whisker follicles was severely affected but that whiskers did form in the complete absence of trigeminal nerves.

      Strengths:

      The analysis of Meis2 conditional knockouts convincingly shows a lack of whisker formation and all epithelial whisker/hair placode markers were analyzed. Using Neurog1 knockout mice, the authors show equally convincingly that whiskers and teeth develop in the complete absence of trigeminal nerves.

      We thank the reviewer for valuable comments that will help improve our study.

      Weaknesses:

      The manuscript does not provide much mechanistic insight as to why mesenchymal Meis2 leads to the absence of whisker placodes. Using a previously generated scRNA-seq dataset they show that two early markers of dermal condensates, Foxd1 and Sox2, are downregulated in Meis2 mutants. However, given that placodes and dermal condensates do not form in the mutants, this is not surprising and their absence in the mutants does not provide any direct link between Meis2 and Foxd1 or Sox2. (The absence of a structure evidently leads to the absence of its markers.)

      We apologize for the unclear explanation of our data. We meant that Meis2 is functionally upstream of Foxd1 because Foxd1 is reduced upon Meis2 deletion. This means that during WF formation, Meis2 operates before Foxd1 induction; it does not necessarily mean that Meis2 directly controls expression of Foxd1. Yes, we agree with the reviewer’s note that Foxd1 and Sox2, as known DC markers, decline because the number of WFs declines. We wanted to convince readers that Meis2 operates very early in the GRN hierarchy during WF development. We also admit that we provide little mechanistic insight into Meis2 function as a transcription factor. We think that this weak point does not lower the value of the report, which shows the indispensable role of Meis2 in WFs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The text could benefit from editing.

      We have proofread the text.

      Some information is missing from the materials and methods section - a description of sequenced cells, the ISH protocol used, etc.

      The methodological section has been updated. Single-cell experiments were performed and described in detail by Hudacova et al. 2025 (https://doi.org/10.1016/j.bone.2024.117297). We have utilized these datasets for scRNA analysis, which has been described sufficiently in the referred paper. A reference for the standard in situ protocol has been added.

      Reviewer #2 (Recommendations for the authors):

      In the Introduction of the paper, the authors raise the question of the role of innervation in whisker follicle induction "It has been speculated that early innervation plays a role in initiating WF formation (ref. 1)"...and..."this revives the previous speculations that axonal network may be involved in WF positioning". However, the authors forget to mention that Wrenn & Wessells, 1984 (reference 1 in the manuscript) reached exactly the opposite conclusion and stated e.g. "Nerve trunks and branches are present in the maxillary process well before any sign of vibrissa formation. Because innervation is so widespread there appears to be no immediate temporal correlation between the outgrowth of a nerve branch to a site and the generation of a vibrissa there. Furthermore, at the time just prior to the formation of the first follicle rudiment, there is little or no nerve branching to the presumptive site of that first follicle while branches are found more dorsally where vibrissae will not form until later." Therefore, I find that referring to the paper by Wrenn & Wessells is somewhat misleading. The fact that whisker follicles develop in ex vivo cultured whisker pads further hints that innervation is unlikely to play a role in whisker follicle induction.

      The Introduction also hints at the role of innervation in tooth induction but forgets to refer to the literature that shows exactly the opposite. Based on the evidence it rather appears that the developing tooth regulates the establishment of its own nerve supply, not that the nerves would regulate induction of tooth development.

      In my opinion, the Introduction should be partially rewritten to better reflect the literature.

      The introduction has been revised to better reflect the literature on the role of innervation in WF and tooth development. Lines 70-87.

      The authors conclude that Meis2 is upstream of Foxd1, but the evidence is based on the lack of Foxd1 expression in Meis2 mutants. However, as whiskers do not form, evidently all markers are also absent. More direct evidence of Meis2 being upstream of Foxd1 (or Sox2) should be presented to consolidate the conclusions.

      We have already reacted to this point above in the section Weaknesses. The text is now modified so that the interpretation is correct. Line: 407-409.

      Other comments:

      Author contributions state that XX performed experiments but the author list does not include anyone with such initials.

      This error has been corrected in revision.

    1. eLife Assessment

      This valuable study presents a computational model that simulates walking motions in Drosophila and suggests that, if sensorimotor delays in the neural circuitry were any longer, the system would be easily destabilized by external perturbations. The hierarchical control model is sensible and the evidence supporting the conclusions convincing. The modular model, which has many interacting components with varying degrees of biological realism, will serve as a well-grounded starting point for future studies that incorporate richer or more complete empirical data.

    2. Reviewer #1 (Public Review):

      Summary:

      In this work, the authors present a novel, multi-layer computational model of motor control to produce realistic walking behaviour of a Drosophila model in the presence of external perturbations and under sensory and motor delays. The novelty of their model of motor control is that it is modular, with divisions inspired by the fly nervous system, with one component based on deep learning while the rest are based on control theory. They show that their model can produce realistic walking trajectories. Given the mostly reasonable assumptions of their model, they convincingly show that the sensory and motor delays present in the fly nervous system are the maximum allowable for robustness to unexpected perturbations.

      Their fly model outputs torque at each joint in the leg, and their dynamics model translates these into movements, resulting in time-series trajectories of joint angles. Inspired by the anatomy of the fly nervous system, their fly model is a modular architecture that separates motor control at three levels of abstraction:

      (1) oscillator-based model of coupling of phase angles between legs,

      (2) generation of future joint-angle trajectories based on the current state and inputs for each leg (the trajectory generator), and

      (3) closed-loop control of the joint-angles using torques applied at every joint in the model (control and dynamics).

      These three levels of abstraction ensure coordination between the legs, future predictions of desired joint angles, and corrections to deviations from desired joint-angle trajectories. The parameters of the model are tuned in the absence of external perturbations using experimental data of joint angles of a tethered fly. A notable disconnect from reality is that the dynamics model used does not model the movement of the body and ground contacts as is the case in natural walking, nor the movement of a ball for a tethered fly, but instead something like legs moving in the air for a tethered fly.
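      The first of the three levels, phase coupling between legs, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic Kuramoto-style coupling rule, and the function names, the stepping frequency, the coupling strength, and the two-leg anti-phase example are all hypothetical choices made for illustration.

```python
import math


def step_phases(phases, target_offsets, frequency, coupling, dt):
    """Advance all leg phases by one Euler step.

    target_offsets[i][j] is the desired value of phases[j] - phases[i].
    Assumed Kuramoto-style rule, not taken from the reviewed manuscript.
    """
    n = len(phases)
    updated = []
    for i in range(n):
        # Each leg is pulled toward its desired offset from every other leg.
        pull = sum(
            math.sin(phases[j] - phases[i] - target_offsets[i][j])
            for j in range(n)
            if j != i
        )
        updated.append(phases[i] + dt * (2 * math.pi * frequency + coupling * pull))
    return updated


def simulate_phases(phases, target_offsets, frequency=10.0, coupling=2.0,
                    dt=0.001, steps=5000):
    """Integrate the coupled oscillators and return the final phases."""
    for _ in range(steps):
        phases = step_phases(phases, target_offsets, frequency, coupling, dt)
    return phases


# Two legs that should step in anti-phase (offset of pi), started almost in phase.
offsets = [[0.0, math.pi], [-math.pi, 0.0]]
final = simulate_phases([0.0, 0.3], offsets)
phase_difference = (final[1] - final[0]) % (2 * math.pi)
```

      In this toy setting the phase difference settles at the target offset regardless of the (nonzero) starting difference, which is the sense in which such a layer "ensures coordination between the legs".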

      In order to validate the realism of the generated simulated walking trajectories, the authors compare various attributes of simulated to real tethered fly trajectories and show qualitative and quantitative similarities, including using a novel metric coined as Kinematic Similarity (KS). The KS score of a trajectory is a measure of the likelihood that the trajectory belongs to the distribution of real trajectories estimated from the experimental data. While such a metric is a useful tool to validate the quality of simulated data, there is some room for improvement in the actual computation of this score. For instance, the KS score is computed for any given time-window of walking simulation using a fraction of information from the joint-angle trajectories. It is unclear if the remaining information in joint-angle trajectories that are not used in the computation of the KS score can be ignored in the context of validating the realism of simulated walking trajectories.

      The authors validate simulated walking trajectories generated by the trained model under a range of sensorimotor delays and external perturbations. The trained model is shown to generate realistic joint-angle trajectories in the presence of external perturbations as long as the sensorimotor delays are constrained within a certain range. This range of sensorimotor delays is shown to be comparable to experimental measurements of sensorimotor delays, leading to the conclusion that the fly nervous system is just fast enough to be robust to perturbations.

      Strengths:

      This work presents a novel framework to simulate Drosophila walking in the presence of external perturbations and sensorimotor delay. Although the model makes some simplifying assumptions, it has sufficient complexity to generate new, testable hypotheses regarding motor control in Drosophila. The authors provide evidence for realistic simulated walking trajectories by comparing simulated trajectories generated by their trained model with experimental data using a novel metric proposed by the authors. The model proposes a crucial role in future predictions to ensure robust walking trajectories against external perturbations and motor delay. Realistic simulations under a range of prediction intervals, perturbations, and motor delays generating realistic walking trajectories support this claim. The modular architecture of the framework provides opportunities to make testable predictions regarding motor control in Drosophila. The work can be of interest to the Drosophila community interested in digitally simulating realistic models of Drosophila locomotion behaviors, as well as to experimentalists in generating testable hypotheses for novel discoveries regarding neural control of locomotion in Drosophila. Moreover, the work can be of broad interest to neuroethologists, serving as a benchmark in modelling animal locomotion in general.

      Weaknesses:

      As the authors acknowledge in their work, the control and dynamics model makes some simplifying assumptions about Drosophila physics/physiology in the context of walking. For instance, the model does not incorporate ground contact forces and inertial effects of the fly's body. It is not clear how these simplifying assumptions would affect some of the quantitative results derived by the authors. The range of tolerable values of sensorimotor delays that generate realistic walking trajectories is shown to be comparable with sensorimotor delays inferred from physiological measurements. It is unclear if this comparison is meaningful in the context of the model's simplifying assumptions. The authors propose a novel metric coined as Kinematic Similarity (KS) to distinguish realistic walking trajectories from unrealistic walking trajectories. Defining such an objective metric to evaluate the model's predictions is a useful exercise, and could potentially be applied to benchmark other computational animal models that are proposed in the future. However, the KS score proposed in this work is calculated using only the first two PCA modes that cumulatively account for less than 50% of the variance in the joint angles. It is not obvious that the information in the remaining PCA modes may not change the log-likelihood that occurs in the real walking data.

      Comments on revisions:

      The authors have addressed the concerns and questions raised in the original review.

    3. Reviewer #2 (Public Review):

      Summary:

      In this study, Karashchuk et al. develop a hierarchical control system to control the legs of a dynamic model of the fly. They intend to demonstrate that temporal delays in sensorimotor processing can destabilize walking and that the fly's nervous system may be operating with delays as long as could possibly be corrected for.

      Strengths:

      Overall, the approach the authors take is impressive. Their model is trained using a huge dataset of animal data, which is a strength. Their model was not trained to reproduce animal responses to perturbations, but it successfully rejects small perturbations and continues to operate stably. Their results are consistent with the literature, that sensorimotor delays destabilize movements.

      Weaknesses:

      The model is sophisticated and interesting, but the reviewer has great concerns regarding this manuscript's contributions, as laid out in the abstract:

      (1) Much simpler models can be used to show that delays in sensorimotor systems destabilize behavior (e.g., Bingham, Choi, and Ting 2011; Ashtiani, Sarvestani, and Badri-Sproewitz 2021), so why create this extremely complex system to test this idea? The complexity of the system obscures the results and leaves the reviewer wondering if the instability is due to the many, many moving parts within the model. The reviewer understands (and appreciates) that the authors tested the impact of the delay in a controlled way, which supports their conclusion. However, the reviewer thinks the authors did not use the most parsimonious model possible, and as such, leave many possible sources for other causes of instability.

      (2) In a related way, the reviewer is not sure that the elements the authors introduced reflect the structure or function of the fly's nervous system. For example, optimal control is an active field of research and is behind the success of many-legged robots, but the reviewer is not sure what evidence exists that suggests the fly ventral nerve cord functions as an optimal controller. If this were bolstered with additional references, the reviewer would be less concerned.

      (3) "The model generates realistic simulated walking that matches real fly walking kinematics...". The reviewer appreciates the difficulty in conducting this type of work, but the reviewer cannot conclude that the kinematics "match real fly walking kinematics". The range of motion of several joints is 30% too small compared to the animal (Figure 2B) and the reviewer finds the video comparisons unpersuasive. The reviewer would understand if there were additional constraints, e.g., the authors had designed a robot that physically could not complete the prescribed motions. However the reviewer cannot think of a reason why this simulation could not replicate the animal kinematics with arbitrary precision, if that is the goal.
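      The general point in (1), that feedback delay alone can destabilize a controller, can be seen in a very small model. The sketch below is an illustration only and is not taken from the manuscript or from the cited papers: it assumes a scalar state corrected by proportional feedback on a delayed measurement, x[t+1] = x[t] - gain * x[t - delay], and the function names and parameter values are hypothetical.

```python
def simulate_delayed_feedback(gain, delay, steps=600):
    """Simulate x[t+1] = x[t] - gain * x[t - delay] from a unit initial error.

    The state history before t = 0 is assumed to equal 1.0.
    """
    history = [1.0] * (delay + 1)
    for _ in range(steps):
        x_now = history[-1]
        x_delayed = history[-1 - delay]  # measurement that is `delay` steps old
        history.append(x_now - gain * x_delayed)
    return history


def tail_amplitude(trajectory, window=50):
    """Largest absolute state value over the last `window` steps."""
    return max(abs(x) for x in trajectory[-window:])


# Same controller gain, different feedback delays.
short_delay = tail_amplitude(simulate_delayed_feedback(gain=0.5, delay=1))
long_delay = tail_amplitude(simulate_delayed_feedback(gain=0.5, delay=3))
```

      With the same gain, the error decays for the short delay and grows for the long one. A model this simple says nothing about leg kinematics, which is the tradeoff between parsimony and realism at issue in this comment.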

      Comments on revisions:

      The authors have addressed the concerns and questions raised in the original review.

    4. Author response:

      The following is the authors’ response to the original reviews.

      We thank the editor and reviewers for their supportive comments about our modeling approach and conclusions, and for raising several valid concerns; we address them briefly below. In addition, a detailed, point-by-point response to the reviewers’ comments is below, along with additions and edits we have made to the revised manuscript.

      Concerns about model’s biological realism and impact on interpretations

      The goal of this paper was to use an interpretable and modular model to investigate the impact of varying sensorimotor delays. Aspects of the model (e.g. layered architecture, modularity) are inspired by biology; at the same time, necessary abstractions and simplifications (e.g. using an optimal controller) are made for interpretability and generalizability, and they reflect common approaches from past work. The hypothesized effects of certain simplifying assumptions are discussed in detail in Section 3.5. Furthermore, the modularity of our model allows us to readily incorporate additional biological realism (e.g. biomechanics, connectomics, and neural dynamics) in future work. In the revision, we have added citations and edits to the text to clarify these points.

      Concerns that the model is overly complex

      To investigate the impact of sensorimotor delays on locomotion, we built a closed-loop model that recapitulates the complex joint trajectories of fly walking. We agree that locomotion models face a tradeoff between simplicity/interpretability and realism; therefore, we developed a model that was as simple and interpretable as possible, while still reasonably recapitulating joint trajectories and generalizing to novel simulation scenarios. Along these lines, we also did not select a model that primarily recreates empirical data, as this would hinder generalizability and add unnecessary complexity to the model. We do not think these design choices are significant weaknesses of this model; in fact, few comparable models account for all joints involved in locomotion, and fewer explicitly compare model kinematics with kinematics from data. We have added citations and edits to the text to clarify these points in the revision.

      Concerns about the validity of the Kinematic Similarity (KS) metric to evaluate walking

      We chose to incorporate only the first two PCA modes in the KS metric because the kernel density estimator performs poorly for high-dimensional data. Our primary use of this metric was to indicate whether the simulated fly continues walking in the presence of perturbations. For technical reasons, it is not feasible to perform equivalent experiments on real walking flies, which is one of the reasons we explore this phenomenon with the model. We note the dramatic shift from walking to nonwalking as delay increases (Figure 5). To be thorough, in the revision, we have investigated the effect of incorporating additional PCA modes, and whether this affects the interpretation of our results. We have additionally added to the discussion and presentation of the KS metric to clarify its purpose in this study. We agree with the reviewers that the KS metric is too coarse to reflect fine details of joint kinematics; indeed, in the unperturbed case, we evaluate our model’s performance using other metrics based on comparisons with empirical data (Figures 2, 7, 8).
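      The kind of computation described here can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the joint angles have already been projected onto the first two PCA modes, uses an isotropic Gaussian kernel with a hand-picked bandwidth, and scores a window by its mean log-likelihood. The function names, the bandwidth, and the synthetic "reference" points on a circle are all hypothetical.

```python
import math


def kde_log_density(point, samples, bandwidth):
    """Log density of `point` under an isotropic Gaussian kernel density estimate."""
    dim = len(point)
    log_norm = (
        -math.log(len(samples))
        - dim * math.log(bandwidth)
        - 0.5 * dim * math.log(2 * math.pi)
    )
    exponents = [
        -0.5 * sum((p - s) ** 2 for p, s in zip(point, sample)) / bandwidth ** 2
        for sample in samples
    ]
    # Log-sum-exp keeps the result finite for points far from every sample.
    top = max(exponents)
    return log_norm + top + math.log(sum(math.exp(e - top) for e in exponents))


def mean_log_likelihood(window, reference, bandwidth=0.5):
    """Average log density of a window of 2D points under the reference KDE."""
    return sum(kde_log_density(p, reference, bandwidth) for p in window) / len(window)


# Synthetic stand-in for real walking: a closed loop in the plane of two PCA modes.
reference = [(math.cos(2 * math.pi * i / 100), math.sin(2 * math.pi * i / 100))
             for i in range(100)]
walking_like = [(math.cos(2 * math.pi * i / 20 + 0.1), math.sin(2 * math.pi * i / 20 + 0.1))
                for i in range(20)]
stalled = [(3.0, 3.0)] * 20  # legs held in a posture far from the walking loop

score_walking = mean_log_likelihood(walking_like, reference)
score_stalled = mean_log_likelihood(stalled, reference)
```

      A window that stays near the reference loop scores much higher than one stuck in a distant posture, which matches the stated use of the metric as a coarse walking/nonwalking indicator. Because only two dimensions enter the density estimate, two windows that differ only in higher PCA modes would receive the same score.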

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work, the authors present a novel, multi-layer computational model of motor control to produce realistic walking behaviour of a Drosophila model in the presence of external perturbations and under sensory and motor delays. The novelty of their model of motor control is that it is modular, with divisions inspired by the fly nervous system, with one component based on deep learning while the rest are based on control theory. They show that their model can produce realistic walking trajectories. Given the mostly reasonable assumptions of their model, they convincingly show that the sensory and motor delays present in the fly nervous system are the maximum allowable for robustness to unexpected perturbations.

      Their fly model outputs torque at each joint in the leg, and their dynamics model translates these into movements, resulting in time-series trajectories of joint angles. Inspired by the anatomy of the fly nervous system, their fly model is a modular architecture that separates motor control at three levels of abstraction:

      (1) oscillator-based model of coupling of phase angles between legs,

      (2) generation of future joint-angle trajectories based on the current state and inputs for each leg (the trajectory generator), and

      (3) closed-loop control of the joint-angles using torques applied at every joint in the model (control and dynamics).

      These three levels of abstraction ensure coordination between the legs, future predictions of desired joint angles, and corrections to deviations from desired joint-angle trajectories. The parameters of the model are tuned in the absence of external perturbations using experimental data of joint angles of a tethered fly. A notable disconnect from reality is that the dynamics model used does not model the movement of the body and ground contacts as is the case in natural walking, nor the movement of a ball for a tethered fly, but instead something like legs moving in the air for a tethered fly.

      In order to validate the realism of the generated simulated walking trajectories, the authors compare various attributes of simulated to real tethered fly trajectories and show qualitative and quantitative similarities, including using a novel metric coined as Kinematic Similarity (KS). The KS score of a trajectory is a measure of the likelihood that the trajectory belongs to the distribution of real trajectories estimated from the experimental data. While such a metric is a useful tool to validate the quality of simulated data, there is some room for improvement in the actual computation of this score. For instance, the KS score is computed for any given time-window of walking simulation using a fraction of information from the joint-angle trajectories. It is unclear if the remaining information in joint-angle trajectories that are not used in the computation of the KS score can be ignored in the context of validating the realism of simulated walking trajectories.

      The authors validate simulated walking trajectories generated by the trained model under a range of sensorimotor delays and external perturbations. The trained model is shown to generate realistic joint-angle trajectories in the presence of external perturbations as long as the sensorimotor delays are constrained within a certain range. This range of sensorimotor delays is shown to be comparable to experimental measurements of sensorimotor delays, leading to the conclusion that the fly nervous system is just fast enough to be robust to perturbations.

      Strengths:

      This work presents a novel framework to simulate Drosophila walking in the presence of external perturbations and sensorimotor delay. Although the model makes some simplifying assumptions, it has sufficient complexity to generate new, testable hypotheses regarding motor control in Drosophila. The authors provide evidence for realistic simulated walking trajectories by comparing simulated trajectories generated by their trained model with experimental data using a novel metric proposed by the authors. The model proposes a crucial role in future predictions to ensure robust walking trajectories against external perturbations and motor delay. Realistic simulations under a range of prediction intervals, perturbations, and motor delays generating realistic walking trajectories support this claim. The modular architecture of the framework provides opportunities to make testable predictions regarding motor control in Drosophila. The work can be of interest to the Drosophila community interested in digitally simulating realistic models of Drosophila locomotion behaviors, as well as to experimentalists in generating testable hypotheses for novel discoveries regarding neural control of locomotion in Drosophila. Moreover, the work can be of broad interest to neuroethologists, serving as a benchmark in modelling animal locomotion in general.

      We thank the reviewer for their positive comments.

      Weaknesses:

      As the authors acknowledge in their work, the control and dynamics model makes some simplifying assumptions about Drosophila physics/physiology in the context of walking. For instance, the model does not incorporate ground contact forces and inertial effects of the fly's body. It is not clear how these simplifying assumptions would affect some of the quantitative results derived by the authors. The range of tolerable values of sensorimotor delays that generate realistic walking trajectories is shown to be comparable with sensorimotor delays inferred from physiological measurements. It is unclear if this comparison is meaningful in the context of the model's simplifying assumptions.

      We now discuss how some of these assumptions affect the quantitative results in the section “Towards biomechanical and neural realism”. We reproduce the relevant sentences below:

      “The inclusion of explicit leg-ground contact interactions would also make it harder for the model to recover when perturbed, because perturbations during walking often occur upon contact with the ground (e.g. the ground is slippery or bumpy).”

      “We anticipate that the increased sensory resolution from more detailed proprioceptor models and the stability from mechanical compliance of limbs in a more detailed biomechanical model would make the system easier to control and increase the allowable range of delay parameters. Conversely, we expect that modeling the nonlinearity and noise inherent to biological sensors and actuators may decrease the allowable range of delay parameters.”

      The authors propose a novel metric coined as Kinematic Similarity (KS) to distinguish realistic walking trajectories from unrealistic walking trajectories. Defining such an objective metric to evaluate the model's predictions is a useful exercise, and could potentially be applied to benchmark other computational animal models that are proposed in the future. However, the KS score proposed in this work is calculated using only the first two PCA modes that cumulatively account for less than 50% of the variance in the joint angles. It is not obvious that the information in the remaining PCA modes may not change the log-likelihood that occurs in the real walking data.

      The primary reason we designed the KS metric was to determine whether the simulated fly continues walking in the presence of perturbations. We initially limited the analysis of the KS to the first 2 principal components. For completeness, we now investigate the additional principal components in Appendix 9 and the effect of evaluating KS with different numbers of components in Appendix 10. 

      Overall, the results look similar when including additional components for impulse perturbations. For stochastic perturbations, the range of similar walking decreases as we increase the number of components used to evaluate walking kinematics. Comparing this with Appendix 9, which shows that higher components represent higher frequencies of the walking cycle, we conclude that at the edge of stability for delays (where the sum of sensory and actuation delays is about 40 ms), flies can continue walking but with impaired higher frequencies (relative to no perturbations) during and after perturbation.

      We added the following text in the methods:

      “We chose 2 dimensions for PCA for two key reasons. First, these 2 dimensions alone accounted for a large portion of the variance in the data (52.7% total, with 42.1% for first component and 10.6% for second component). There was a big drop in variance explained from the first to the second component, but no sudden drop in the next 10 components (see Appendix 9). Second, the KDE procedure only works effectively in low-dimensional spaces, and the minimal number of dimensions needed to obtain circular dynamics for walking is 2. We investigate the effect of varying the number of dimensions of PCA in Appendix 10.”
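
      The scoring idea in this passage (ask how likely simulated points are under a density estimated from real data in a 2D principal-component space) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the isotropic Gaussian kernel, the bandwidth value, the unit-circle stand-in for real walking data, and all function names are assumptions, and the points are taken to be already projected onto the first two principal components.

```python
import math

def gaussian_kde_logpdf(point, samples, bandwidth):
    """Log density at `point` under an isotropic Gaussian KDE built from `samples`."""
    dim = len(point)
    # Log of the normalising constant: 1 / (n * (2*pi)^(d/2) * h^d)
    log_norm = (-math.log(len(samples))
                - dim * math.log(bandwidth)
                - 0.5 * dim * math.log(2.0 * math.pi))
    # Exponent of each kernel; combined with log-sum-exp for numerical stability
    exponents = [
        -0.5 * sum((p - s) ** 2 for p, s in zip(point, sample)) / bandwidth ** 2
        for sample in samples
    ]
    largest = max(exponents)
    return log_norm + largest + math.log(sum(math.exp(e - largest) for e in exponents))

def mean_log_likelihood(trajectory, samples, bandwidth=0.3):
    """Average log-likelihood of a trajectory's points (assumed bandwidth)."""
    return sum(gaussian_kde_logpdf(p, samples, bandwidth) for p in trajectory) / len(trajectory)

# Hypothetical "real" data: points on a unit circle, mimicking circular
# walking dynamics in the first two principal components.
real = [(math.cos(2 * math.pi * k / 200), math.sin(2 * math.pi * k / 200))
        for k in range(200)]
on_cycle = [(math.cos(a), math.sin(a)) for a in (0.1, 1.3, 2.9, 4.4)]
off_cycle = [(3.0, 3.0), (-2.5, 3.5)]

print(mean_log_likelihood(on_cycle, real))   # higher: points lie on the cycle
print(mean_log_likelihood(off_cycle, real))  # lower: points lie far from it
```

      Because the cost of a kernel density estimate grows quickly with dimension, a sketch like this only stays practical in a low-dimensional space, which is consistent with the second reason given in the quoted text.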

      (Note that we have corrected the percentage of variance accounted for by the principal components, as these numbers were from an older analysis prior to the first draft.)

      We also reference Appendix 10 in the results:

      “We observed that robust walking was not contingent on the specific values of motor and sensory delay, but rather the sum of these two values (Fig. 5E). Furthermore, as delay increases, higher frequencies of walking are impacted first before walking collapses entirely (Appendix 10).”
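
      The dependence on the sum of the two delays can be illustrated with a toy example. The sketch below is not the authors' model: it assumes a scalar integrator with proportional feedback, a hypothetical gain of 0.3, and a state held at its initial value before time zero. In this toy system the command applied at step t was issued `motor_delay` steps earlier from a state sensed `sensory_delay` steps before that, so only the total delay enters the closed-loop dynamics.

```python
def simulate(sensory_delay, motor_delay, gain=0.3, steps=400, x0=1.0):
    """Toy delayed feedback loop: x(t+1) = x(t) + u(t - motor_delay),
    with u(t) = -gain * x(t - sensory_delay). All values are illustrative."""
    xs = [x0]

    def state_at(t):
        # Assumption: the state equals x0 for all times before the simulation starts
        return xs[t] if t >= 0 else x0

    for t in range(steps):
        issued_at = t - motor_delay                      # when the applied command was issued
        sensed = state_at(issued_at - sensory_delay)     # state that command was based on
        xs.append(xs[t] - gain * sensed)
    return xs

short_total = simulate(sensory_delay=1, motor_delay=1)   # total delay of 2 steps
long_total = simulate(sensory_delay=5, motor_delay=5)    # total delay of 10 steps

print(max(abs(v) for v in short_total[-50:]))  # decays towards zero
print(max(abs(v) for v in long_total[-50:]))   # grows: the loop is unstable
print(simulate(1, 3) == simulate(3, 1))        # same sum, same trajectory
```

      Swapping how a fixed total delay is split between the sensory and motor sides leaves the toy trajectory unchanged, while increasing the total eventually destabilises it.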

      Reviewer #2 (Public Review):

      Summary:

      In this study, Karashchuk et al. develop a hierarchical control system to control the legs of a dynamic model of the fly. They intend to demonstrate that temporal delays in sensorimotor processing can destabilize walking and that the fly's nervous system may be operating with delays as long as could possibly be corrected for.

      Strengths:

      Overall, the approach the authors take is impressive. Their model is trained using a huge dataset of animal data, which is a strength. Their model was not trained to reproduce animal responses to perturbations, but it successfully rejects small perturbations and continues to operate stably. Their results are consistent with the literature, that sensorimotor delays destabilize movements.

      Weaknesses:

      The model is sophisticated and interesting, but the reviewer has great concerns regarding this manuscript's contributions, as laid out in the abstract:

      (1) Much simpler models can be used to show that delays in sensorimotor systems destabilize behavior (e.g., Bingham, Choi, and Ting 2011; Ashtiani, Sarvestani, and Badri-Sproewitz 2021), so why create this extremely complex system to test this idea? The complexity of the system obscures the results and leaves the reviewer wondering if the instability is due to the many, many moving parts within the model. The reviewer understands (and appreciates) that the authors tested the impact of the delay in a controlled way, which supports their conclusion. However, the reviewer thinks the authors did not use the most parsimonious model possible, and as such, leave many possible sources for other causes of instability.

      We thank the reviewer for this observation — we agree that we did not make the goal of the work quite clear. The goal of this paper was to build an interpretable and generalizable model of fly walking, which was then used to investigate varying sensorimotor delays in the context of locomotion. To this end, we used a modular model to recreate walking kinematics, and then investigated the effect of delays on locomotion. Locomotion in itself is a complex phenomenon — thus, we have chosen a model that is complex enough to reasonably recapitulate joint trajectories, while remaining interpretable.

      We have clarified this in the text near the end of the introduction:

      “Here, we develop a new, interpretable, and generalizable model of fly walking, which we use to investigate the impact of varying sensorimotor delays in Drosophila locomotion.”

      We also emphasize the investigation of sensorimotor delays in the context of locomotion in the beginning of the “Effect of sensory and motor delays on walking” section:

      “... we used our model to investigate how changing sensory and motor delays affects locomotor robustness.”

      We also remark that while they are very relevant papers for our work, neither of the prior papers focus on locomotion: the first involves a 2D balance model of a biped, and the second involves drop landings of quadrupeds.

      Lastly, we note that the investigation of delay is not the only use for this model: in the future, this model can also be used to study other aspects of locomotion such as the role of proprioceptive feedback (see “Role of proprioceptive feedback in fly walking” section). The layered framework of the model can also be extended to other animals and locomotor strategies (see “Layered model produces robust walking and facilitates local control” section).

      (2) In a related way, the reviewer is not sure that the elements the authors introduced reflect the structure or function of the fly's nervous system. For example, optimal control is an active field of research and is behind the success of many-legged robots, but the reviewer is not sure what evidence exists that suggests the fly ventral nerve cord functions as an optimal controller. If this were bolstered with additional references, the reviewer would be less concerned.

      We thank the reviewer for the comment — we have now further clarified how our model elements reflect the fly’s nervous system. The elements we introduce are plausible but only loosely analogous to the fly’s nervous system. While we draw parallels from these elements to anatomy (e.g. in Fig 1A-B, and in the first paragraph of the Results section), we do not mean to suggest that these functional elements directly correspond to specific structures in the fly’s nervous system. A substantial portion of the suggested future work (see “Towards biomechanical and neural realism”) aims to bridge the gap between these functional elements and fly physiology, which is beyond the scope of this work. 

      We have added clarifying text to the Results section:

      “While the model is inspired by neuroanatomy, its components do not strictly correspond to components of the nervous system --- the construction of a neuroanatomically accurate model is deferred to future work (see Discussion).”

      In the specific case of optimal control: it is a theoretical model that predicts various aspects of motor control in humans, and there is evidence that it is implemented by the human nervous system (Todorov and Jordan, 2002; Scott, 2004; Berret et al., 2011). Based on this, we assume that optimal control is also a reasonable model for motor control implemented by the fly nervous system. Fly movement makes use of proprioceptive feedback signals (Mendes et al., 2013; Pratt et al., 2024; Berendes et al., 2016), and optimal control is a plausible mechanism that incorporates feedback signals into movement.

      We have added the following clarifying text in the Results section: 

      “The optimal controller layer maintains walking kinematics in the presence of sensorimotor delays and helps compensate for external perturbations. This design was inspired by optimal control-based models of movements in humans (Todorov and Jordan, 2002; Scott, 2004; Berret et al., 2011)”

      (3) "The model generates realistic simulated walking that matches real fly walking kinematics...". The reviewer appreciates the difficulty in conducting this type of work, but the reviewer cannot conclude that the kinematics "match real fly walking kinematics". The range of motion of several joints is 30% too small compared to the animal (Figure 2B) and the reviewer finds the video comparisons unpersuasive. The reviewer would understand if there were additional constraints, e.g., the authors had designed a robot that physically could not complete the prescribed motions. However, the reviewer cannot think of a reason why this simulation could not replicate the animal kinematics with arbitrary precision, if that is the goal.

      We agree with the reviewer that the model-generated kinematics are not perfectly indistinguishable from real walking kinematics, and now clarify this in the text. We also agree with the reviewer that one could build a model that precisely replicates real kinematics, but as they intuit, that was not our goal. Our goal was to build a model that both replicates animal kinematics, and is interpretable and generalizable (which allows us to investigate what happens when perturbations and varying sensorimotor delays are introduced). There is a trade-off between realism and generalizability — a simulation that fully recreates empirical data would require a model that is completely fit to data, which is likely to be more complex (in terms of parameters required) and less generalizable to novel scenarios. We have made design choices that result in a model that balances these trade-offs. We do not consider this to be a weakness of the model; in fact, few comparable models account for all joints involved in locomotion, and fewer explicitly compare model kinematics with kinematics from data.

      We have tempered the language in the abstract:

      “The model generates realistic simulated walking that resembles real fly walking kinematics”

      The tempered statement, we believe, is a fair characterization of the walking — it resembles but does not perfectly match real kinematics.

      We have also introduced clarifying text in the introduction:

      “Overall, existing walking models focus on either kinematic or physiological accuracy, but few achieve both, and none consider the effect of varying sensorimotor delays. Here, we develop a new, interpretable, and generalizable model of fly walking, which we use to investigate the impact of varying sensorimotor delays in Drosophila locomotion.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Potential typo on page 5:

      2.1.2 Joint kinematics trajectory generator

      Paragraph 4, last line: Original text - ".....it also estimates the current phase". Suggested correction - "...it also estimates the current phase velocity"

      Done

      Potential typo on page 8:

      2.3 Model maintains walking under unpredictable external perturbations.

      Paragraph 3, line 2: Original text - "...brief, unexpected force (e.g. legs slipping on an unstable surface)".

      Consider replacing force with motion, or providing an example of a force as opposed to displacement (slipping).

      Done

      Potential typo on page 8:

      2.3 Model maintains walking under unpredictable external perturbations.

      Paragraph 3, line 4: Original text - "The magnitude of this velocity is drawn from a normal distribution...".

      Is this really magnitude? If so, please discuss how the sign (+/-) is assigned to velocity, and how the normal distribution is centred so as to sample only positive values representing magnitude.

      Indeed the magnitude of the velocity is drawn from a normal distribution. A positive or negative sign is then assigned with equal odds. We have added text to clarify this:

      “The sign of the velocity was drawn separately so that there is equal likelihood for negative or positive perturbation velocities.”
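
      The sampling scheme described in this exchange can be sketched as below. This is a hedged illustration only: the function name, the mean and standard deviation, and in particular the use of an absolute value to keep the magnitude non-negative are assumptions, since the response does not state how negative draws from the normal distribution are handled.

```python
import random

def sample_perturbation_velocity(rng, mean, std):
    """Draw a perturbation velocity: magnitude first, then an independent sign."""
    # Assumption: take the absolute value so the magnitude is never negative
    magnitude = abs(rng.gauss(mean, std))
    # Sign drawn separately, with equal odds of positive or negative
    sign = rng.choice((-1.0, 1.0))
    return sign * magnitude

rng = random.Random(0)  # fixed seed so the sketch is reproducible
velocities = [sample_perturbation_velocity(rng, mean=1.0, std=0.5) for _ in range(10000)]
negative_fraction = sum(v < 0 for v in velocities) / len(velocities)
print(negative_fraction)  # close to one half
```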

      Page 8:

      2.3 Model maintains walking under unpredictable external perturbations.

      In Paragraph 5: Why is the data reduced to only 2 dimensions? Could higher order PCA modes (cumulatively accounting for more than 50% variance in the data) not have distinguishing information between realistic and unrealistic walking trajectories?

      We provide a longer response for this in the public review above.

      Page 11:

      Why wouldn't a system trained in the presence of external perturbations perform better? What is the motivation to remove external perturbations during training?

      We agree that a system trained in the presence of external perturbations would probably perform better — however, we do not have data that contains walking with external perturbations. Nothing was removed — all the data used in this study involve a fly walking without perturbations.

      We have added a clarification:

      “our model maintains realistic walking in the presence of external dynamic perturbations, despite being trained only on data of walking without perturbations (no perturbation data was available).”

      Page 16:

      4.1 Tracking joint angles of D. melanogaster walking in 3D.

      Paragraph 1: Readers who wish to collect similar data might benefit from specifying the exposure time, animal size in pixels (or camera sensor format and field of view), in addition to the frame rate. Alternatively, consider mentioning the camera and lens part numbers provided by the manufacturer.

      This is a good point. We have updated the text to include these specifications:

      “We obtained fruit fly D. melanogaster walking kinematics data following the procedure previously described in (Karashchuk et al, 2021). Briefly, a fly was tethered to a tungsten wire and positioned on a frictionless spherical treadmill ball suspended on compressed air. Six cameras (Basler acA800-510um with Computar zoom lens MLM3X-MP) captured the movement of all of the fly's legs at 300 Hz. The fly size in pixels ranges from about 300x300 up to 700x500 pixels across the 6 cameras. Using Anipose, we tracked 30 keypoints on the fly, which are the following 5 points on each of the 6 legs: body-coxa, coxa-femur, femur-tibia, and tibia-tarsus joints, as well as the tip of the tarsus.”

      Potential typos on page 18:

      4.3.3 Training procedure

      Paragraph 2, line 1: Original text - "..(, p)"

      Do the authors mean "...(, )"

      Paragraph 2, line 2: Original text - "... (\theta, \dot \theta, v, p)" Do the authors mean "... (\theta, \dot \theta, v, \phi)"?

      Paragraph 3, line 3: Original text - "... (\theta, \dot \theta, v, p)" Do the authors mean "... (\theta, \dot \theta, v, \phi)"?

      Thank you for pointing out this issue. We have now fixed the phase p to be \phi to be consistent with the rest of the text.

      Paragraph 3, line 3: Original text - "...(\hat \theta)"

      Do the authors mean "(\theta_d)"? If not, please discuss the difference between \hat \theta and \theta_d.

      Thank you for pointing this out. \hat \theta and \theta_d were used interchangeably which is confusing. We have standardized our reference to the desired trajectory as \theta_d throughout the text.

      Page 19:

      Typo after eqn. (6):

      Original text: "where x := q - q, ... A and B are Jacobians with respect to...."

      Correction: "where x := q - q, ... Ac and Bc are Jacobians with respect to...."

      Similar corrections in eqn. 7 and eqn. 8: A and B should be replaced with Ac and Bc.

      Done

      Page 19, eqn. (10b):

      Should the last term be qd(t+T) as opposed to qd(t+1)?

      No: in fact (10a) contains the typo: it should be y(t+1) as opposed to y(t+T). This has been fixed.

      Page 19

      The authors' detailed description of the initial steps leading up to the dynamics model, involving the construction of the ODE, linearizing the system about the fixed point makes the text broadly accessible to the general reader. Similarly, adding some more description of the predictive model (eqn. 11 - 15) could improve the text's accessibility and the reader's appreciation for the model. This is especially relevant since the effects of sensorimotor delay and external perturbations, which are incorporated in the control and dynamics model, form a major contribution to this work. What do the matrices F, G, L, H, and K look like for the Drosophila model? Are there any differences between the model in Stenberg et al. (referenced in the paper) and the authors' model for predictive control? Are there any differences in the assumptions made in Stenberg et al. compared to the model presented in this work? The readers would likely also benefit from a figure showing the information flow in the model, and describing all the variables used in the predictive control model in eqn. 11 through eqn. 15 (analogous to Figure 1 in Stenberg et al. (2022)). Such a detailed description of the control and dynamics model would help the reader easily appreciate the assumptions made in modelling the effects of sensorimotor delay and external perturbations.

      Done

      Page 20:

      Eqn. 12: Should z(t+1) be z(t+T) instead?

      Similar comment for eqn. 14

      No: we made a mistake in (10a); there should be no (t+T) terms; all terms should be (t+1) terms to reflect a standard discrete-time difference equation.

      Eqn. 13: r(t) can be defined explicitly

      Done

      4.5 Generate joint trajectories of the complete model with perturbations

      Paragraph 2, line 2: Please read the previous comment

      \hat \theta and \theta_d were previously used interchangeably which is confusing. We have standardized our reference to the desired trajectory as \theta_d throughout the text.

      Original text - "Every 8 timesteps, we set \theta_d := \theta...."

      Does this mean \theta_d is set to \theta? If so, the motivation for this is not clear.

      We mean that \theta_d is set to be equal to \theta. We have replaced “:=” with “=” for clarity.

      General comments for the authors:

      Could the authors discuss the assumptions regarding Drosophila physiology implied in the control model?

      The control model is primarily included as a plausible functional element of the fly’s nervous system, and as such implies minimal assumptions on physiology itself. The main assumption, which is evident from the description of the model components, is that the fly uses proprioceptive feedback information to inform future movements.

      We have added clarifying text to the Results section:

      “While the model is inspired by neuroanatomy, its components do not strictly correspond to components of the nervous system --- the construction of a neuroanatomically accurate model is deferred to future work (see Discussion).”

      The authors acknowledge the absence of ground contact forces in the model. It is probably worth discussing how this simplification may affect inferences regarding the acceptable range of sensorimotor delay in generating realistic walking trajectories.

      We agree, and discuss how some of these assumptions affect the quantitative results in the section “Towards biomechanical and neural realism”. We replicate the relevant sentences below:

      “The inclusion of explicit leg-ground contact interactions would also make it harder for the model to recover when perturbed, because perturbations during walking often occur upon contact with the ground (e.g. the ground is slippery or bumpy).”

      The effects of other simplifications are also mentioned in the same section.

      Can the authors provide an insight into why the use of a second derivative of joint angles as the output of the trajectory generator (\ddot \theta) leads to more realistic trajectories (4.3.1 Model formulation, paragraph 1)?

      Does the use of a second-order derivative of joint angles lead to drift error because of integration?

      Could the distribution of \theta_d produced be out of the domain due to drift errors? Could this affect the performance of the neural network model approximating the trajectory generator?

      We are not sure why the second derivative works better than the first derivative. It is possible that modeling the system as a second order differential equation gives the network more ability to produce complex dynamics. 

      As can be seen in the example time series in Figures 2 and 3 and supplemental videos, there is no drift error from integration, so it is unlikely to affect the performance of the neural network.

      What does the model's failure (quantified by a low KS score) look like in the context of fly dynamics? What do the joint angles look like for low values of KS score? Does the fly fall down, for example?

      Since the model primarily considers kinematics, a low KS score means that kinematics are unrealistic, e.g. the legs attain unnatural angles or configurations. Examples of this can be seen in videos 4-7 (linked from Appendix 1 of the paper), as well as in the bottom row of Fig. 5, panel A. Here, at 40ms of motor delay, L2 femur rotation is seen to attain values that far exceed the normal ranges. 

      We have added a small clarification in the caption of Fig.5 panel A:

      “low KS indicates that the perturbed walking deviates from data and results in unnatural angles (as seen at 40ms motor delay)”

      We remark that since our simulations do not incorporate contact forces (as the reviewer remarks above, we simulate something like legs moving in the air for a tethered fly), the fly cannot “fall down” per se. However, if forces were incorporated then yes, these unrealistic kinematics would correspond to a fly that falls down or is no longer walking.

      Reviewer #2 (Recommendations For The Authors):

      L49: "Computational models of locomotion do not typically include delay as a tunable parameter, and most existing models of walking cannot sustain locomotion in the presence of delays and external perturbations". This remark confuses the reviewer.

      (1) If models do not "typically" include delay as a tunable parameter, this suggests that atypical models do. Which models do? Please provide references.

      Our initial phrasing was confusing. We meant to say that most models do not include delay, and some models do include delay as a fixed value (rather than a tunable value). We clarify in the updated text, which is replicated below:

      “Computational models of locomotion typically have not included delays as a tunable parameter, although some models have included them as fixed values (Geyer and Herr, 2010; Geijtenbeek et al., 2013).”

      (2) Has the statement that most existing models cannot sustain locomotion with delays been tested? If so, provide references. If not, please remove this statement or temper the language.

      Since most models don’t include delays, they cannot be run in scenarios with delays. We clarify in the updated text, which is replicated below:

      “Computational models of locomotion have not typically included delays. Some have included delay as a fixed value rather than a tunable parameter (Geyer and Herr, 2010; Geijtenbeek et al., 2013). However, in general, the impact of sensorimotor delays on locomotor control and robustness remains an underexplored topic in computational neuroscience.”

      L57: "two of six legs lift off the ground at a time" - Two legs are off the ground at any time, but they do not "lift off" simultaneously in the fruit fly. To lift off simultaneously, contralateral leg pairs would need to be 33% out of phase with one another, but they are almost always 50% out of phase.

      Thank you for pointing out this oversight. We have updated the text accordingly:

      “Flies walk rhythmically with a continuum of stepping patterns that range from tetrapod (where two of six legs are off the ground at a time) to tripod (where three of six legs are off the ground at a time)"

      L88: "a new model of fly walking" - The intention of the authors is to produce a model from which to learn about walking in the fly, is that correct? The reviewer has read the paper several times now and wants to be sure that this is the authors' goal, not to engineer a control system for an animation or a robot.

      Indeed, this is our goal. We were previously unclear about this, and have made text edits to clarify this — we provide a longer response for this in the public review above (see (1)).

      L126: "These desired phases are synchronized across pairs of legs to maintain a tripod coordination pattern, even when subject to unpredictable perturbations." - Does the animal maintain tripod coordination even when perturbed? In the reviewer's experience, flies vary their interleg coordination all the time. The reviewer would also expect that if perturbed strongly (as the supplemental videos show), the animal would adapt its interleg coordination in response. The reviewer finds this assumption to be a weak point in the paper when this disturbance is used to explore animal locomotion.

      We do not know exactly how flies may react to our mechanical perturbations. However, we may hypothesize based on past papers. 

      Couzin-Fuchs et al (2015) apply a mechanical perturbation to walking cockroaches. They find that the tripod is temporarily broken immediately after the perturbation but the cockroach recovers to a full tripod within one step cycle.

      DeAngelis et al (2019) apply optogenetic perturbations to fly moonwalker neurons that drive backward walking. Flies slow down following perturbation, but then recover after 200ms (about 2-3 steps) to their original speed (on average). 

      Thus, we think it is reasonable to model a fly’s internal phase coupling to maintain tripod and for its intended speed to remain the same even after a perturbation. 

      We do agree with the reviewer that it is plausible a fly might also slow down or even stop after a perturbation and we do not model such cases. We have added some text to the discussion on future work:

      “Future work may also model how higher-level planning of fly behavior interacts with the lower-level coordination of joint angles and legs. Walking flies continuously change their direction and speed as they navigate the environment (Katsov et al, 2017; Iwasaki et al 2024). Past work shows that flies tend to recover and walk at similar speeds following perturbations (DeAngelis et al, 2019), but individual flies might still change walking speed, phase coupling, or even transition to other behaviors, such as grooming. Modeling these higher-level changes in behavior would involve combining our sensorimotor model with models for navigation (Fisher 2022) or behavioral transitions (Berman et al, 2016).”

      L136: "...to output joint torques to the physical model of each leg" - Is this the ultimate output of the nervous system? Muscles are certainly not idealized torque generators. There are dynamics related to activation and mechanics. The reviewer is skeptical that this is a model of neural control in the animal, because the computation of the nervous system would be tuned to account for all these additional dynamics.

      We agree with the reviewer that joint torques are not the ultimate output of the nervous system. We use a torque controller because it is parsimonious, and serves our purpose of creating an interpretable and modular locomotion model.

      We also agree that muscles are an important consideration — we make mention of them later on in the paper under the section “Toward biomechanical and neural realism”, where we state “Another step toward biological realism is the incorporation of explicit dynamical models of proprioceptors, muscles, tendons, and other biomechanical aspects of the exoskeleton.”

      Our goal is not to directly model neural control of the animal. We have introduced text clarifications to emphasize this — we provide a longer response for this in the public review above (see (2)).

      L143: "To train the network from data, we used joint kinematics of flies walking on a spherical treadmill..." This is an impressive approach, but then the reviewer is confused about why the kinematics of the model are so different from those of the animal. The animal takes longer strides at a lower frequency than the model. If the model were trained with data, why aren't they identical? This kind of mismatch makes the reviewer think the approach in this paper is too complicated to address the main problem.

      The design of our trajectory generator model is one of the simplest for reproducing the output of a dynamical system. It consists of a multilayer perceptron model that models the phase velocity and joint angle accelerations at each timestep. All of its inputs are observable and interpretable: the current joint angles, joint angle derivatives, desired walking speed, and phase angle. 
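
      The input/output structure described here can be sketched as a rollout loop. This is an illustrative sketch, not the authors' code: `stand_in_generator` is a hypothetical placeholder for the trained multilayer perceptron (a harmonic oscillator whose frequency is assumed to scale with walking speed), a single joint angle is used, and the semi-implicit Euler update and the step size are assumptions.

```python
import math

def stand_in_generator(theta, theta_dot, speed, phase):
    """Hypothetical stand-in for the trained network.
    Inputs mirror the text: joint angle, its derivative, desired speed, phase.
    Returns (joint angle acceleration, phase velocity)."""
    omega = 2.0 * math.pi * speed  # assumed: cycle frequency proportional to speed
    return -omega ** 2 * theta, omega

def rollout(theta, theta_dot, phase, speed, dt, steps):
    """Integrate the generator's outputs forward in time."""
    history = []
    for _ in range(steps):
        theta_ddot, phase_dot = stand_in_generator(theta, theta_dot, speed, phase)
        theta_dot += theta_ddot * dt                       # acceleration -> velocity
        theta += theta_dot * dt                            # velocity -> angle (semi-implicit Euler)
        phase = (phase + phase_dot * dt) % (2.0 * math.pi)  # advance and wrap the phase
        history.append((theta, theta_dot, phase))
    return history

history = rollout(theta=1.0, theta_dot=0.0, phase=0.0, speed=2.0, dt=1.0 / 300.0, steps=3000)
print(max(abs(h[0]) for h in history))  # amplitude stays bounded in this toy case
```

      In this toy case the integrated angle stays bounded over many cycles. Whether a learned generator drifts depends on the learned dynamics, which this sketch does not capture.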

      We chose this model for ease of interpretability, integration with the optimal controller, and to allow for generalization across perturbations. Given all of these constraints, this is the best model of desired kinematics we could obtain. We note that the simulated kinematics do match real fly kinematics qualitatively (Figure 2A and supplemental videos) and are close quantitatively (Figure 2B and C). We speculate that matching the animals’ strides at all walking frequencies may require explicitly modeling differences across individual flies. We leave the design and training of more accurate (but more complex) walking models for future work.

      We add some further discussion about fitting kinematics in the discussion:

      “Although we believe our model matches the fly walking sufficiently for this investigation, we do note that our model still underfits the joint angle oscillations in the walking cycle of the fly (see Figure 2 and Appendix 3). More precise fitting of the joint angle kinematics may come from increasing the complexity of the neural network architecture, improving the training procedure based on advances in imitation learning (Hussein et al., 2018), or explicitly accounting for individual differences in kinematics across flies (Deangelis et al., 2019; Pratt et al., 2024).”

      Figure 2: The reviewer thinks the violin plots in Figure 2C are misleading. Joint angles could be greater or less than 0, correct? If so, why not keep the sign (pos/neg) in the data? Taking the absolute value of the errors and "folding over" the distribution results in some strange statistics. Furthermore, the absolute value would shroud any systematic bias in the model, e.g., joint angles are always too small. The reviewer suggests the authors plot the un-rectified data and simply include 2 dashed lines, one at 5.56 degrees and one at -5.56 degrees.

      These violin plots are averages of errors over all phases within each speed. We chose to do this to summarize the errors across all phase angle plots, which are shown in detail in Appendix 3 and 4.

      For the reviewer, we have added a plot of the raw errors across all phase angle plots in Appendix 5, E.
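
      The reviewer's concern about rectified errors can be illustrated with synthetic numbers. This sketch uses made-up data (a hypothetical model whose joint angles are about 3 degrees too small on average) and is not derived from the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic signed errors in degrees: hypothetical systematic bias of -3 degrees.
signed_err = rng.normal(loc=-3.0, scale=2.0, size=10_000)
abs_err = np.abs(signed_err)

mean_signed = signed_err.mean()          # reveals the direction of the bias
mean_abs = abs_err.mean()                # magnitude only, sign is lost
frac_negative = (signed_err < 0).mean()  # share of errors that are "too small"

print(f"mean signed error:   {mean_signed:.2f} deg")
print(f"mean absolute error: {mean_abs:.2f} deg")
print(f"fraction negative:   {frac_negative:.2f}")
```

      The absolute errors summarize how far off the model is, while only the signed errors show that most of the deviation lies on one side of zero.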

      L156: Should "\phi\dot" be "\phi"?

      We originally had a typo: we said “phase” when we meant “phase velocity”. This has been fixed. \phi\dot is correct.

      L160: "This control is possible because the controller operates at a higher temporal frequency than the trajectory generator...". This statement concerns the reviewer. To the reviewer, this sounds like the higher-level control system communicates with the "muscles" at a higher frequency than the low-level control system, which conflicts with the hierarchical timescales at which the nervous system operates. Or do the authors mean that the optimal controller can perform many iterations in between updates from the trajectory generator level? If so, please clarify.

      We mean that the optimal controller can perform many iterations in between updates from the trajectory generator level. The text has been clarified:

      “This control is possible because the controller operates at a higher temporal frequency than the trajectory generator in the model. The controller can perform many iterations (and reject disturbances) in between updates to and from the trajectory generator.”

      L225: "We considered two types of perturbations: impulse and persistent stochastic". Are these realistic perturbations? Realistic perturbations such as a single leg slipping, or the body movement being altered would produce highly correlated joint velocities.

      These perturbations are not quite realistic. Nonetheless, we describe how they are analogous to real perturbations in the subsequent text of the paper, and we restrict our simulations to ranges that would be biologically plausible (see Appendix 7). We agree that realistic perturbations would produce highly correlated joint accelerations and velocities, whereas our perturbations produce random joint accelerations.

      L265: "...but they are difficult to manipulate experimentally..." This is true, but it can and has been done. The authors should cite:

      Bässler, U. (1993). The femur-tibia control system of stick insects-A model system for the study of the neural basis of joint control. Brain Research Reviews, 18(2), 207-226. 

      Thank you for the suggestion, we have incorporated it into the text at the end of the referenced sentence.

      L274: "...since the controller can effectively compensate for large delays by using predictions of joint angles in the future". But can the nervous system do this? Or, is there a reason to think that the nervous system can? The reviewer thinks the authors need stronger justification from the literature for their optimal control layer.

      To clarify, this sentence describes a feature of the model’s behavior when no external perturbations are present. This is not directly relevant to the nervous system, since organisms do not typically exist in an environment free of perturbations — we are not suggesting that the nervous system does this.

      In response to the question of whether the nervous system can compensate for delays using predictions: we know that delays are present in the nervous system, perturbations exist in the environment, and that flies manage to walk in spite of them. Thus, some type of compensation must exist to offset the effects of delays (the reviewer themself has provided some excellent citations that study the effects of delays). In our model, we use prediction as the compensation mechanism — this is one of our central hypotheses. We further discuss this in the section “Predictive control is critical for responding to perturbations due to motor delay”.

      L319: "The formulation of a modular, multi-layered model for locomotor control makes new experimentally-testable hypotheses about fly motor control...". What testable hypotheses are these? The authors should explicitly state them. They are not clear to the reviewer, especially given the nonphysiological nature of the control system and the mechanics.

      A number of testable hypotheses are mentioned throughout the Discussion section:

      “Our model predicts that at the same perturbation magnitude, walking robustness decreases as delays increase. This could be experimentally tested by altering conduction velocities in the fly, for example by increasing or decreasing the ambient temperature (Banerjee et al, 2021).  If a warmer ambient temperature decreases delays in the fly, but fly walking robustness remains the same in response to a fixed perturbation, this would indicate a stronger role for central control in walking than our modeling results suggest.”

      “In our model, robust locomotion was constrained by the cumulative sensorimotor delay. This result could be experimentally validated by comparing how animals with different ratios of sensory to motor delays respond to perturbations. Alternatively, it may be possible to manipulate sensory vs. motor delays in a single animal, perhaps by altering the development of specific neurons or ensheathing glia (Kottmeier et al., 2020). If sensory and motor delays have significantly different effects on walking quality, then additional compensatory mechanisms for delays could play a larger role than we expect, such as prediction through sensory integration, mechanical feedback, or compensation through central control.”

      “we hypothesize that removing proprioceptive feedback would impair an insect's ability to sustain locomotion following external perturbations.”

      “We propose that fly motor circuits may encode predictions of future joint positions, so the fly may generate motor commands that account for motor neuron and muscle delays.”

      L323: "...and biomechanical interactions between the limb and the environment". In the reviewer's experience, the primary determinant of delay tolerance is the mechanical parameters of the limb: inertia, damping, and parallel elasticity. For example, in Ashtiani et al. 2021, equation 5 shows exactly how this comes about: the delay changes the roots and poles of the control system. This is why the reviewer is confused by the complexity of the model in this submission; a simpler model would explain why delays cannot be tolerated in certain circumstances.

      We were previously unclear about the goal of the model, and have made text edits to clarify this — we provide a longer response for this in the public review above (see (1)).

      L362: Another highly relevant reference here would be Sutton et al. 2023.

      Done

      L366: Szczecinski et al. 2018 is hardly a "model"; it is mostly a description of experimental data. How about Goldsmith, Szczecinski, and Quinn 2020 in B&B? Their model of fly walking has pattern-generating elements that are coordinated through sensory feedback. In their model, motor activation is also altered by sensory feedback. The reviewer thinks the statement "Models of fly walking have ignored the role of feedback" is inaccurate and their description of these references should be refined.

      Thank you for the suggestion; we have tempered the language and revised this section to include more references, including the suggested one — text is replicated below. 

      “Many models of fly walking ignore the role of feedback, relying instead on central pattern generators (Lobato-Rios et al., 2022; Szczecinski et al., 2018; Aminzare et al., 2018) or metachronal waves (Deangelis et al., 2019) to model kinematics. Some models incorporate proprioceptive feedback, primarily as a mechanism that alters timing of movements in inter-leg coordination (Goldsmith et al., 2020; Wang-Chen et al., 2023).”

      We remark that Szczecinski et al does include a model that replicates data without using sensory feedback, so we think it is fair to include.  

      L371: "...highly dependent on proprioceptive feedback for leg coordination during walking." What about Berendes et al. 2016, which showed that eliminating CS feedback from one leg greatly diminished its ability to coordinate with the other legs? This suggests that even flies depend on sensory feedback for proper coordination, at least in some sense.

      Interesting suggestion – we have integrated it into the text a little further down, where it better fits:

      “Silencing mechanosensory chordotonal neurons alters step kinematics in walking Drosophila (Mendes et al., 2013; Pratt et al., 2024). Additionally, removing proprioceptive signals via amputation interferes with inter-leg coordination in flies at low walking speeds (Berendes et al., 2016)”

      L426: "The layered model approach also has potential applications for bio-mimetic robotic locomotion.". How fast can this model be computed? Can it run faster than real-time? This would be an important prerequisite for use as a robot control system.

      The model should be able to be run quite fast, as it involves only

      (1) Addition, subtraction, matrix multiplication, and sinusoidal computation on scalars (for the phase coordinator and optimal controller)

      (2) Neural network inference with a relatively small network (for the trajectory generator)

      Whether this can run in real-time depends on the hardware capabilities of the specific robot and the frequency requirements; it is possible to run this on a desktop or smaller embedded device.

      We do note that the model needs to first be set up and trained before it can be run, which takes some time (see panel D of Figure 1).

      L432: "...which is a popular technique in robotics.". Please cite references supporting this statement.

      We have added citations: the text and relevant citations are reproduced below:

      “... which is a popular technique in robotics (Hua et al., 2021; Johns, 2021)”

      Hua J, Zeng L, Li G, Ju Z. Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning. Sensors. 2021; 21(4):1278

      Johns E. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. In: 2021 IEEE international conference on robotics and automation (ICRA) IEEE; 2021. p. 4613–4619

      L509: "We find that the phase offset across legs is not modulated across walking speeds in our dataset". This is a surprising result to the reviewer. Looking at Figure 6C, the reviewer understands that there are no drastic changes in coordination with speed, but there are certainly some changes, e.g., L1-R3, L3-R1. In the reviewer's experience, even very small changes in interleg phasing can change the visual classification of walking from "tripod" to "tetrapod" or "metachronal". Furthermore, several leg pairs do not reside exactly at 0 or \pi radians apart, e.g., L1-L3, L2-L3, R1-R3, R2-R3. In conclusion, the reviewer thinks that setting the interleg coordination to tripod in all cases is a large assumption that requires stronger justification (or, should be eliminated altogether).

      We made a simplifying assumption of a tripod coordination across all speeds. The change in relative phase coordination across speeds is indeed relatively small and additionally we see little change in our results across forward speeds (see Figures 4B, 5C and 5D). 

      We have added text to clarify this assumption and what could be changed for future studies in the methods:

      “We estimate $\bar \phi_{ij}$ from the walking data by taking the circular mean over phase differences of pairs of legs during walking bouts. We find that the phase offset across legs is not strongly modulated across walking speeds in our dataset (see Appendix 2) so we model $\bar \phi_{ij}$ as a single constant independent of speed. In future studies, this could be a function of forward and rotation speeds to account for fine phase modulation differences.”
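
      For readers unfamiliar with the circular mean mentioned in the quoted methods text, here is a minimal sketch on synthetic phase differences. The data and the helper name `circular_mean` are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np

def circular_mean(angles):
    """Mean direction of angles in radians, returned in (-pi, pi]."""
    return np.angle(np.mean(np.exp(1j * np.asarray(angles))))

rng = np.random.default_rng(0)

# Synthetic phase differences for an anti-phase leg pair, clustered near pi,
# then wrapped to (-pi, pi] as phase differences usually are.
diffs = np.pi + rng.normal(0.0, 0.2, size=1000)
wrapped = np.angle(np.exp(1j * diffs))

naive = wrapped.mean()         # arithmetic mean is misled by the wrap at +/- pi
circ = circular_mean(wrapped)  # circular mean stays near +/- pi

print(f"arithmetic mean: {naive:.3f} rad")
print(f"circular mean:   {circ:.3f} rad")
```

      The arithmetic mean lands near 0 because values just below pi and just above -pi cancel, whereas the circular mean recovers the anti-phase relationship.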

      L581: "of dimension...". Should the asterisk be replaced by \times? The asterisk makes the reviewer think of convolution. This change should be made throughout this paragraph.

      Good point, done.

      Figure 6: Rotational velocities in all 3 sections are reported in mm/s, but these units do not make sense. Rotational velocities must be reported in rad/s or deg/s.

      The rotation velocity in mm/s corresponded to the tangential velocity of the ball the fly walked on. We agree that this does not easily generalize across setups, so we have updated the figure to report rotation velocities in rad/s.
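
      The conversion implied here is the standard relation between tangential and angular velocity. In the sketch below, the ball radius $r$ and the example numbers are hypothetical, and the tangential speed $v$ is assumed to be measured at the ball's surface about the rotation axis of interest.

```latex
\omega = \frac{v}{r}, \qquad
\text{e.g. } v = 10~\mathrm{mm/s},\ r = 5~\mathrm{mm}
\;\Rightarrow\; \omega = 2~\mathrm{rad/s} \approx 114.6~\mathrm{deg/s}
```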

      L619: The reviewer is unconvinced by using only 2 principal components of the data to compare the model and animal kinematics. The authors state on line 626 that the 2 principal components do not capture 56.9% of the variation in the data, which seems like a lot to the reviewer. This is even more extreme considering that the model has 20 joints, and the authors are reducing this to 2 variables; the reviewer can't see how any of the original waveforms, aside from the most fundamental frequencies, could possibly be represented in the PCA dataset. If the walking fly models looked similar to each other, the reviewer could accept that this method works. But the fact that this method says the kinematics are similar, but the motion is clearly different, leads the reviewer to suspect this method was used so the authors could state that the data was a good match.

      Our primary use of the KS metric was to indicate whether the simulated fly continues walking in the presence of perturbations, hence we limited the analysis of the KS to the first 2 principal components. 

      For completeness, we investigate the principal components in Appendix 9 and the effect of evaluating KS with different numbers of components in Appendix 10. 

      The results look similar across components for impulse perturbations. For stochastic perturbations, the range of similar walking decreases as we increase the number of components used to evaluate walking kinematics. Comparing this with Appendix 9, which shows that higher components represent higher frequencies of the walking cycle, we conclude that at the edge of stability for delays (where the sum of sensory and actuation delays is about 40 ms), flies can continue walking but with impaired higher frequencies (relative to no perturbations) during and after perturbation.

      We add text in the methods:

      “We chose 2 dimensions for PCA for two key reasons. First, these 2 dimensions alone accounted for a large portion of the variance in the data (52.7% total, with 42.1% for first component and 10.6% for second component). There was a big drop in variance explained from the first to the second component, but no sudden drop in the next 10 components (see Appendix 9). Second, the KDE procedure only works effectively in low-dimensional spaces, and the minimal number of dimensions needed to obtain circular dynamics for walking is 2. We investigate the effect of varying the number of dimensions of PCA in Appendix 10.”
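
      The "variance accounted for" figures quoted above are per-component explained-variance ratios. A minimal sketch of how such ratios can be computed is given below; it uses synthetic oscillatory data with 20 channels and an illustrative helper `explained_variance_ratio`, neither of which comes from the paper, and it does not reproduce the KDE step.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                  # center each column
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0 * np.pi, 2000)

# Synthetic "joint angles": 20 channels sharing one oscillation with different
# amplitudes and phase offsets, plus independent noise.
phases = rng.uniform(0.0, 2.0 * np.pi, size=20)
amps = rng.uniform(0.5, 1.5, size=20)
X = amps * np.sin(t[:, None] + phases) + 0.5 * rng.normal(size=(2000, 20))

ratio = explained_variance_ratio(X)
cum = np.cumsum(ratio)
print("first two components:", np.round(ratio[:2], 3))
print("cumulative after two:", round(float(cum[1]), 3))
```

      A single shared oscillation with varying phase offsets spans two dimensions (a sine and a cosine component), which is consistent with the statement that two dimensions are the minimum needed to obtain circular dynamics.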

      (Note that we have corrected the percentage of variance accounted for by the principal components, as these numbers were from an older analysis prior to the first draft.)

      We also reference Appendix 10 in the results:

      “We observed that robust walking was not contingent on the specific values of motor and sensory delay, but rather the sum of these two values (Fig. 5E). Furthermore, as delay increases, higher frequencies of walking are impacted first before walking collapses entirely (Appendix 10).”

    1. eLife Assessment

      This valuable study presents a deep learning framework for predicting synergistic drug combinations for cancer treatment in the AstraZeneca-Sanger (AZS) DREAM Challenge dataset. However, the evidence on the generalizability of the model is incomplete, as part of the validation seems to be flawed by overfitting, and only a modest correlation between predictions and observations was observed in the second, more independent test set. The reported tool, DIPx, could be of use for personalized drug synergy prediction and exploring the activated pathways related to the effects of drug combinations.

    2. Reviewer #1 (Public review):

      The authors introduce DIPx, a deep learning framework for predicting synergistic drug combinations for cancer treatment using the AstraZeneca-Sanger (AZS) DREAM Challenge dataset. While the approach is innovative, I have the following concerns and comments, which hopefully will improve the study's rigor and applicability, making it a more powerful tool in the real clinical world.

      (1) In the abstract: "We trained and validated DIPx in the AstraZeneca-Sanger (AZS) DREAM Challenge dataset using two separate test sets: Test Set 1 comprised the combinations already present in the training set, while Test Set 2 contained combinations absent from the training set, thus indicating the model's ability to handle novel combinations". Test Set 1 comprises combinations already present in the training set, which likely leads to overfitting. The model might show inflated performance metrics on this test set due to prior exposure to these combinations, not accurately reflecting its true predictive power on unknown data, which is crucial for discovering new drug synergies. The testing approach reduces the generalizability of the model's findings to new, untested scenarios.

      (2) The model struggles with predicting synergies for drug combinations not included in its training data (showing only Spearman correlation 0.26 in Test Set 2). This limits its potential for discovering new therapeutic strategies. Utilizing techniques such as transfer learning or expanding the training dataset to encompass a wider range of drug pairs could help to address this issue.

      (3) The use of pan-cancer datasets, while offering broad applicability, may not be optimal for specific cancer subtypes with distinct biological mechanisms. Developing subtype-specific models or adjusting the current model to account for these differences could improve prediction accuracy for individual cancer types.

      (4) Line 127, "Since DIPx uses only molecular data, to make a fair comparison, we trained TAJI using only molecular features and referred to it as TAJI-M.". TAJI was designed to use both monotherapy drug-response and molecular data, and is unlikely to reach its full potential if the monotherapy drug-response data are removed from training. It would be critical to use the same training datasets and then compare the performances. According to Figure 6 of TAJI's paper (Li et al., 2018, PMID: 30054332), the mean Pearson correlations for breast cancer and lung cancer are around 0.5-0.6.

      The following two concerns have been included in the Discussion section, which is great:

      (1) Training and validating the model using cell lines may not fully capture the heterogeneity and complexity of in vivo tumors. To increase clinical relevance, it would be beneficial to validate the model using primary tumor samples or patient-derived xenografts.

      (2) The Pathway Activation Score (PAS) is derived exclusively from primary target genes, potentially overlooking critical interactions involving non-primary targets. Including these secondary effects could enhance the model's predictive accuracy and comprehensiveness.

      Comments on revisions:

      The authors replied to my concerns but did not address them. This applies especially to my concern #1: they trained and validated DIPx in the AstraZeneca-Sanger (AZS) DREAM Challenge dataset using two separate test sets, and Test Set 1 comprised combinations already present in the training set. This likely leads to overfitting, yet they claimed "There is no danger of overfitting here" in their "Author Response" letter.

      All my other concerns are unchanged too.

    3. Reviewer #2 (Public review):

      Trac, Huang, et al used the AZ Drug Combination Prediction DREAM challenge data to make a new random forest-based model for drug synergy. They make comparisons to the winning method and also show that their model has some predictive capacity for a completely different dataset. They highlight the ability of the model to be interpretable in terms of pathway and target interactions for synergistic effects.

      In their revised manuscript, the authors attempt to address the points raised about a comparison to the full TAJI model and showing how molecular can be integrated into DIPx.

      (1) Their argument that "Using only molecular data allows for more convenient and intuitive inference of pathway importance compared to integrating multiple data types" is unconvincing. It's not clear how adding a data source here confounds pathway inference. They need to add examples.

      (2) They have revised the method of calculating p-values instead of bootstrapping them, so the new numbers appear a lot more meaningful now.

      (3) The performance on the O'Neill dataset shows the limitations of their training regime and shows the limits of the model in terms of picking new drug combinations. I would argue that is the very definition of overfitting, not being able to model any combination it has never seen.

    4. Reviewer #3 (Public review):

      Summary:

      Predicting how two different drugs act together by looking at their specific gene targets and pathways is crucial for understanding the biological significance of drug combinations. This study incorporates drug-specific pathway activation scores (PASs) to estimate synergy scores as one of the key advancements for synergy prediction. The new algorithm, Drug synergy Interaction Prediction (DIPx), developed in this study, uses gene expression, mutation profiles, and drug synergy data to train the model and predict synergy between two drugs. Comprehensive comparisons with another best-performing algorithm, TAIJI-M, highlight the potential of its capabilities.

      Strengths:

      DIPx uses target and driver genes to elucidate pathway activation scores (PASs) to predict drug synergy. This approach integrates gene expression, mutation profiles, and drug synergy data to capture information about the functional interactions between drug targets, thereby providing a potential biological explanation for the synergistic effects of combined drugs. DIPx's performance was tested using the AstraZeneca-Sanger (AZS) DREAM Challenge dataset, especially in Test Set 1, where the Spearman correlation coefficient between predicted and observed drug synergy was 0.50 (95% CI: 0.47-0.53). DIPx's ability to handle novel combinations, as evidenced by its performance in Test Set 2, indicates its potential for predictions of new and untested drug combinations.

      Weaknesses:

      While the DIPx algorithm shows promise in predicting drug synergy based on pathway activation scores, it's essential to consider its limitations. One limitation is that the availability of training data for specific drug combinations may influence its predictive capability. Further testing and experimental validation of the predictions in future studies would be necessary to fully assess the algorithm's generalizability and robustness.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors introduce DIPx, a deep learning framework for predicting synergistic drug combinations for cancer treatment using the AstraZeneca-Sanger (AZS) DREAM Challenge dataset. While the approach is innovative, I have the following concerns and comments which hopefully will improve the study's rigor and applicability, making it a more powerful tool in the real clinical world.

      We thank the reviewer for recognizing the innovative aspects of DIPx and for sharing their valuable comments to further refine and strengthen our study. Those comments are carefully addressed in the following point-by-point response.

      (1) Test Set 1 comprises combinations already present in the training set, which likely leads to overfitting. The model might show inflated performance metrics on this test set due to prior exposure to these combinations, not accurately reflecting its true predictive power on unknown data, which is crucial for discovering new drug synergies. The testing approach reduces the generalizability of the model's findings to new, untested scenarios.

      From a clinical perspective, it is useful to test whether a known (previously tested) combination can work for a new patient, which is the purpose of Test Set 1. There is no danger of overfitting here, because the test set is completely independent of the discovery set: had we only discovered a false positive, the test set would not have more power than expected under the null. Predicting the effectiveness of unknown drug combinations (Test Set 2) is indeed an important and more challenging goal of synergy prediction, but it is statistically a distinct problem. The two test sets were previously designed by the AZS DREAM Challenge [PMID: 31209238].

      We have performed cross-validation on the dataset and demonstrated that the result of DIPx for Test Set 1 is not overfitting. Indeed, Figure 2—figure supplement 1 shows the 10-fold cross validation results for the training set. The median Spearman correlation between the predicted and observed Loewe scores across the 10 folds of cross-validation is 0.48, which is close to the correlation of 0.50 in Test Set 1 (red star).  We have added the cross-validation results to the “Validation and Comparisons in the AZS Dataset” section (page 4). 
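
      The cross-validation summary described above (a median Spearman correlation across 10 folds) can be sketched as follows. This is a hedged illustration on synthetic data: the ridge regressor is a stand-in chosen to keep the example dependency-free and is not the DIPx model, and the helper names `rank`, `spearman`, `ridge_fit_predict`, and `kfold_spearman` are assumptions made for this example.

```python
import numpy as np

def rank(a):
    """Ranks 0..n-1 for tie-free data."""
    order = np.argsort(a)
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks (no ties)."""
    return np.corrcoef(rank(a), rank(b))[0, 1]

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1.0):
    """Closed-form ridge regression, used here only as a placeholder model."""
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    return X_te @ w

def kfold_spearman(X, y, k=10, seed=0):
    """Per-fold Spearman correlation between predictions and held-out targets."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = ridge_fit_predict(X[tr], y[tr], X[te])
        scores.append(spearman(pred, y[te]))
    return np.array(scores)

# Synthetic features and "synergy scores" (not the AZS data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=4.0, size=500)

scores = kfold_spearman(X, y)
median_score = float(np.median(scores))
print("median Spearman across folds:", round(median_score, 3))
```

      Note that splitting by sample, as in this sketch, can leave the same drug combination on both sides of a split; evaluating performance on unseen combinations would instead require holding out whole combinations in each fold.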

      (2) The model struggles with predicting synergies for drug combinations not included in its training data (showing only a Spearman correlation of 0.26 in Test Set 2). This limits its potential for discovering new therapeutic strategies. Utilizing techniques such as transfer learning or expanding the training dataset to encompass a wider range of drug pairs could help to address this issue.

      We agree that this is an important limitation for the discovery of new therapeutic strategies. While transfer learning or expanding the training dataset could indeed help address this issue, implementing these approaches would require access to more comprehensive data, which is currently limited due to the scarcity of drug combination datasets. As more drug combination data become available in future, we plan to expand the training set to better cover a wider range of drug combinations and apply the transfer learning method to improve prediction accuracy. We have added a discussion on this in the Discussion Section.

      (3) The use of pan-cancer datasets, while offering broad applicability, may not be optimal for specific cancer subtypes with distinct biological mechanisms. Developing subtype-specific models or adjusting the current model to account for these differences could improve prediction accuracy for individual cancer types.

      We agree with the reviewer that the current settings of DIPx might not be optimal for specific cancers due to cancer heterogeneity. However, building subtype-specific models is currently constrained by limited data availability, which in turn restricts their predictive power. In the Discussion section, we mention this as one of DIPx's limitations and suggest future improvements in cancer-specific models.

      (4) Line 127, "Since DIPx uses only molecular data, to make a fair comparison, we trained TAJI using only molecular features and referred to it as TAJI-M.". TAJI was designed to use both monotherapy drug-response and molecular data, and is unlikely to reach its full potential if the monotherapy drug-response data are removed from training. It would be critical to use the same training datasets and then compare the performances. According to Figure 6 of TAJI's paper (Li et al., 2018, PMID: 30054332), the mean Pearson correlation for breast cancer and lung cancer is around 0.5-0.6.

      It is true that using monotherapy drug responses can enhance the performance of TAIJI as described in its original paper. In fact, TAIJI builds separate prediction modules for molecular data and monotherapy drug-response data, then combines their results to obtain the final prediction. In our paper we prioritize the exploration of molecular mechanisms in drug combinations while achieving performance comparable to the molecular model of TAIJI. DIPx can be expected to achieve similarly improved performance if we integrate the monotherapy drug response data using the same approach.

      My major concerns were listed in the public review. Here are some writing issues:

      (5) Some content in the Results section looks like a discussion: i.e, L129, "The extra information from the use of monotherapy data in TAJI is rather small, approximately 10% increase in the overall Spearman correlation, and, of course, we could also use such data in DIPx, so it is more convenient and informative to focus the comparisons on prediction based on molecular data alone."; L257, "As we discuss above, to get synergy, the two drugs in a combination theoretically should not have the same target. However, there is of course no guarantee that two drugs that do not share target genes can produce synergy. ".

      We have revised the texts and moved them to the Discussion section.  

      Reviewer #2 (Public Review):

      Trac, Huang, et al used the AZ Drug Combination Prediction DREAM challenge data to make a new random forest-based model for drug synergy. They make comparisons to the winning method and also show that their model has some predictive capacity for a completely different dataset. They highlight the ability of the model to be interpretable in terms of pathway and target interactions for synergistic effects. While the authors address an important question, more rigor is required to understand the full behavior of the model.

      We thank the reviewer for his/her time and effort in carefully reading the manuscript and acknowledging the significance of the study.

      Major Points

      (1) The authors compare DIPx to the winning method of the DREAM challenge, TAJI. To compare on molecular features alone, they retrain TAJI without the monotherapy data inputs to create TAJI-M. They mention that "of course, we could also use such data in DIPx...", but they never show the behaviour of DIPx with these data. The authors need to demonstrate that this statement holds true or else compare it to the full TAJI.

      This is similar to point 4 raised by Reviewer 1 regarding the exclusive use of molecular data in DIPx. In fact, TAIJI uses separate prediction modules for molecular data and drug-response data, which are then combined to obtain the final results. Integrating monotherapy drug data could enhance DIPx’s overall performance, for example by simply replacing TAIJI’s molecular model with DIPx in the full TAIJI to achieve comparable results, but this is not the primary goal of DIPx. Our focus is on exploring the potential molecular mechanisms of drug action. Using only molecular data allows for more convenient and intuitive inference of pathway importance compared to integrating multiple data types.

      We have revised the related text with the discussion in section “Validation and comparisons in the AZS dataset” of the main text.

      (2) It would be neat to see how the DIPx feature importance changes with monotherapy input. For most realistic scenarios in which these models are used robust monotherapy data do exist.

      Indeed, some existing models incorporate monotherapy data into their predictions; for example, a recent study [PMID: 33203866] uses only monotherapy data to predict drug combinations. TAIJI, as discussed in Point 1, uses separate models for monotherapy and molecular data. In general, both data types can be integrated into a single prediction model, allowing for the consideration of feature importance from both. While such an approach can highlight features contributing to predictive performance, the significance of a monotherapy feature does not necessarily indicate the activated pathways of a synergistic drug combination, which is the primary focus of our study. For this reason, we have excluded monotherapy data from DIPx.

      (3) In Figure 2, the authors compare DIPx and TAJI-M on various test sets. If I understood correctly, they also bootstrapped the training set with n=100 and reported all the model variants in many of the comparisons. While this is a nice way of showing model robustness, calculating p-values with bootstrapped data does not make sense in my opinion as by increasing the value of n, one can make the p-value arbitrarily small.

      The p-value should only be reported for the original models.

      The reviewer is correct that we cannot compute the p-value using an independent two-sample test, because the bootstrap correlation values are based on the same data. However, p-values can still be computed to compare the two prediction models using the bootstrap. Theoretically, the bootstrap can be used to compute a confidence interval for the differential correlation in the test set, and there is a close relationship between p-values and confidence intervals (see Pawitan, 2001, chapter 5; particularly p.134). Specifically, in this case, we compute the p-value as follows: (1) For each bootstrap, (i) compute the Spearman correlations between the predicted and observed scores in the test set for DIPx and TAIJI-M, denoted by r1 and r2, respectively; (ii) compute the difference in the Spearman correlations, d = r1 - r2. (2) Repeat the bootstrap n=100 times. (3) Compute the minimum of these two proportions: the proportion of d<0 and the proportion of d>0. (4) The two-sided p-value is 2x the minimum proportion in (3). To overcome the limited bootstrap sample size, we use the normal approximation in computing the proportions in (3). Note that in this method of computing the p-value, larger numbers of bootstrap replicates do not produce more significant results.

      We have re-computed the p-values using this method and added this text to the ‘Methods and Materials’ Section. 
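
      The p-value procedure described in this response can be illustrated with a short sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `bootstrap_p_value` and its arguments are hypothetical, the inputs are assumed to be the per-replicate Spearman correlations of the two models paired by bootstrap replicate, and the normal approximation is assumed to mean fitting a normal distribution to the bootstrap differences d.

```python
from statistics import NormalDist, mean, stdev


def bootstrap_p_value(r1, r2, normal_approx=True):
    """Two-sided bootstrap p-value for the difference of two correlations.

    r1, r2: per-replicate test-set Spearman correlations of the two models,
    paired by bootstrap replicate (hypothetical inputs for illustration).
    """
    if len(r1) != len(r2) or len(r1) < 2:
        raise ValueError("need two paired sequences with at least 2 replicates")

    # Step (1)(ii): difference in correlations for each replicate.
    d = [a - b for a, b in zip(r1, r2)]
    sd = stdev(d)

    if normal_approx and sd > 0:
        # Assumed form of the normal approximation: treat d as N(mean, sd^2)
        # and read the two tail proportions off the fitted distribution.
        p_neg = NormalDist(mean(d), sd).cdf(0.0)
        p_pos = 1.0 - p_neg
    else:
        # Step (3) taken literally: empirical proportions of d<0 and d>0.
        n = len(d)
        p_neg = sum(x < 0 for x in d) / n
        p_pos = sum(x > 0 for x in d) / n

    # Step (4): two-sided p-value is twice the smaller proportion.
    return min(1.0, 2.0 * min(p_neg, p_pos))
```

      In this sketch the p-value depends on the number of replicates only through the estimated mean and standard deviation of d, which stabilise rather than shrink as replicates are added. This is consistent with the statement that larger numbers of bootstrap replicates do not produce more significant results.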

      (4) From Figures 2 and 3, it appears DIPx is overfit on the training set with large gaps in Spearman correlations between Test Set 2/ONeil set and Test Set 1. It also features much better in cases where it has seen both compounds. Could the authors also compare TAJI on the ONeil dataset to show if it is as much overfit?

      The poor performance in the ONeil dataset is not due to overfitting as such, but more likely due to structural differences between the training and ONeil datasets. (To investigate the overfitting issue, we conducted a 10-fold cross-validation in the AZS training set. The median correlation between the predicted and observed Loewe scores across the ten folds is 0.48, which is comparable to the median of 0.50 in Test Set 1. Therefore, the model does not suffer from an overfitting issue. We have added this cross-validation result to the section “Validation and Comparisons in the AZS Dataset” (page 4).)

      We have now obtained TAIJI’s results on the ONeil dataset. TAIJI-M relies on a gene-gene interaction network to integrate the indirect drug targeting effects. This approach limits its applicability to new datasets, as it can only predict synergy scores for drug combinations present in the training dataset. Among the set of drug combinations present in the training set (n = 1102), both DIPx and TAIJI-M perform poorly, with Spearman correlations between predicted and observed synergy scores of 0.09 and 0.05, respectively.

      (Additional note: The original version of TAIJI-M uses gene expression, CNV, mutation, and methylation data. However, there is no methylation data in the ONeil dataset, so we retrained TAIJI-M without the methylation features. According to the final report of TAIJI in the challenge (https://www.synapse.org/Synapse:syn5614689/wiki/396206), Guan et al. reported that methylation features do not contribute to prediction performance in the post-challenge analysis. This means that retraining TAIJI-M without the methylation data will not materially affect the comparison between DIPx and TAIJI-M on the ONeil dataset.)

      Minor Points:

      (5) Pg 4, line 130: Citation needed for 10% contribution of monotherapy.

      (6) The general language of this paper is informal at times. I request the authors to refine it a bit.

      We thank the reviewer for pointing this out. We have added the appropriate citation for the statement and carefully revised the text to make it more formal.

      Reviewer #3 (Public Review):

      Summary:

      Predicting how two different drugs act together by looking at their specific gene targets and pathways is crucial for understanding the biological significance of drug combinations. Such combinations of drugs can lead to synergistic effects that enhance drug efficacy and decrease resistance. This study incorporates drug-specific pathway activation scores (PASs) to estimate synergy scores as one of the key advancements for synergy prediction. The new algorithm, Drug synergy Interaction Prediction (DIPx), developed in this study, uses gene expression, mutation profiles, and drug synergy data to train the model and predict synergy between two drugs and suggests the best combinations based on their functional relevance on the mechanism of action. Comprehensive validations using two different datasets and comparing them with another best-performing algorithm highlight the potential of its capabilities and broader applications. However, the study would benefit from including experimental validation of some predicted drug combinations to enhance its reliability.

      Strengths:

      The DIPx algorithm demonstrates the strengths listed below in its approach for personalized drug synergy prediction. One of its strengths lies in its utilization of biologically motivated cancer-specific (driver genes-based) and drug-specific (target genes-based) pathway activation scores (PASs) to predict drug synergy. This approach integrates gene expression, mutation profiles, and drug synergy data to capture information about the functional interactions between drug targets, thereby providing a potential biological explanation for the synergistic effects of combined drugs. Additionally, DIPx's performance was tested using the AstraZeneca-Sanger (AZS) DREAM Challenge dataset, especially in Test Set 1, where the Spearman correlation coefficient between predicted and observed drug synergy was 0.50 (95% CI: 0.47–0.53). This demonstrates the algorithm's effectiveness in handling combinations already in the training set. Furthermore, DIPx's ability to handle novel combinations, as evidenced by its performance in Test Set 2, indicates its potential for extrapolating predictions to new and untested drug combinations. This suggests that the algorithm can adapt to and make accurate predictions for previously unencountered combinations, which is crucial for its practical application in personalized medicine. Overall, DIPx's integration of pathway activation scores and its performance in predicting drug synergy for known and novel combinations underscore its potential as a valuable tool for personalized prediction of drug synergy and exploration of activated pathways related to the effects of combined drugs.

      Weaknesses:

      While the DIPx algorithm shows promise in predicting drug synergy based on pathway activation scores, it's essential to consider its limitations. One limitation is that the algorithm's performance was less accurate when predicting drug synergy for combinations absent from the training set. This suggests that its predictive capability may be influenced by the availability of training data for specific drug combinations. Additionally, further testing and validation across different datasets (more than the current two datasets) would be necessary to assess the algorithm's generalizability and robustness fully. It's also important to consider potential biases in the training data and ensure that DIPx predictions are validated through empirical studies including experimental testing of predicted combinations. Despite these limitations, DIPx represents a valuable step towards personalized prediction of drug synergy and warrants continued investigation and improvement. It would benefit if the algorithm's limitations are described with some examples and suggest future advancement steps.

      We are grateful to the reviewer for the thoughtful and encouraging comments, and for the time and effort to read our manuscript. We have carefully addressed them in our revision.

      Reviewer #3 (Recommendations For The Authors):

      The authors could consider some of the recommendations below to further improve the DIPx algorithm and its application in personalized drug synergy prediction. Firstly, expanding the training dataset to include a broader range of drug combinations could improve the algorithm's predictive capabilities, especially for novel combinations. This would help address the observed decrease in performance when predicting drug synergy for combinations absent from the training set. This could help assess the robustness of the algorithm and provide a more comprehensive evaluation of its performance for untrained combinations to strengthen its application.

      We agree that expanding the training dataset with a broader range of drug combinations would likely improve performance. However, the vast number of possible combinations, along with the associated cost of the experiment, limits the availability of drug combination data. To increase the size of the training data, we could combine different studies, but data from different studies are often generated using different protocols and experimental settings, introducing biases that complicate the integration. As technology continues to advance, we anticipate that more standardized and comprehensive data will become available in the future, which will help address this issue.

      Furthermore, the authors may consider incorporating additional features or data sources, such as drug-specific characteristics, i.e., availability of the drug, to enrich the information utilized by the algorithm. This could potentially improve the accuracy of the predictions and provide a more holistic understanding of the factors contributing to drug synergy.

      Indeed, incorporating additional information such as monotherapy data and drug-specific characteristics, as in TAIJI’s approach, could enhance overall prediction performance. As discussed in Point 5 below, the current study is focused on exploring the potential molecular mechanisms of drug combinations, rather than optimizing overall prediction accuracy. However, in its application, it is natural to add the monotherapy or drug-specific information into the algorithm, as done in TAIJI.

      Finally, conducting experimental studies to validate the predictions generated by DIPx in laboratory-based cell lines would be essential to confirm its accuracy and reliability. This could involve a few drug IC50 experimental validations of predicted synergistic drug combinations and their associated pathway activations to strengthen the algorithm's clinical relevance. By considering these recommendations, the authors can further refine and advance the DIPx algorithm.

      We agree that laboratory-based validation, such as IC50 experiments for predicted synergistic drug combinations and pathway activations, would indeed strengthen the clinical relevance of the algorithm. We hope future studies can build on this work by incorporating this experimental validation.

      Below are my specific comments:

      Major comments:

      (1) The description of all the outputs of the DIPX algorithm is not clearly explained. It is unclear whether it provides only the Loewe score, the confidence score, the PAS score, or all of them. It is necessary to clarify the output of the proposed algorithm to guide the reader on what to expect while using it. The steps from PASs to synergy scores are not well explained.

      We apologize for the lack of clarity. Regarding the outputs of DIPx, for any triplet (drug A + drug B, cell line C), DIPx provides both the predicted Loewe score and the corresponding confidence score as the output. PASs are used as the input data for the random forest algorithm, which processes PASs into the synergy score. We do not provide the details in the manuscript, but refer readers to the article by Ishwaran H et al. (2021). We have revised the first paragraph of the 'A Pathway-Based Drug Synergy Prediction Model' section (page 3) and Figure 1 to improve the presentation of the method.

      (2) In Figure 1, the predicted Loewe score for the Capivasertib + Sapitinib combination is not provided. However, Figures 1e and 4a show the pathways with the highest contribution for this combination. What is the predicted Loewe score for the Capivasertib + Sapitinib combination?

      Figures 1e and 4a present the pathways with the highest contribution for the combination, which are identified based on the drug-combination data from 12 cell lines, not a single data point.

      We have added the median Loewe score (=7.6) across 12 cell lines in the test sets (Test 1 + Test 2) for the Capivasertib + Sapitinib combination in Figure 1e and reported related information for this combination in Supplementary Table S1. Additionally, we revised the 'Inference of the Mechanism of Action Based on PAS' section (page 7) to clarify the pathway importance inference.

      (3) In Figure 1d, the combination of doxorubicin + AZ12623380 is predicted to exhibit high Loewe synergy, with a confidence score of 0.33. It is important to provide details of this prediction, including the pathway predictions, and to explain why the model suggested high synergy. Although Figure 4f contains information, it seems to be listed for the observed Loewe score rather than the predicted score provided in Figure 1d. DIPx predicts the doxorubicin + AZ12623380 combination to be synergistic, while in Figure 4, it is labeled as a non-synergistic combination. It is necessary for the authors to clearly indicate which illustration represents the predicted outcome and which hypothesis is based on the observed Loewe score.

      In Figure 1d, we reported both the predicted and observed Loewe scores for the experiment (combination = doxorubicin + AZ12623380, cell line = SW900). Although the predicted score is high, a confidence score of 0.33 indicates a low chance that the combination is truly synergistic. This is indeed confirmed by the non-synergistic observed score of -6, so the combination does not merit further investigation. This example highlights the value of the confidence score as a supplement to the predicted values.

      (4) Figure 3 - The external validation using ONeil requires more rigorous analysis to understand the biological significance of the predictions. It is important to provide pathway activation scores and their potential mechanism of action predicted by the DIPx algorithm when working with a new dataset. Additionally, including the predictions of TAIJI-M on the ONeil dataset would be beneficial for comparing the performance of both algorithms on a new dataset.

      We have included an example of potential pathways related to the MK2206 + Erlotinib combination in the ONeil cohort, as inferred by DIPx, in the last paragraph of the 'Inference of the Mechanism of Action Based on PAS' section (page 9). In this example, we identify 'Metabolism by CYP Enzymes' as the most significant pathway associated with this combination, which aligns with previous studies showing that both MK2206 and Erlotinib are metabolized by the CYP enzyme families [PMID: 24387695].

      Regarding the prediction of TAIJI-M on the ONeil dataset, a similar request was made in question 4 from Reviewer 2, which we have carefully addressed above. Briefly, due to differences between the two datasets, we retrained TAIJI-M without methylation data to enable prediction on the ONeil dataset. (As previously reported, methylation data did not significantly contribute to the results of TAIJI, and TAIJI-M can only predict synergy scores for drug combinations present in the training set.) Focusing on this subset of drug combinations, both TAIJI-M and DIPx perform poorly, with Spearman correlations of r=0.05 and r=0.09, respectively. The poor performance could be attributed to the limited overlap of drugs between the ONeil dataset and the AZS DREAM Challenge dataset.

      (5) TAIJI by Li et al., 2018 reported a high prediction correlation (0.53) in their study, while the modified version of TAIJI, TAJI-M, shows a lower prediction correlation in this study. The authors should clarify why the performance decreased when using the same dataset. Is it because only molecular data was used, excluding the monotherapy drug-response data? There is a spelling error in calling the algorithm - it is reported as TAIJI by Li et al., 2018, whereas this study calls it TAJI - an "I" is missing in TAIJI throughout the manuscript.

      Indeed, TAIJI-M has a lower prediction correlation (0.38) compared to the full TAIJI model (0.53), which includes the monotherapy data. Some studies, such as [PMID: 33203866], even use only monotherapy data to predict drug combinations, suggesting the importance of monotherapy data in drug-combination prediction. However, DIPx focuses on exploring the potential molecular mechanisms of drug combinations rather than on overall prediction results; therefore, we exclude the monotherapy data from the analysis. We have discussed this in the 'Validation and Comparisons in the AZS Dataset' section (page 4).

      We thank the reviewer for pointing out the spelling error for TAIJI; this has been corrected throughout the manuscript.

      (6) The authors should provide the predicted versus observed Loewe scores for all the combinations as a supplementary file. This would benefit the readers who want to replicate the results in the future. In the same way, including a sample output for the toy dataset on GitHub is required to assess the performance of the DIPx algorithm by a new user.

      All predicted and observed drug synergy scores are given in Supplementary Table S2. We have also uploaded a simple example to our GitHub page, along with detailed instructions for users on how to run the method, including generating PAS and training the prediction model. Since we do not have permission to host data from the AZS DREAM Challenge and the ONeil datasets on our GitHub page, users can download these datasets separately and directly apply the provided code.

      (7) GitHub can include all the input and output data to reproduce the correlation plots in the manuscript. GitHub could also include the modified version of TAIJI-M and its corresponding input for comparison. The methods section should include how TAIJI was performed.

      We have uploaded all the code and related data to the GitHub page to allow replication of all correlation plots in the manuscript. TAIJI-M represents the molecular model of the full TAIJI model. Both TAIJI-M and TAIJI are documented on the GitHub page of the original study. We have also included a link to the source code for TAIJI-M and TAIJI in the 'Data Availability' section.

      (8) Figure 5 - the data associated with this figure needs to be provided as supplementary listing the predicted values of Loewe scores for all the combinations.

      We report the associated data including the median of predicted and observed Loewe scores related to Figure 5c in Supplementary Table S2.

      Minor comments:

      (9) Abbreviations for the pathways are not included.

      We have included a list of abbreviations for all relevant pathways in Supplementary Table S5.

      (10) Line: 369. What is considered as bias correction? This needs to be explained.

      Bias correction refers to adjusting the original estimate of the Spearman correlation between the predicted and observed Loewe scores when there is a systematic difference between the estimates obtained from the bootstrap samples and the original correlation estimate. We revised the related text on page 13 to improve the explanation.
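
      One common form of this adjustment can be illustrated as follows. This is a sketch of the standard bootstrap bias correction, in which the bias is estimated as the mean of the bootstrap estimates minus the original estimate and then subtracted from the original estimate. That the manuscript uses exactly this estimator is an assumption, and the function name `bias_corrected_estimate` is hypothetical.

```python
from statistics import mean


def bias_corrected_estimate(original, bootstrap_estimates):
    """Standard bootstrap bias correction (assumed form, for illustration).

    original: correlation estimated on the original data.
    bootstrap_estimates: correlations estimated on the bootstrap samples.
    """
    # Estimated bias: systematic shift of the bootstrap estimates
    # relative to the original estimate.
    bias = mean(bootstrap_estimates) - original
    # Remove the estimated bias; equivalent to 2 * original - mean(bootstrap).
    return original - bias
```

      For example, if the original correlation were 0.50 and the bootstrap estimates averaged 0.53, the estimated bias would be 0.03 and the corrected estimate would move in the opposite direction, below 0.50.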

      (11) Line 364. Formulae or details for calculating actual predicted synergy (Ps) are missing.

      The predicted Loewe score, Ps, is the output of the regression random forest model. For simplicity, we do not describe the details in the manuscript, but refer readers to the method article (Ishwaran H et al., 2021). We have revised the text accordingly.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Jocher, Janssen, et al examine the robustness of comparative functional genomics studies in primates that make use of induced pluripotent stem cell-derived cells. Comparative studies in primates, especially amongst the great apes, are generally hindered by the very limited availability of samples, and iPSCs, which can be maintained in the laboratory indefinitely and defined into other cell types, have emerged as promising model systems because they allow the generation of data from tissues and cells that would otherwise be unobservable.

      Undirected differentiation of iPSCs into many cell types at once, using a method known as embryoid body differentiation, requires researchers to manually assign all cell types in the dataset so they can be correctly analysed. Typically, this is done using marker genes associated with a specific cell type. These are defined a priori, and have historically tended to be characterised in mice and humans and then employed to annotate other species. Jocher, Janssen, et al ask if the marker genes and features used to define a given cell type in one species are suitable for use in a second species, and then quantify the degree of usefulness of these markers. They find that genes that are informative and cell type specific in a given species are less valuable for cell type identification in other species, and that this value, or transferability, drops off as the evolutionary distance between species increases.

      This paper will help guide future comparative studies of gene expression in primates (and more broadly) as well as add to the growing literature on the broader challenges of selecting powerful and reliable marker genes for use in single-cell transcriptomics.

      Strengths:

      Marker gene selection and cell type annotation is a challenging problem in scRNA studies, and successful classification of cells often requires manual expert input. This can be hard to reproduce across studies, as, despite general agreement on the identity of many cell types, different methods for identifying marker genes will return different sets of genes. The rise of comparative functional genomics complicates this even further, as a robust marker gene in one species need not always be as useful in a different taxon. The finding that so many marker genes have poor transferability is striking, and by interrogating the assumption of transferability in a thorough and systematic fashion, this paper reminds us of the importance of systematically validating analytical choices. The focus on identifying how transferability varies across different types of marker genes (especially when comparing TFs to lncRNAs), and on exploring different methods to identify marker genes, also suggests additional criteria by which future researchers could select robust marker genes in their own data.

      The paper is built on a substantial amount of clearly reported and thoroughly considered data, including EBs and cells from four different primate species - humans, orangutans, and two macaque species. The authors go to great lengths to ensure the EBs are as comparable as possible across species, and take similar care with their computational analyses, always erring on the side of drawing conservative conclusions that are robustly supported by their data over more tenuously supported ones that could be impacted by data processing artefacts such as differences in mappability, etc. For example, I like the approach of using liftoff to robustly identify genes in non-human species that can be mapped to and compared across species confidently, rather than relying on the likely incomplete annotation of the non-human primate genomes. The authors also provide an interactive data visualisation website that allows users to explore the dataset in depth, examine expression patterns of their own favourite marker genes and perform the same kinds of analyses on their own data if desired, facilitating consistency between comparative primate studies.

      We thank the Reviewer for their kind assessment of our work.

      Weaknesses and recommendations:

      (1) Embryoid body generation is known to be highly variable from one replicate to the next for both technical and biological reasons, and the authors do their best to account for this, both by their testing of different ways of generating EBs, and by including multiple technical replicates/clones per species. However, there is still some variability that could be worth exploring in more depth. For example, the orangutan seems to have differentiated preferentially towards cardiac mesoderm whereas the other species seemed to prefer ectoderm fates, as shown in Figure 2C. Likewise, Supplementary Figure 2C suggests a significant unbalance in the contributions across replicates within a species, which is not surprising given the nature of EBs, while Supplementary Figure 6 suggests that despite including three different clones from a single rhesus macaque, most of the data came from a single clone. The manuscript would be strengthened by a more thorough exploration of the intra-species patterns of variability, especially for the taxa with multiple biological replicates, and how they impact the number of cell types detected across taxa, etc.

      You are absolutely correct in pointing out that the large clonal variability in cell type composition is a challenge for our analysis. We also noted the odd behavior of the orangutan EBs, and their underrepresentation of ectoderm. There are many possible sources for these variable differentiation propensities: clone, sample origin (in this case urine) and individual. However, unfortunately for the orangutan, we have only one individual and one sample origin and thus cannot say whether this germ layer preference says something about the species or is due to our specific sample.

      Because of this high variability from multiple sources, obtaining enough cell types with an appreciable overlap between species was a limiting factor for our analyses. In order to derive meaningful conclusions from intra-species analyses and the impact of different sources of variation on cell type propensity, we would need to sequence many more EBs with an experimental design that balances possible sources of variation. This would go beyond the scope of this study.

      Instead, here we control for intra-species variation in our analyses as much as possible: for the analysis of cell type specificity and conservation, the comparison is relative across the different degrees of specificity (Figure 3C). For the analysis of marker gene conservation, we explicitly take intra-species variation into account (Figure 4D).

      The same holds for the temporal aspect of the data, which is not really discussed in depth despite being a strength of the design. Instead, days 8 and 16 are analysed jointly, without much attention being paid to the possible differences between them.

      Concerning the temporal aspect, we indeed deliberately omitted an explicit comparison of day 8 and day 16 EBs, because we felt that it was not directly relevant to our main message. Our pseudotime analysis showed that the differences between the two time points were a matter of degree rather than of kind. All major lineages were already present at day 8, and even though day 8 cells had on average earlier pseudotimes, there was a large overlap in the pseudotime distributions between the two sampling time points (Author response image 1). That is why we decided to analyse the data together.

      Are EBs at day 16 more variable between species than at day 8? Is day 8 too soon to do these kinds of analyses?

      When we started the experiment, we simply did not know what to expect. We were worried that cell types at day 8 might be too transient, but longer culture can also introduce biases. That is why we wanted to look at two time points; however, as mentioned above, the differences are a matter of degree.

      Concerning the cell type composition: yes, day 16 EBs are more heterogeneous than day 8 EBs. Firstly, older EBs have more distinguishable cell types and hence even if all EBs had identical composition, the sampling variance would be higher given that we sampled a similar number of cells from both time points. Secondly, in order to grow EBs for a longer time, we moved them from floating to attached culture on day 8 and it is unclear how much variance is added by this extra handling step.

      Are markers for earlier developmental progenitors better/more transferable than those for more derived cell types?

      We did not see any differences in the marker conservation between early and late cell types, but we have too little data to say whether this carries biological meaning.

      Author response image 1.

      Pseudotime analysis for a differentiation trajectory towards neurons. Single cells were first aggregated into metacells per species using SEACells (Persad et al. 2023). Pluripotent and ectoderm metacells were then integrated across all four species using Harmony and a combined pseudotime was inferred with Slingshot (Street et al. 2018), specifying iPSCs as the starting cluster. Here, lineage 3 is shown, illustrating a differentiation towards neurons. (A) PHATE embedding colored by pseudotime (Moon et al. 2019). (B) PHATE embedding colored by celltype. (C) Pseudotime distribution across the sampling timepoints (day 8 and day 16) in different species.

      (2) Closely tied to the point above, by necessity the authors collapse their data into seven fairly coarse cell types and then examine the performance of canonical marker genes (as well as those discovered de novo) across the species. However some of the clusters they use are somewhat broad, and so it is worth asking whether the lack of specificity exhibited by some marker genes and driving their conclusions is driven by inter-species heterogeneity within a given cluster.

      Author response image 2.

      UMAP visualization for the Harmony-integrated dataset across all four species for the seven shared cell types, colored by cell type identity (A) and species (B).

      Good point. If we understand correctly, the concern is that in our relatively broadly defined cell types, species are not well mixed and that this in turn is partly responsible for marker gene divergence. This problem is indeed difficult to address, because most approaches to evaluate this require integration across species, which might lead to questionable results (see our Discussion).

      Nevertheless, we attempted an integration across all four species. To this end, we subset the cells for the 7 cell types that we found in all four species and visualized cell types and species in the UMAPs above (Author response image 2).

      We see that cardiac fibroblasts appear poorly integrated in the UMAP, but they still have very transferable marker genes across species. We quantified integration quality using the cell-specific mixing score (cms) (Lütge et al. 2021) and indeed found that the proportion of well-integrated cells is lowest for cardiac fibroblasts (Author response image 3A). On the other end of the cms spectrum, neural crest cells appear to have the best integration across species, but their marker transferability between species is rather worse than for cardiac fibroblasts (Supplementary Figure 9). The rank-biased overlap scores that we use for marker gene conservation, calculated per cell type, show the same trends (Author response image 3B) as the F1 scores for marker gene transferability. Hence, given our current dataset, we do not see any indication that the low marker gene conservation is a result of too broadly defined cell types.

      Author response image 3.

      (A) Evaluation of species mixing per cell type in the Harmony-integrated dataset, quantified by the fraction of cells with an adjusted cell-specific mixing score (cms) above 0.05. (B) Summary of rank-biased overlap (RBO) scores per cell type to assess concordance of marker gene rankings for all species pairs.
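
      For readers who want to reproduce the two summary metrics in Author response image 3, the following is a minimal Python sketch, not the code used for the figure. It assumes the truncated (non-extrapolated) form of rank-biased overlap with a persistence parameter p, and a strict "above 0.05" cutoff for the cms. The function names, the default p = 0.9, and the toy inputs are illustrative assumptions.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicate items.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the fraction
    of items shared by the top-d prefixes and k is the shorter list length.
    The default p is an assumption, not a value taken from the manuscript.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0  # number of items shared by the two current prefixes
    score = 0.0
    for d in range(1, depth + 1):
        item_a, item_b = ranking_a[d - 1], ranking_b[d - 1]
        if item_a == item_b:
            overlap += 1
        else:
            # A new item counts once it has already appeared in the other list.
            if item_a in seen_b:
                overlap += 1
            if item_b in seen_a:
                overlap += 1
        seen_a.add(item_a)
        seen_b.add(item_b)
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


def fraction_well_mixed(cms_values, threshold=0.05):
    """Fraction of cells whose adjusted cms is strictly above the threshold."""
    if not cms_values:
        raise ValueError("cms_values must not be empty")
    return sum(1 for value in cms_values if value > threshold) / len(cms_values)


if __name__ == "__main__":
    # Toy marker rankings for one cell type in two species (hypothetical genes).
    human = ["GENE1", "GENE2", "GENE3", "GENE4"]
    macaque = ["GENE2", "GENE1", "GENE5", "GENE3"]
    print(rank_biased_overlap(human, macaque))
    print(fraction_well_mixed([0.01, 0.20, 0.06, 0.05]))
```

      With identical rankings of length k, the truncated score equals 1 - p^k rather than 1, so scores are best compared between rankings of the same depth.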

      Reviewer #2 (Public review):

      Summary:

      The authors present an important study on identifying and comparing orthologous cell types across multiple species. This manuscript focuses on characterizing cell types in embryoid bodies (EBs) derived from induced pluripotent stem cells (iPSCs) of four primate species, humans, orangutans, cynomolgus macaques, and rhesus macaques, providing valuable insights into cross-species comparisons.

      Strengths:

      To achieve this, the authors developed a semi-automated computational pipeline that integrates classification and marker-based cluster annotation to identify orthologous cell types across primates. This study makes a significant contribution to the field by advancing cross-species cell type identification.

      We thank the reviewer for their positive and thoughtful feedback.

      Weaknesses:

      However, several critical points need to be addressed.

      (1) Use of Liftoff for GTF Annotation

      The authors used Liftoff to generate GTF files for Pongo abelii, Macaca fascicularis, and Macaca mulatta by transferring the hg38 annotation to the corresponding primate genomes. However, it is unclear why they did not use species-specific GTF files, as all these genomes have existing annotations. Why did the authors choose not to follow this approach?

      As Reviewer 1 also points out, we have observed that the annotations of non-human primates often have truncated 3’UTRs. This is especially problematic for 3’ UMI transcriptome data such as the 10x dataset that we present here. To illustrate this, we compared the Liftoff annotation derived from Gencode v32, which we used throughout our manuscript, to the Ensembl gene annotation Macaca_fascicularis_6.0.111. We used transcriptomes from human and cynomolgus iPSC bulk RNA-seq (Kliesmete et al. 2024) generated with the Prime-seq protocol (Janjic et al. 2022), which is very similar to 10x in that it also uses 3’ UMIs. On average, using Liftoff produces higher counts than the Ensembl annotation (Author response image 4A). Moreover, when comparing across species, using Ensembl for the macaque leads to an asymmetry in differentially expressed genes, with apparently many more up-regulated genes in humans. In contrast, when we use the Liftoff annotation, we detect fewer DE genes, and a similar number of genes is up-regulated in macaques as in humans (Author response image 4B). We think that the many additional DE genes are artifacts due to mismatched annotation in human and cynomolgus macaques. We illustrate this for the transcription factor SALL4 in Author response image 4C,D. The Ensembl annotation reports 2 transcripts, while Liftoff from Gencode v32 suggests 5 transcripts, one of which has a longer 3’UTR. This longer transcript is also supported by Nanopore data from macaque iPSCs. The truncation of the 3’UTR in this case leads to underestimation of the expression of SALL4 in macaques, and hence SALL4 is detected as up-regulated in humans (DESeq2: LFC = 1.34, p-adj < 2e-9). In contrast, when using the Liftoff annotation, SALL4 does not appear to be DE between humans and macaques (LFC = 0.33, p-adj = 0.20).

      Author response image 4. 

      (A) UMI counts per gene for the same cynomolgus macaque iPSC samples. On the x-axis, the GTF file from Ensembl Macaca_fascicularis_6.0.111 was used to count, and on the y-axis we used our filtered Liftoff annotation that transferred the human gene models from Gencode v32. (B) The number of DE genes between human and cynomolgus iPSCs detected with DESeq2. For "Liftoff", we counted human samples using Gencode v32 and compared them to counts based on the Liftoff annotation of the same human gene models transferred to macFas6. For "Ensembl", we used Gencode v32 for the human and Ensembl Macaca_fascicularis_6.0.111 for the macaque. For both comparisons, we subset the genes to one-to-one orthologues as annotated in BioMart. Up- and down-regulation is relative to human expression. (C) Read counts for one example gene, SALL4. Here, in addition to the Liftoff and Ensembl annotations, we also used transcripts derived from Nanopore cDNA sequencing of cynomolgus iPSCs. (D) Gene models for SALL4 in macFas6 coordinates and coverage for iPSC Prime-seq bulk RNA sequencing.

      (2) Transcript Filtering and Potential Biases

      The authors excluded transcripts with partial mapping (<50%), low sequence identity (<50%), or excessive length differences (>100 bp and >2× length ratio). Such filtering may introduce biases in read alignment. Did the authors evaluate the impact of these filtering choices on alignment rates?

      We excluded those transcripts from analysis in both species, because they present a convolution of sequence-annotation differences and expression. The focus of our study is on regulatory evolution, and we knowingly omit marker differences that are due to a marker being mutated away. We will make this clearer in the text of a revised version.
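
      The filtering criteria quoted by the reviewer can be written down as a simple predicate. The sketch below is an illustration under stated assumptions, not the pipeline code: the record type `LiftedTranscript`, its field names, and the reading of the length criterion (both a difference above 100 bp and a ratio above 2 are required to exclude a transcript) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LiftedTranscript:
    """Hypothetical record describing one transcript after annotation transfer."""
    transcript_id: str
    coverage: float       # fraction of the source transcript that mapped (0 to 1)
    identity: float       # sequence identity of the mapped part (0 to 1)
    source_length: int    # transcript length in the source (human) annotation, bp
    lifted_length: int    # transcript length in the target genome, bp


def passes_filter(t, min_coverage=0.5, min_identity=0.5,
                  max_length_diff=100, max_length_ratio=2.0):
    """Return True if the transcript is kept under the thresholds quoted above."""
    if t.coverage < min_coverage:      # partial mapping (<50%)
        return False
    if t.identity < min_identity:      # low sequence identity (<50%)
        return False
    longer = max(t.source_length, t.lifted_length)
    shorter = min(t.source_length, t.lifted_length)
    length_diff = longer - shorter
    length_ratio = longer / shorter if shorter > 0 else float("inf")
    # Assumption: both length conditions must hold for exclusion.
    if length_diff > max_length_diff and length_ratio > max_length_ratio:
        return False
    return True


if __name__ == "__main__":
    examples = [
        LiftedTranscript("tx_ok", 0.90, 0.95, 1000, 1050),
        LiftedTranscript("tx_partial", 0.40, 0.95, 1000, 1000),
        LiftedTranscript("tx_length", 0.90, 0.90, 100, 350),
    ]
    kept = [t.transcript_id for t in examples if passes_filter(t)]
    print(kept)
```

      Applying the same predicate to the transcript set used for every species keeps the compared gene models matched, which is the stated purpose of the exclusion.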

      (3) Data Integration with Harmony

      The methods section does not specify the parameters used for data integration with Harmony. Including these details would clarify how cross-species integration was performed.

      We want to stress that none of our conservation and marker gene analyses relies on cross-species integration. We only used the Harmony-integrated data for visualisation in Figure 1 and for the rough germ-layer check in Supplementary Figure S3. We will add a better description in the revised version.

      References

      Janjic, Aleksandar, Lucas E. Wange, Johannes W. Bagnoli, Johanna Geuder, Phong Nguyen, Daniel Richter, Beate Vieth, et al. 2022. “Prime-Seq, Efficient and Powerful Bulk RNA Sequencing.” Genome Biology 23 (1): 88.

      Kliesmete, Zane, Peter Orchard, Victor Yan Kin Lee, Johanna Geuder, Simon M. Krauß, Mari Ohnuki, Jessica Jocher, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2024. “Evidence for Compensatory Evolution within Pleiotropic Regulatory Elements.” Genome Research 34 (10): 1528–39.

      Lütge, Almut, Joanna Zyprych-Walczak, Urszula Brykczynska Kunzmann, Helena L. Crowell, Daniela Calini, Dheeraj Malhotra, Charlotte Soneson, and Mark D. Robinson. 2021. “CellMixS: Quantifying and Visualizing Batch Effects in Single-Cell RNA-Seq Data.” Life Science Alliance 4 (6): e202001004.

      Moon, Kevin R., David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, et al. 2019. “Visualizing Structure and Transitions in High-Dimensional Biological Data.” Nature Biotechnology 37 (12): 1482–92.

      Persad, Sitara, Zi-Ning Choo, Christine Dien, Noor Sohail, Ignas Masilionis, Ronan Chaligné, Tal Nawy, et al. 2023. “SEACells Infers Transcriptional and Epigenomic Cellular States from Single-Cell Genomics Data.” Nature Biotechnology 41 (12): 1746–57.

      Street, Kelly, Davide Risso, Russell B. Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2018. “Slingshot: Cell Lineage and Pseudotime Inference for Single-Cell Transcriptomics.” BMC Genomics 19 (1): 477.

    2. eLife Assessment

      The authors have generated important resources such as a reference dataset of early primate development by utilizing single-cell transcriptomic technology together with induced pluripotent stem cells (iPSCs) from four primate species: humans, orangutans, cynomolgus macaques, and rhesus macaques. By analyzing marker gene expression and cell types across species during undirected differentiation of iPSCs, the authors provide solid evidence that the transferability of marker genes decreases as the evolutionary distance between species increases. This work demonstrates the extended usage of iPSCs for broader fields, which will benefit several scientific communities including anthropology, comparative biology, and evolutionary biology.

    3. Reviewer #1 (Public review):

      Summary:

      Jocher, Janssen, et al examine the robustness of comparative functional genomics studies in primates that make use of induced pluripotent stem cell-derived cells. Comparative studies in primates, especially amongst the great apes, are generally hindered by the very limited availability of samples, and iPSCs, which can be maintained in the laboratory indefinitely and differentiated into other cell types, have emerged as promising model systems because they allow the generation of data from tissues and cells that would otherwise be unobservable.

      Undirected differentiation of iPSCs into many cell types at once, using a method known as embryoid body differentiation, requires researchers to manually assign all cell types in the dataset so they can be correctly analysed. Typically, this is done using marker genes associated with a specific cell type. These are defined a priori, and have historically tended to be characterised in mice and humans and then employed to annotate other species. Jocher, Janssen, et al ask if the marker genes and features used to define a given cell type in one species are suitable for use in a second species, and then quantify the degree of usefulness of these markers. They find that genes that are informative and cell type specific in a given species are less valuable for cell type identification in other species, and that this value, or transferability, drops off as the evolutionary distance between species increases.

      This paper will help guide future comparative studies of gene expression in primates (and more broadly) as well as add to the growing literature on the broader challenges of selecting powerful and reliable marker genes for use in single-cell transcriptomics.

      Strengths:

      Marker gene selection and cell type annotation is a challenging problem in scRNA studies, and successful classification of cells often requires manual expert input. This can be hard to reproduce across studies, as, despite general agreement on the identity of many cell types, different methods for identifying marker genes will return different sets of genes. The rise of comparative functional genomics complicates this even further, as a robust marker gene in one species need not always be as useful in a different taxon. The finding that so many marker genes have poor transferability is striking, and by interrogating the assumption of transferability in a thorough and systematic fashion, this paper reminds us of the importance of systematically validating analytical choices. The focus on identifying how transferability varies across different types of marker genes (especially when comparing TFs to lncRNAs), and on exploring different methods to identify marker genes, also suggests additional criteria by which future researchers could select robust marker genes in their own data.

      The paper is built on a substantial amount of clearly reported and thoroughly considered data, including EBs and cells from four different primate species - humans, orangutans, and two macaque species. The authors go to great lengths to ensure the EBs are as comparable as possible across species, and take similar care with their computational analyses, always erring on the side of drawing conservative conclusions that are robustly supported by their data over more tenuously supported ones that could be impacted by data processing artefacts such as differences in mappability, etc. For example, I like the approach of using liftoff to robustly identify genes in non-human species that can be mapped to and compared across species confidently, rather than relying on the likely incomplete annotation of the non-human primate genomes. The authors also provide an interactive data visualisation website that allows users to explore the dataset in depth, examine expression patterns of their own favourite marker genes and perform the same kinds of analyses on their own data if desired, facilitating consistency between comparative primate studies.

      Weaknesses and recommendations:

      (1) Embryoid body generation is known to be highly variable from one replicate to the next for both technical and biological reasons, and the authors do their best to account for this, both by their testing of different ways of generating EBs, and by including multiple technical replicates/clones per species. However, there is still some variability that could be worth exploring in more depth. For example, the orangutan seems to have differentiated preferentially towards cardiac mesoderm whereas the other species seemed to prefer ectoderm fates, as shown in Figure 2C. Likewise, Supplementary Figure 2C suggests a significant imbalance in the contributions across replicates within a species, which is not surprising given the nature of EBs, while Supplementary Figure 6 suggests that despite including three different clones from a single rhesus macaque, most of the data came from a single clone. The manuscript would be strengthened by a more thorough exploration of the intra-species patterns of variability, especially for the taxa with multiple biological replicates, and how they impact the number of cell types detected across taxa, etc.

      The same holds for the temporal aspect of the data, which is not really discussed in depth despite being a strength of the design. Instead, days 8 and 16 are analysed jointly, without much attention being paid to the possible differences between them. Are EBs at day 16 more variable between species than at day 8? Is day 8 too soon to do these kinds of analyses? Are markers for earlier developmental progenitors better/more transferable than those for more derived cell types?

      (2) Closely tied to the point above, by necessity the authors collapse their data into seven fairly coarse cell types and then examine the performance of canonical marker genes (as well as those discovered de novo) across the species. However some of the clusters they use are somewhat broad, and so it is worth asking whether the lack of specificity exhibited by some marker genes and driving their conclusions is driven by inter-species heterogeneity within a given cluster.

    4. Reviewer #2 (Public review):

      Summary:

      The authors present an important study on identifying and comparing orthologous cell types across multiple species. This manuscript focuses on characterizing cell types in embryoid bodies (EBs) derived from induced pluripotent stem cells (iPSCs) of four primate species, humans, orangutans, cynomolgus macaques, and rhesus macaques, providing valuable insights into cross-species comparisons.

      Strengths:

      To achieve this, the authors developed a semi-automated computational pipeline that integrates classification and marker-based cluster annotation to identify orthologous cell types across primates. This study makes a significant contribution to the field by advancing cross-species cell type identification.

      Weaknesses:

      However, several critical points need to be addressed.

      (1) Use of Liftoff for GTF Annotation

      The authors used Liftoff to generate GTF files for Pongo abelii, Macaca fascicularis, and Macaca mulatta by transferring the hg38 annotation to the corresponding primate genomes. However, it is unclear why they did not use species-specific GTF files, as all these genomes have existing annotations. Why did the authors choose not to follow this approach?

      (2) Transcript Filtering and Potential Biases

      The authors excluded transcripts with partial mapping (<50%), low sequence identity (<50%), or excessive length differences (>100 bp and >2× length ratio). Such filtering may introduce biases in read alignment. Did the authors evaluate the impact of these filtering choices on alignment rates?

      (3) Data Integration with Harmony

      The methods section does not specify the parameters used for data integration with Harmony. Including these details would clarify how cross-species integration was performed.

    1. eLife Assessment

      This is an important study that establishes how anti-sense oligonucleotides degrading a specific target protein called EMC10 can rescue neuronal function in models of chromosome 22q11.2 deletions. The authors use human iPSC-derived neurons and a mouse model to provide compelling data for the rescue of cellular and cognitive features of 22q11.2 phenotypes upon ASO regulation of EMC10. These pre-clinical data are of interest because they support reduction of EMC10 as a promising therapeutic strategy.

    2. Reviewer #1 (Public review):

      Summary:

      This is an important and very well-presented set of experiments following up on prior work from the lab investigating knock-down (KD) of EMC10 in restoration of neuronal and cognitive deficits in 22q11.2 Del models, now including both human iPSCs and a mouse model in vivo with ASOs.

      The valuable progress in this current manuscript is the development of ASOs, and the proof of efficacy in vivo in mouse of the ASO in knock-down of EMC10 and amelioration of in vivo behavioral phenotypes.

      The experiments include: iPSC studies demonstrating elevations of EMC10 in a solid collection of paired iPSC lines. These studies also provide evidence of manipulation of EMC10 by overexpression and inhibition of miRNAs that exist in the 22q11 interval. The iPSC studies also nicely demonstrate rescue of impairments with KD of EMC10 in neuronal arborization as well as KCl induced neuronal activity. The major in vivo contributions reflect impressive demonstration of efficacy of two ASOs in vivo on both KD of EMC10 in vivo and through improvement in behavioral abnormalities in the 22q11 mouse in a range of different behaviors, including social behavior and learning behaviors.

      Overall, there are many strengths reflected in this study, including in particular the synergy between in vitro studies in human cell models and in vivo studies in the well characterized mouse model. The experiments are generally rigorously performed and well powered, and nicely presented. The claims with regard to the mechanisms of EMC10 elevations and the importance of restoration of EMC10 expression to neuronal morphology and behavior are well supported by the data. The work may be further supported in future studies by investigating whether ASOs rescue circuit dysfunction in vivo or ex vivo through electrophysiology in the mouse model. Future studies could also investigate the mechanism by which EMC10, an ER protein involved in protein processing, contributes to the observed neuronal abnormalities; however, these are clearly matters for future investigation.

      The potential impact of the work is found in the potential value of the ASO approach to the treatment of 22q11, or the pre-clinical evidence that knock-down of this protein may lead to some amelioration of cognitive symptoms. Overall, a very convincing and complementary set of experiments to support EMC10 KD as a therapeutic strategy.

      Review of revision: The authors have addressed the questions from the prior review.

    3. Author response:

      The following is the authors’ response to the original reviews.

      We appreciate that both reviewers found our findings significant and recognized the strength of the presented data in demonstrating the potential value of ASO-mediated Emc10 expression modulation for treating 22q11.2DS. We are grateful for the reviewers' valuable input and constructive suggestions, which we believe have significantly strengthened our manuscript. Below, we address the main points and concerns, followed by our point-by-point responses:

      Evaluation of ASO-Mediated Emc10 Reduction: We appreciate the feedback and the opportunity to clarify this point. While we agree that ASO-mediated reduction of Emc10 should ideally be evaluated at both the mRNA and protein levels, we would like to emphasize that this was indeed performed in our study. Specifically, we conducted both qRT-PCR and Western Blot (WB) assays on the same animal cohort, focusing on the left and right hippocampus (rather than the PFC) following ASO injection (see Figure S11C and D). We prioritized the hippocampus for the WB assay because our primary behavioral assays and observed phenotypes in this study are strongly hippocampus-centric. This approach reflects our aim to investigate Emc10's role in the brain regions most relevant to the observed phenotypes. We hope this clarification addresses the reviewer’s concerns. While protein-level analysis would ideally complement RNA measurements, the Emc10 antibodies available were suboptimal in specificity and sensitivity, requiring substantial optimization. Additionally, challenges in obtaining sufficient high-quality protein from small regions like the hippocampus limited the use of protein detection as a standalone method. We plan to refine antibody protocols or explore alternative methods in future work. Notably, in all instances where we performed parallel protein and RNA measurements in both mouse brain tissue and human cell lines, there was excellent concordance between the datasets, strongly suggesting that mRNA levels are a reliable indicator of Emc10 protein levels in our model.

      ASO Neuronal Uptake: While ASO uptake by neurons in the brain can vary considerably depending on factors such as ASO chemistry, delivery method, target brain region, and cell type, our targeted delivery approach, ASO design optimization, and ASO screening strategy were specifically tailored to achieve uniform and efficient uptake across hippocampal and cortical regions, in both neurons and glia. The figures included in our manuscript at both low and high magnification (see Figure S14A) clearly display the extensive (over 97%) overlap of ASO-positive cells (green signal) with cells expressing the neuronal marker NeuN (red signal). While quantifying ASO-positive cells in different brain regions could add value, the robust diffusion of ASO into neurons and glia is effectively demonstrated in the current figures and indirectly supported by the robust downregulation of Emc10 in ASO-treated animals as shown by qRT-PCR assays of hippocampal and cortical brain regions.

      Transcriptomic Data in Mutant EMC10 NGN2-iNs: Reduction in EMC10 levels is not expected to directly affect transcription or to broadly reorganize the differential gene expression profile of the Q6/Q5 patient/control NGN2-iN lines. Accordingly, our transcriptional profiling was not designed to assess the direct impact of EMC10 deficiency on gene expression but rather to serve as an indirect measure of cellular pathways affected by the reduction in EMC10 levels in the patient Q6 line. We aimed to identify genes and related functional pathways differentially expressed between the Q6/Q5 patient/control lines, where these expression differences are either abolished or significantly attenuated in Q6/EMC10<sup>HET</sup> or Q6/EMC10<sup>HOM</sup> NGN2-iNs.

      Statistical Analysis: We have meticulously reviewed all statistical analyses in the manuscript to ensure their appropriateness and adherence to established practices. For Figure S2, we acknowledge that the statistical details were not fully specified in the figure legend, though they are provided for each miRNA in Supplemental Table S2. In the revised manuscript, we ensured that the statistical methods and corresponding values are clearly indicated for each comparison.

      We are confident that the revisions outlined above, along with the point-by-point responses provided below, will significantly strengthen our manuscript and address all the concerns raised by the reviewers. We would like to express our sincere thanks to the reviewers for their valuable feedback and constructive suggestions.

      Reviewer #1 (Recommendations For The Authors):

      My comments here are generally limited to minor comments that reflect possible small additions or edits to the manuscript:

      (1) Panel 1A is very small. Please consider making that bigger as space permits.

      We have increased the panel size of Figure 1A in the revised manuscript to improve its visibility and clarity.

      (2) Are you able to identify the dot that represents EMC10 in panel 1C? I understand that EMC10 is represented in Supplementary Figure 4A.

      We appreciate the reviewer's observation. In Figure 1C, the volcano plot depicts differentially expressed miRNAs in the Q5/Q6 neuronal samples, as identified through miRNA-sequencing. EMC10, as a protein-coding gene and a downstream target of miRNA dysregulation, is therefore not included in this analysis. However, as the reviewer correctly notes, EMC10 gene expression is represented in the volcano plot in Supplementary Figure 4A, which displays differentially expressed genes identified through bulk RNA-seq analysis of the same neuronal samples. To avoid any confusion, we have clarified the title of Figure 1C to emphasize that it represents miRNA expression changes.

      (3) With regard to studies using iPSC. Some of the studies are executed across multiple distinct pairs and some are only done in a single pair. Overall, while results are coherent and often complementary, would it be valuable for the authors to comment on experiments where studies in multiple pairs seemed particularly important, or others wherein it was less important?

      We thank the reviewer for this insightful question regarding our use of multiple versus single hiPSC pairs. Our investigation began with the Q5/Q6 sibling (dizygotic twin) pair, which shares the most similar genetic background. This minimized the impact of confounding genetic factors and provided a robust foundation for testing our hypothesis that EMC10 upregulation, driven by miRNA dysregulation, is a key consequence of the 22q11.2 deletion in human neurons, thus validating our previous findings from the Df(16)A<sup>+/-</sup> mouse model (Stark et al., 2008; Xu et al., 2013). To ensure the generalizability of our findings, we incorporated additional hiPSC lines from another sibling pair as well as a case/control pair, demonstrating that EMC10 upregulation is a consistent feature of 22q11.2DS. Subsequently, we focused on the well-matched Q5/Q6 pair for detailed morphological, functional, and genetic rescue experiments. This approach allowed us to perform in-depth studies while controlling for potential genetic confounders. By using both multiple and single hiPSC pairs, we balanced the need for generalizable findings with the practical considerations of conducting technically complex and resource-intensive experiments. This strategy enabled us to provide both broad and detailed insights into the mechanisms underlying 22q11.2DS. We have modified the introductory paragraph of the Results section to better highlight this issue.

      (4) While the majority of the experiments seem sufficiently powered to test the hypothesis in question in the iPSC studies, Figure 2B raises the question if the study replicates here were underpowered, and perhaps the authors might consider mentioning this, although this is a very minor comment.

      We thank the reviewer for raising this point. We acknowledge that the statistical power to detect a significant difference in pre-miR-485 levels in Figure 2B may be limited due to the relatively small sample size and the inherent variability in hiPSC-derived neuronal cultures. However, it is important to emphasize that the functional impact of miRNAs is primarily mediated by their mature transcript forms. Our miRNA-seq data (Supplementary Table 2 and Figure S2) did not show significant alterations in the levels of mature miR-485-5p or miR-485-3p. This finding aligns with the reported expression pattern of miR-485 in hiPSC-derived neurons, where relatively low levels are observed in early neuronal development, with increased expression occurring in older, more mature neurons (Soutschek et al. 2023; https://ethz-ins.org/igNeuronsTimeCourse/ database from the Institute of Neurogenomics, ETH Zurich). This database provides a valuable resource for examining gene expression dynamics during human neuronal differentiation. Given that our hiPSC-derived neurons were analyzed at a relatively early developmental stage (DIV8 for these experiments), it is likely that miR-485 expression had not yet reached levels sufficient to reveal significant differences. While we acknowledge the potential limitation in statistical power for detecting subtle changes in pre-miR-485 levels, the combined evidence suggests that miR-485 may not be a significant contributor to the observed phenotypes at this developmental stage.

      A paragraph has been added in the corresponding Results section to address this issue.

      (5) There are a few situations where the authors could help out the reader a little bit by providing more labels on the figures directly. For example: in Figure 2, there are expression levels, over-expression, and inhibition of miRNA but the X-axis is named with similar labels for the miRNAs in question for each of these distinct experiments. If the authors want to help the reader, they may consider labeling these panels with a descriptive title to reflect the experiment being done or use more descriptive terms in the X-axis panels. Again, this is minor. Similarly, in Figure 5, it might be helpful for the authors to help out the reader again with more labels on the panels, such as in Figures 5B, 5C, and 5D. Would they consider labeling these panels, HPC, PFC, SSC with the brain location as they did in Figure 4?

      We thank the reviewer for these helpful suggestions to improve the clarity of our figures. We have implemented the proposed changes. In Figure 2C-E, we have added specific titles to the panels to clearly distinguish between the different experimental conditions such as miRNA overexpression and inhibition. Similarly, in Figure 5, we labeled panels 5B, 5C, and 5D with the brain regions analyzed (HPC, PFC, SSC) to match the labeling used in Figure 4. We believe these revisions enhance the readability and overall interpretability of the figures, making it easier for readers to follow the experiments and results.

      (6) Figure 3: There is some overshoot of the data in EMC10 homozygous null, in panel 3E, and also, overshoot of the het in panel 3H. Would there be value in the authors commenting on the potential basis for this in the discussion? Some issues are minor, such as the lack of electrophysiological analysis of circuits in vivo or in ex vivo slices that may further support the proposed rescue.

      The reviewer correctly highlights the observation in Figures 3E and 3H, where the number of branch points in the Q6/EMC10<sup>HOM</sup> line exceeds wildtype levels and the calcium response in the Q6/EMC10<sup>HET</sup> and Q6/EMC10<sup>HOM</sup> lines surpasses that of the control. This overshoot is indeed intriguing and warrants discussion. EMC10 is part of the ER Membrane Complex (EMC), which plays a critical role in the proper folding and localization of various membrane proteins, including neurotransmitter receptors and ion channels such as voltage-gated calcium channels (Chitwood et al., 2018; Shurtleff et al., 2018; Chitwood and Hegde, 2019). In the context of the 22q11.2 deletion, EMC10 dysregulation may disrupt the proper localization of these proteins at the synapse, affecting both dendritic morphology and calcium signaling. The precise basis of this overshoot remains unclear. The overshoot may result from a dosage-sensitive inhibitory effect of Emc10, where both reduced and increased expression alter normal neuronal processes, with excessive responses potentially triggered upon gene restoration by the mutant system’s adaptation to dysfunction, leading to altered receptor sensitivity or signaling dynamics. This underscores the critical importance of precise Emc10 expression for proper neuronal development and function, in line with previous findings suggesting that EMC10 plays an auxiliary or modulatory role in EMC function. A short comment on the potential basis for this overshoot has been added in the corresponding Results section of the manuscript. Regardless of the underlying mechanisms, these findings emphasize the importance of precise titration of ASO constructs, rigorous gene dosage controls, and thorough analysis of context-specific responses to ensure both efficacy and safety in clinical applications.

      We also agree with the reviewer that electrophysiological studies, particularly in the 22q11.2 deletion mouse model, would provide valuable insights into the impact of EMC10 modulation by ASOs on neuronal activity and circuit function at the in vivo and ex vivo levels. Incorporating such experiments into future studies will allow us to assess synaptic transmission and plasticity, contributing to a more comprehensive understanding of the therapeutic potential of ASO-mediated EMC10 modulation in 22q11.2DS.

      (7) Did the authors take out the behavior studies further than 9 weeks? Would the authors consider commenting on what they speculate might be the duration of the treatment effect? For both mice and definitely humans.

      We thank the reviewer for raising the important question regarding the duration of the ASO treatment effect, which is crucial for translating our findings into clinically relevant therapies. While behavioral studies beyond 9 weeks were not conducted in this study, our in vivo experiments and findings from prior publications (detailed below) enable an informed speculative assessment.

      We utilized 2'-O-methoxyethyl (2'-MOE) modified ASOs, known for their enhanced binding affinity, nuclease resistance, and increased metabolic stability. In our in vivo post-injection screening of ASOs (Figure S13C), we predicted that Emc10 expression levels return to normal WT levels (~100%) approximately 26 weeks post-treatment in Emc10<sup>ASO</sup> (#1466182) treated mice. This prediction is supported by our Emc10 expression profiles across various brain regions, which demonstrate robust repression of Emc10 lasting up to 10 weeks post-administration (Figure 6D-F). While these findings suggest that the treatment effect in our model could extend significantly beyond 10 weeks following a single ASO injection, further empirical validation is required through extended follow-up studies. Encouragingly, long-term effects of 2'-MOE ASOs have been observed in other neurological disorders (Kordasiewicz et al., 2012; Scoles et al., 2017; Finkel et al., 2017; Darras et al., 2019). However, factors such as ASO distribution, target cell turnover, and disease-specific pathophysiology could influence the duration of the effect. To address these uncertainties, we have added a paragraph in the Discussion section emphasizing the need for additional studies, including extended follow-up periods and eventual clinical trials, to determine the specific duration of effect for our Emc10<sup>ASO</sup> constructs in treating 22q11.2DS.

      Reviewer #2 (Recommendations For The Authors):

      (1) It is acknowledged that the iPSC-derived cells in Figure 1 are no longer progenitors, but differentiation markers for astrocytes and glia are also needed in Figure 1b to establish that equal rates of differentiation have occurred across genotypes.

      We thank the reviewer for raising this important point about ensuring equal rates of differentiation across genotypes. As the reviewer notes, we employed a well-established protocol for directed differentiation of hiPSCs into cortical neurons using a combination of small molecule inhibitors, as previously described by Qi et al. (2017). This protocol has been extensively validated and is known to robustly generate cortical neurons while actively suppressing glial differentiation, as evidenced by the lack of upregulation of glial markers such as GFAP, AQP4, or OLIG2 in the original study. Given the established neuronal specificity of this protocol and our focus on neuronal phenotypes, we prioritized the confirmation of successful neuronal differentiation using the established neuronal markers TUJ1 and TBR1. Therefore, additional markers for astrocytes and glia are not included in this figure, as we did not expect significant glial differentiation under these conditions. A sentence has been added in the corresponding Results section to address this issue.

      (2) For the RNA-seq experiments outlined in Figures 3J and K, a more comprehensive analysis is needed of the genes disrupted in the parental Q6 line relative to the het and homo lines. What percent are rescued, unaffected, vs uniquely disrupted?

      Reduction in EMC10 levels is not expected to directly affect transcription or broadly reorganize the gene expression profile of the Q6/Q5 NGN2-iN lines. Our transcriptional profiling was not designed to assess the direct impact of EMC10 deficiency on gene expression but rather to measure the cellular pathways affected by reduced EMC10 in the patient Q6 line. We identified genes differentially expressed between the Q6 (patient) and Q5 (control) lines, whose expression differences were either abolished or significantly attenuated ("rescued") in the Q6/EMC10<sup>HET</sup> or Q6/EMC10<sup>HOM</sup> lines. In the Q6/EMC10<sup>HET</sup> line, 237 DEGs (6%) were rescued, while in the Q6/EMC10<sup>HOM</sup> line, 382 DEGs (11%) were rescued. Importantly, further analysis revealed 103 shared rescued DEGs in these lines, which was statistically significant (enrichment factor = 1.7; p < 0.0001, based on a hypergeometric test). We added a new figure panel (Figure 3L) to visualize the significant overlap of rescued DEGs from the Q6/EMC10<sup>HET</sup> and Q6/EMC10<sup>HOM</sup> lines. This overlap suggests these genes play a critical role in biological pathways impacted by EMC10 levels, particularly in nervous system development, as indicated by our functional annotation analysis. We also performed protein-protein interaction (PPI) network analysis to explore the functional relationships among these 103 shared DEGs (Figure S8). Future studies will further investigate these gene sets to gain deeper insights into the molecular mechanisms underlying 22q11.2DS and the role of EMC10.
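
      The overlap statistic described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it computes an enrichment factor (observed overlap divided by the overlap expected by chance) and an upper-tail hypergeometric p-value using only the Python standard library. The function name `overlap_enrichment` and the background size `N_BACKGROUND` are assumptions introduced here. The response does not state the gene universe that was used, so the printed values are not expected to reproduce the reported enrichment factor or p-value.

```python
from math import comb


def overlap_enrichment(universe, set_a, set_b, shared):
    """Return (enrichment factor, upper-tail hypergeometric p-value).

    universe: size of the background gene set (an assumption of this sketch)
    set_a, set_b: sizes of the two rescued-DEG lists
    shared: number of genes present in both lists
    """
    # Overlap expected by chance if both lists were drawn at random.
    expected = set_a * set_b / universe
    factor = shared / expected

    # P(overlap >= shared) when set_b genes are drawn without replacement
    # from a universe that contains set_a "marked" genes.
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(shared, upper + 1)
    )
    return factor, tail / total


# List sizes and overlap are taken from the response (237, 382, 103).
# N_BACKGROUND is a hypothetical placeholder, not a value from the study.
N_BACKGROUND = 3500
factor, p_value = overlap_enrichment(N_BACKGROUND, 237, 382, 103)
print(f"enrichment factor = {factor:.2f}, p = {p_value:.3g}")
```

      Both outputs depend strongly on the background size, which is why the choice of gene universe should be reported alongside the test.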

      (3) The authors claim that 50% EMC10 loss in adult mice is safe and should be toned down. EMC10 knockout mice have motor, anxiety, and social phenotypes. It would be unique amongst highly dosage-sensitive genes (MeCP2, CDKL5, TCF4, FMR1, etc.) for there to only be a neurodevelopmental component. In all those cases, and others, the effects of over and under-expression are reversible into adulthood. Establishing the range in adults is critical to establishing therapeutic utility. Absent a detailed examination of non-cognitive phenotypes, this claim cannot be made.

      The reviewer raises an important point about the potential effects of EMC10 reduction in adult mice and the need to establish a safe therapeutic window by evaluating both cognitive and non-cognitive phenotypes. We agree that such a comprehensive evaluation is critical for assessing the safety and translational potential of Emc10-targeting therapies. While the International Mouse Phenotyping Consortium reported motor and anxiety phenotypes in homozygous Emc10 knockout mice, these data are unpublished and based on a relatively small number of animals. Furthermore, in our previous work (Diamantopoulou et al., 2017), we demonstrated that complete Emc10 loss does not impair cognition or social behavior, as assessed by prepulse inhibition (PPI), working memory (WM), and social memory (SM) assays (see Figure 3A-D; Diamantopoulou et al., 2017). Additionally, heterozygous Emc10 mice, which exhibit a ~50% reduction in Emc10 expression similar to that achieved with our ASO treatment, showed no evidence of motor deficits or anxiety-like behavior. Specifically, Emc10<sup>+/-</sup> mice displayed locomotor activity comparable to WT mice in the open field (OF) test (Figure S4A, Diamantopoulou et al., 2017). Moreover, genetic normalization of Emc10 expression in Df(16)A<sup>+/-</sup> mice demonstrated no signs of anxiety-like behavior, as assessed by the OF test (Figure S4A) and elevated plus maze (EPM) (Figure S4B; Diamantopoulou et al., 2017). To further support these findings, we have added new data to the current manuscript (see Figure S10J) showing that TAM treatment-mediated restoration of Emc10 levels in the brain of adult Df(16)A<sup>+/-</sup> mice did not affect the time that mutant mice spent in the center area of the OF, suggesting that Emc10 reduction does not influence anxiety-related behavior. These results suggest that a 50% reduction in EMC10 expression is unlikely to result in motor or anxiety-like phenotypes in adult mice.

      Finally, as noted in the manuscript, in addition to prior findings from animal models, a substantial number of relatively rare LoF variants or potentially damaging missense variants have been identified in the human EMC10 gene among likely healthy individuals in gnomAD, a database largely devoid of individuals known to be affected by severe neurodevelopmental disorders (NDDs).

      Nevertheless, the Discussion has been revised to underscore the importance of establishing a more detailed safety profile, including non-cognitive phenotypes, to fully validate the therapeutic potential of Emc10-targeting approaches. It also highlights the need for future studies to expand on these evaluations, addressing this critical aspect and laying a stronger foundation for advancing these findings into clinical drug development.

      (4) Supplemental Figure 10: The protein validation of Emc10 knockout following tamoxifen injection needs to be validated in all brain regions, not just the PFC. This is particularly important as the rest of the paper focuses on HPC-mediated phenotypes.

      First, we want to emphasize that we conducted both qRT-PCR and WB assays on the same animal cohort, specifically examining the left and right hippocampus following ASO injection (see Figure S11C and D). This approach is crucial, given the central role of hippocampus in the phenotypes investigated in our ASO-mediated Emc10 knockdown experiments.

      The reviewer raises an important point regarding the validation of EMC10 reduction at the protein level across all relevant brain regions using the Emc10 conditional knockout strain. We agree that such validation would ideally confirm the efficacy of our tamoxifen-induced knockout model comprehensively. However, we hope the reviewer appreciates that obtaining sufficient high-quality protein for WB analysis from smaller brain regions like the hippocampus poses a significant technical challenge. This difficulty is further compounded by the need to reserve the same samples for qRT-PCR to ensure consistency between mRNA and protein measurements. Importantly, our data from ASO-mediated Emc10 knockdown experiments (Figures S11C-D) demonstrate a clear and consistent correlation between reductions in Emc10 mRNA and protein levels in both the left and right hippocampus. Furthermore, in our constitutive Emc10-knockout mouse model (Diamantopoulou et al., 2017; see Figure S1A-B), we observed a strong agreement between mRNA and protein levels, supporting the reliability of mRNA data as a proxy for EMC10 protein levels in our experiments. Importantly, in all instances where we performed parallel protein and RNA measurements in human cell lines, there was excellent concordance between the datasets. Thus, while we acknowledge the limitations of relying primarily on mRNA data, we are confident that the Emc10 mRNA expression data in Figure S10 accurately reflect protein-level changes across brain regions in our conditional knockout model. To address this concern more fully in the future, we are working to refine antibody detection and optimize our protein extraction protocols to enable more routine and precise protein-level validation across smaller brain regions. We appreciate the reviewer’s feedback and will continue to refine our methodologies to strengthen the robustness of our findings.

      (5) Figure 3: 1 way ANOVA would be more appropriate to analyze the data in B-G than t-tests.

      We appreciate the suggestion of the reviewer. As mentioned above, we carefully selected statistical tests appropriate for each analysis. For Figure 3B-G, we chose to use pairwise t-tests to address specific hypotheses regarding the disease phenotype and rescue effects. This approach is consistent with prior experimental studies in the field, including our own (e.g., Xu et al., 2013; Figure 7H-I). Importantly, most of our t-tests yielded highly significant results (p < 0.001 or p < 0.01), reinforcing the robustness of our findings.
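
      The statistical trade-off behind this exchange can be illustrated with a short sketch. It is not part of the authors' analysis: it shows how the chance of at least one false positive grows when several pairwise tests are each run at the same significance level, and what a Bonferroni-adjusted threshold would be. The calculation assumes the tests are independent, and the number of comparisons used below (three) is a hypothetical value chosen for illustration. The function names are likewise introduced here.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate
    at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


# Hypothetical example: three pairwise comparisons at alpha = 0.05.
N_COMPARISONS = 3
print(familywise_error_rate(0.05, N_COMPARISONS))
print(bonferroni_threshold(0.05, N_COMPARISONS))
```

      Under these assumptions, results reported at p < 0.001 or p < 0.01 would remain below a Bonferroni-adjusted threshold for a small number of comparisons, whereas values close to 0.05 would not.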

      (6) Figure 5-6: Protein data is needed to complement the mRNA knockdown data.

      We agree with the reviewer on the importance of protein-level validation to complement the mRNA knockdown data. As mentioned in our response to Reviewer’s Comment (4), in all instances where we performed parallel protein and RNA measurements, either in mouse brain or human cell lines, we observed excellent concordance between the datasets. This supports the reliability of our mRNA data as a proxy for protein changes. Nevertheless, we acknowledge the value of including protein validation in future experiments and will consider incorporating it to further strengthen our findings.

      (7) The use of additional phenotypic measures is applauded in Figure 6; however, to appropriately interpret the data, more is needed. Shao et al 2021 (Figure S9) show data from the International Mouse Phenotyping Consortium claiming EMC10 KO mice have gait, activity, and anxiety phenotypes. All of these parameters could impact the SM assay and the Y-maze assay. Changes in SM interaction time could be linked to anxiety or motor impairments, but interpreted as cognitive deficits because these symptoms were not assessed. At a minimum, discussion is needed about this limitation, as well as the inclusion of distance explored in the SM and Y-maze assays.

      We thank the reviewer for their insightful comment regarding the potential influence of locomotor, gait, or anxiety phenotypes on the observed deficits in the SM and Y-maze assays. The behavioral phenotypes reported for Emc10 knockout mice by the International Mouse Phenotyping Consortium (https://www.mousephenotype.org/data/genes/MGI:1916933) were limited to homozygous female mice and based on a small sample size (4–6 females) compared to a larger WT control group. Moreover, these data are unpublished and thus challenging to evaluate fully. Importantly, no abnormal behaviors were reported for Emc10 heterozygous knockout mice in these datasets. Additionally, the claim by Shao et al. (2021) regarding cognitive impairments in Emc10 knockout mice based on our previous work (Diamantopoulou et al., 2017) is inaccurate.

      Our analysis of both the constitutive Emc10 knockout model (Diamantopoulou et al., 2017) and the current conditional Emc10 heterozygous knockout model consistently demonstrates that Emc10 reduction does not affect locomotor activity or anxiety-like behavior. In our earlier characterization of constitutive heterozygous Emc10 knockout mice (Emc10<sup>+/-</sup>), we observed no signs of anxiety-like behavior or motor impairments in OF assays (see Figure 2A-B and Figure S4A, Diamantopoulou et al., 2017). Similarly, results from Df(16)A<sup>+/-</sup> mice with genetically normalized Emc10 expression [Df(16)A<sup>+/-</sup>; Emc10<sup>+/-</sup>] also showed no indications of anxiety-like behavior or locomotor changes in the OF and EPM assays (see Figure S4A-B, Diamantopoulou et al., 2017). Consistent with these findings, our current data from Df(16)A<sup>+/-</sup> mice with conditional Emc10 reduction in the brain show no significant differences in locomotor activity and anxiety-related measures as assessed by OF assays (Figure S10J). Furthermore, total arm entries in Y-maze assays conducted in Df(16)A<sup>+/-</sup> mice treated with Emc10 ASOs were comparable to controls (Figures S14C and G-H), providing additional support for the conclusion that locomotor activity remains unaffected in these models.

      We further appreciate the reviewer’s suggestion that changes in social interaction time during the SM assay could be influenced by anxiety or motor impairments. However, we consider this scenario unlikely in our model. Interaction times during the first trial of the SM assay, which measures general social interest, are comparable between Df(16)A<sup>+/-</sup> mice with reduced Emc10 expression (either genetically or through ASO treatment) and WT controls (see Figures 4E, 5E, and S10G). These findings indicate that our mouse models do not exhibit inherent difficulties in initiating social interaction, as might be expected if motor impairments or heightened anxiety were present. Reduced social interaction is commonly used as a behavioral marker for anxiety in rodent studies (reviewed by Bailey and Crawley, Anxiety-Related Behaviors in Mice, 2009). “Anxious” mice typically exhibit decreased social interaction, spending less time engaging with other mice compared to non-anxious counterparts. However, the specific deficit we observe in the second trial of the SM assay—when mice are reintroduced to a familiar juvenile—is indicative of impaired social recognition memory, as previously documented for Df(16)A<sup>+/-</sup> mice (Piskorowski et al., 2016; Donegan et al., 2020). This deficit is distinct from the general social avoidance typically associated with heightened anxiety.

      Based on our comprehensive assessment of locomotor activity, anxiety-related behaviors, and social interaction, we conclude that the observed rescue of social memory and spatial memory deficits in mice with reduced Emc10 expression is most likely due to improved cognitive function rather than alterations in motor or anxiety-related domains.

      (8) For ASO optimization experiments, it is not sufficient to claim robust uptake. A quantitative measure is needed using the PO antibody showing what percentage of cells were positive for the ASO. Since the contention is that only Emc10 in excitatory neurons is important, it would be helpful if this also included a breakdown of ASO uptake in excitatory and inhibitory neurons and astrocytes.

      We thank the reviewer for highlighting the importance of quantifying ASO uptake and assessing cell-type specificity. To address this, we have added new data to the panel, as shown in the high-magnification images in Figure S14A. These images provide evidence that a large majority of NeuN-positive neurons exhibit a strong ASO signal. Specifically, we observed widespread ASO uptake (green) that extensively colocalized with the neuronal marker NeuN (red) in both the hippocampus and prefrontal cortex. Quantitative analysis of this overlap indicates that over 97% of NeuN-positive neurons were ASO-positive, demonstrating efficient neuronal uptake. This robust neuronal uptake aligns with the significant normalization of Emc10 levels and the behavioral improvements observed in ASO-treated Df(16)A<sup>+/-</sup> mice, further supporting the functional efficacy of our approach in modulating Emc10 expression within the relevant neuronal populations. Overall, the observed ASO uptake in neurons, as demonstrated by IHC, combined with RNA assays and the behavioral improvements in treated mice, strongly supports the efficacy of our approach in targeting Emc10 expression in the intended neuronal populations.

      (9) An interpretation is needed in Figure S3 as to why ~50% of the pathways increased are also present on the decreased list, i.e., G1/S transition, viral reproductive process, pos regulator of cell stress, etc. 4/10 GO terms are present in both increased and decreased groups in A and 7/10 in B.

      We thank the reviewer for pointing out the overlap between pathways enriched in both the upregulated and downregulated miRNA groups in Figure S3. This overlap likely reflects the complex nature of miRNA regulation, where individual miRNAs can target multiple genes within a pathway, and single genes can be regulated by multiple miRNAs, sometimes with opposing effects (reviewed in Bartel, 2009; Bartel, 2018). For example, in the “G1/S transition” pathway, upregulated miRNAs such as miR-92a-3p, miR-92b-3p, and miR-34a-5p may promote the transition by targeting cell cycle regulators like FBXW7, CDKN1C, and CDK6 (Zhou et al., 2015; Zhao et al., 2021; Oda et al., 2024). Conversely, downregulated miRNAs such as miR-143-3p and miR-200b are known to suppress the transition by targeting genes such as HK2 and GATA-4 (Zhou et al., 2015; Yao et al., 2013). Our analysis identified overlapping predicted target genes for both upregulated and downregulated miRNAs, supporting the notion that many genes are subject to complex regulation by multiple miRNAs with potentially synergistic or antagonistic effects. Thus, the enrichment of certain GO terms in both groups likely reflects this intricate interplay of miRNA-mediated gene regulation. Future investigations focusing on specific miRNA-target interactions within these pathways will be critical to fully elucidate the underlying mechanisms and better understand the functional consequences of these opposing regulatory effects.

      Minor Concerns:

      (1) Define SM before using it.

      We have defined the SM assay in the main text upon its first mention, where we describe the assay and its relevance to cognitive function (see page 11 of the revised manuscript).

      (2) Statistics have been run in Figure S2, but not presented. The text only states that the differences between groups are significant. Please add in.

      We have revised the legend of Figure S2 to include the specific statistical test used (Student's t-tests) and the corresponding p-values.

      (3) The switch from ASO1 to ASO2 between Figures 5 and 6 needs more discussion. Why were new ASOs generated when ASO1 worked?

      We thank the reviewer for their question regarding the transition from Emc10<sup>ASO1</sup> to Emc10<sup>ASO2</sup> between Figure 4 and Figures 5-6. Emc10<sup>ASO1</sup> served as our initial proof-of-concept ASO construct, successfully demonstrating the feasibility of inhibiting Emc10 mRNA expression and providing evidence for behavioral rescue in our mouse model. As outlined in the manuscript, Emc10<sup>ASO2</sup> targets a different region of the Emc10 transcript (intron 1, Figure 5A) compared to Emc10<sup>ASO1</sup> (intron 2, Figure 4A). This distinction provides an additional layer of validation for our targeting strategy and ensures specificity in modulating Emc10 expression. In addition, Emc10<sup>ASO1</sup> exhibited limited distribution in the brain, primarily targeting the hippocampus with weaker inhibition of Emc10 in other regions such as the cortex (Figure 4C, right panel). Emc10<sup>ASO2</sup> overcame this limitation and achieved broader brain distribution, as demonstrated by the qRT-PCR data in Figure 5C. Given that 22q11.2DS can affect multiple brain regions and cognitive domains beyond the hippocampus, achieving broader distribution of the ASO is critical for a more comprehensive assessment of therapeutic potential.

      (4) Page 3: Define "LoF"

      We have defined Loss-of-Function (LoF) in the main text where it is first mentioned in the Introduction, where we discuss the potential of using LoF mutations to devise therapeutic interventions (see page 3 of the revised manuscript).

      References

      Bailey and Crawley, Anxiety-Related Behaviors in Mice, In: Methods of Behavior Analysis in Neuroscience. 2nd edition. Boca Raton (FL): CRC Press/Taylor & Francis; Chapter 5, (2009).

      Bartel, MicroRNAs: target recognition and regulatory functions, Cell 136(2):215-33, (2009).

      Bartel, Metazoan MicroRNAs, Cell, 173(1):20-51, (2018).

      Chitwood et al., EMC Is Required to Initiate Accurate Membrane Protein Topogenesis, Cell 175, 1507-1519 e1516, (2018).

      Chitwood and Hegde, The Role of EMC during Membrane Protein Biogenesis, Trends Cell Biol. (5):371-384, (2019).

      Darras et al., Nusinersen in later-onset spinal muscular atrophy: Long-term results from the phase 1/2 studies, Neurology 92(21), (2019).

      Diamantopoulou et al., Loss-of-function mutation in Mirta22/Emc10 rescues specific schizophrenia-related phenotypes in a mouse model of the 22q11.2 deletion, Proc Natl Acad Sci U S A 114, E6127-E6136, (2017).

      Donegan et al., Coding of social novelty in the hippocampal CA2 region and its disruption and rescue in a 22q11.2 microdeletion mouse model, Nat Neurosci 23, 1365-1375, (2020).

      Finkel et al., Nusinersen versus Sham Control in Infantile-Onset Spinal Muscular Atrophy, N Engl J Med 377(18):1723-1732, (2017).

      Kordasiewicz et al., Sustained therapeutic reversal of Huntington's disease by transient repression of huntingtin synthesis, Neuron 74(6):1031-44, (2012).

      Oda et al., MicroRNA-34a-5p: A pivotal therapeutic target in gallbladder cancer, Mol Ther Oncol, 32(1):200765, (2024).

      Piskorowski et al., Age-Dependent Specific Changes in Area CA2 of the Hippocampus and Social Memory Deficit in a Mouse Model of the 22q11.2 Deletion Syndrome. Neuron 89, 163-176, (2016).

      Qi et al., Combined small-molecule inhibition accelerates the derivation of functional cortical neurons from human pluripotent stem cells. Nat Biotechnol 35, 154-163, (2017).

      Scoles et al., Antisense oligonucleotide therapy for spinocerebellar ataxia type 2, Nature 544(7650):362-366, (2017).

      Shao et al., A recurrent, homozygous EMC10 frameshift variant is associated with a syndrome of developmental delay with variable seizures and dysmorphic features, Genet Med 23, 1158-1162, (2021).

      Shurtleff et al., The ER membrane protein complex interacts cotranslationally to enable biogenesis of multipass membrane proteins, Elife 7, (2018).

      Soutschek et al., A human-specific microRNA controls the timing of excitatory synaptogenesis, bioRxiv, (2023).

      Stark et al., Altered brain microRNA biogenesis contributes to phenotypic deficits in a 22q11-deletion mouse model. Nat Genet 40, 751-760, (2008).

      Xu et al., Derepression of a neuronal inhibitor due to miRNA dysregulation in a schizophrenia-related microdeletion, Cell 152, 262-275, (2013).

      Yao et al., miR-200b targets GATA-4 during cell growth and differentiation, RNA Biol.10(4):465-8, (2013).

      Zhao et al., miR-92b-3p Regulates Cell Cycle and Apoptosis by Targeting CDKN1C, Thereby Affecting the Sensitivity of Colorectal Cancer Cells to Chemotherapeutic Drugs, Cancers 2;13(13):3323, (2021).

      Zhou et al., miR-92a is upregulated in cervical cancer and promotes cell proliferation and invasion by targeting FBXW7, Biochem Biophys Res Commun 458(1):63-9, (2015).

      Zhou et al., MicroRNA-143 acts as a tumor suppressor by targeting hexokinase 2 in human prostate cancer, Am J Cancer Res. 5(6):2056-6 (2015).

    1. eLife Assessment

      This study presents an important finding on the involvement of a Caspase 3-dependent pathway in the elimination of synapses for retinogeniculate circuit refinement and eye-specific territory segregation. This work fits well with the concept of "synaptosis" which has been proposed in the past. The evidence supporting the claims of the authors is convincing, demonstrating that caspase-3 activation is essential for microglial elimination of synapses during both brain development and neurodegeneration. The work will be of interest to investigators studying cell death pathways, neurodevelopment, and neurodegenerative disease.

    2. Reviewer #2 (Public review):

      This manuscript by Yu et al. demonstrates that activation of caspase-3 is essential for synapse elimination by microglia, but not by astrocytes. This study also reveals that caspase-3 activation-mediated synapse elimination is required for retinogeniculate circuit refinement and eye-specific territory segregation in dLGN in an activity-dependent manner. Inhibition of synaptic activity increases caspase-3 activation and microglial phagocytosis, while caspase-3 deficiency blocks microglia-mediated synapse elimination and circuit refinement in the dLGN. The authors further demonstrate that caspase-3 activation mediates synapse loss in AD: loss of caspase-3 prevented synapse loss in AD mice. Overall, this study reveals that caspase-3 activation is an important mechanism underlying the selectivity of microglia-mediated synapse elimination during brain development and in neurodegenerative diseases.

      A previous study (Gyorffy B. et al., PNAS 2018) has shown that caspase-3 signal correlates with C1q tagging of synapses (mostly using in vitro approaches), which suggests that caspase-3 would be an underlying mechanism of microglial selection of synapses for removal. The current study provides convincing in vivo evidence demonstrating that caspase-3 activation is essential for microglial elimination of synapses during both brain development and neurodegeneration.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this manuscript, the authors study the effects of synaptic activity on the process of eye-specific segregation, focusing on the role of caspase 3, classically associated with apoptosis. The method for synaptic silencing is elegant and requires intrauterine injection of a tetanus toxin light chain into the eye. The authors report that this silencing leads to increased caspase 3 in the contralateral eye (Figure 1) and demonstrate evidence of punctate caspase 3 that does not overlap neuronal markers like map2. However, the quantifications showing increased caspase 3 in the silenced eye (done at P5) are complicated by overlap with the signal from entire dying cells in the thalamus. The authors also show that global caspase 3 deficiency impairs the process of eye-specific segregation and circuit refinement (Figures 3-4).

      The reviewer states: “this silencing leads to increased caspase 3 in the contralateral eye”. We observed increased caspase-3 activity, not protein levels, in the contralateral dLGN, not eye.

      The reviewer states: “and demonstrate evidence of punctate caspase 3 that does not overlap neuronal markers like map2”. We do not believe that this statement is accurate, as we show that the punctate active caspase-3 signals overlap with the dendritic marker MAP2 (Figure S4A).

      The reviewer also states: “, the quantifications showing increased caspase 3 [activity] in the silenced [dLGN] (done at P5) are complicated by overlap with the signal from entire dying cells in the thalamus”. We do not believe that this statement is accurate. The apoptotic neurons we observed are relay neurons (confirmed by their morphology and positive staining of NeuN – Figure S4B-C) located in the dLGN (the dLGN is clearly labeled by expression of fluorescent proteins in RGCs, and only caspase-3 activity in the dLGN area is analyzed), not “cells” of unknown lineage (as suggested by the reviewer) in the general “thalamus” area (as suggested by the reviewer). If the dying cells were non-neuronal cells, that would indeed confound our quantification and conclusions, but that is not the case.

      We argue that whole-cell caspase-3 activation in dLGN relay neurons is a bona fide response to synaptic silencing by TeTxLC and therefore should be included in the quantification. We have two sets of controls: one is between the strongly inactivated dLGN and the weakly inactivated dLGN in the same TeTxLC-injected animal; and the second is between the dLGN of TeTxLC-injected animals and mock-injected animals. In both controls, only the dLGNs receiving strong synapse inactivation have more apoptotic dLGN relay neurons, demonstrating that these cells occur because of synapse inactivation. It is also unlikely that our perturbation is causing cell death through a non-synaptic mechanism. Since mock injections do not cause apoptosis in dLGN neurons, this phenomenon is not related to surgical damage. TeTxLC is injected into the eyes and only expressed in presynaptic RGCs, not in postsynaptic relay neurons, so this phenomenon is also unlikely to be caused by TeTxLC-related toxicity. Furthermore, if apoptosis of dLGN relay neurons is not related to synapse inactivation, then when TeTxLC is injected into both eyes, one would expect to see either the same amount or more apoptotic relay neurons, but we instead observed a reduction in dLGN neuron apoptosis, suggesting that synapse-related mechanisms are responsible. Considering the above, occasional whole-cell caspase-3 activation in relay neurons in TeTxLC-inactivated dLGN is causally linked to synapse inactivation and should be included in the quantification.

      We also revised the manuscript to better explain the possible mechanistic connection between localized caspase-3 activity and whole-cell caspase-3 activity. We propose that whole-cell caspase-3 activation occurs because of uncontrolled accumulation of localized caspase-3 activation. Please see lines 127-140 and lines 403-413 for details.

      Additionally, we would like to clarify that we are not claiming that synapse inactivation leads to only localized caspase-3 activation or only whole-cell caspase-3 activation, as is suggested by the editors and reviewers in the eLife assessment. We have clearly stated in the manuscript that both types of signals were observed. However, we reasoned that, because whole-cell caspase-3 activation in unperturbed dLGNs – which undergo normal synapse elimination – is infrequently observed, whole-cell caspase-3 activation may not be a significant driver of synapse elimination during normal development. In this revision, we included a new experiment to corroborate this hypothesis. If whole-cell caspase-3 activation in dLGN relay neurons is a prevalent phenomenon during normal development, such caspase-3 activity would lead to significant death of dLGN relay neurons during normal development. Consequently, if we block caspase-3 activation by deleting caspase-3, the number of relay neurons in the dLGN should increase. However, in support of our hypothesis, we observed comparable numbers of relay neurons in Casp3<sup>+/+</sup> and Casp3<sup>-/-</sup> mice. Please see Figure S7 for details.

      The authors also report that "synapse weakening-induced caspase-3 activation determines the specificity of synapse elimination mediated by microglia but not astrocytes" (abstract). They report that microglia engulf fewer RGC axon terminals in caspase 3 deficient animals (Figure 5), and that this preferentially occurs in silenced terminals, but this preferential effect is lost in caspase 3 knockouts. Based on this, the authors conclude that caspase 3 directs microglia to eliminate weaker synapses. However, a much simpler and critical experiment that the authors did not perform is to eliminate microglia and show that the caspase 3 dependent effects go away. Without this experiment, there is no reason to assume that microglia are directing synaptic elimination.

      The reviewer states: “microglia engulf fewer RGC axon terminals in caspase 3 deficient animals (Figure 5), and that this preferentially occurs in silenced terminals, but this preferential effect is lost in caspase 3 knockouts”. We are not sure what the reviewer means by “this preferentially occurs in silenced terminals”. Our results show that microglia preferentially engulf silenced terminals, and such preference is lost in caspase-3 deficient mice (Figure 6).

      We do not understand the experiment the reviewer suggested, namely to “eliminate microglia and show that the caspase 3 dependent effects go away”. To quantify caspase-3 dependent engulfment of synaptic material by microglia or preferential engulfment of silenced terminals by microglia, microglia must be present in the tissue sample. If we eliminate microglia, neither of these measurements can be made. What could be measured if microglia are eliminated is the refinement of the retinogeniculate pathway. This experiment would test whether microglia are required for caspase-3 dependent phenotypes. This is not a claim made in the manuscript. Instead, we claimed that caspase-3 is required for microglia to engulf weak synapses, as supported by the evidence presented in Figure 6.

      We did not claim that “microglia are directing synaptic elimination”. Our claim is that synapse inactivation induces caspase-3 activity, and caspase-3 activation in turn leads to engulfment of weak synapses by microglia. Based on this model, it is the neuronal activity that fundamentally directs synapse elimination. Synapse engulfment by microglia is only a readout we used to measure the outcome of activity-dependent synapse elimination. We have revised all sections in the manuscript that are related to synapse engulfment by microglia to emphasize the logic of this model.

      We have also revised the abstract and title of the paper to better align it with our main claims, removed the reference to astrocytes, and clarified that microglia engulfment measurements are used as readouts of synapse elimination.

      Finally, the authors also report that caspase 3 deficiency alters synapse loss in 6-month-old female APP/PS1 mice, but this is not really related to the rest of the paper.

      We respectfully disagree that Figure 7 is not related to the rest of the paper. Many genes involved in postnatal synapse elimination, such as C1q and C3, have been implicated in neurodegeneration. It is therefore natural and important to ask whether the function of caspase-3 in regulating synaptic homeostasis extends to neurodegenerative diseases in adult animals. The answer to this question may have broad therapeutic impacts.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Yu et al. demonstrates that activation of caspase-3 is essential for synapse elimination by microglia, but not by astrocytes. This study also reveals that caspase-3 activation-mediated synapse elimination is required for retinogeniculate circuit refinement and segregation of eye-specific territories in the dLGN in an activity-dependent manner. Inhibition of synaptic activity increases caspase-3 activation and microglial phagocytosis, while caspase-3 deficiency blocks microglia-mediated synapse elimination and circuit refinement in the dLGN. The authors further demonstrate that caspase-3 activation mediates synapse loss in AD: loss of caspase-3 prevented synapse loss in AD mice. Overall, this study reveals that caspase-3 activation is an important mechanism underlying the selectivity of microglia-mediated synapse elimination during brain development and in neurodegenerative diseases.

      Strengths:

      A previous study (Gyorffy B. et al., PNAS 2018) has shown that caspase-3 signal correlates with C1q tagging of synapses (mostly using in vitro approaches), which suggests that caspase-3 would be an underlying mechanism of microglial selection of synapses for removal. The current study provides direct in vivo evidence demonstrating that caspase-3 activation is essential for microglial elimination of synapses in both brain development and neurodegeneration.

      The paper is well-organized and easy to read. The schematic drawings are helpful for understanding the experimental designs and purposes.

      Weaknesses:

      It seems that astrocytes contain large amounts of engulfed materials from ipsilateral and contralateral axon terminals (Figure S11B) and that caspase-3 deficiency also decreased the volume of engulfed materials by astrocytes (Figures S11C, D). So the possibility that astrocyte-mediated synapse elimination contributes to circuit refinement in dLGN cannot be excluded.

      We would like to clarify that we do not claim that astrocytes are unimportant for synapse elimination or circuit refinement. We acknowledge that the claim made in the original submitted manuscript that caspase-3 does not regulate synapse elimination by astrocytes lacks strong supporting evidence. We have removed this claim and revised the section related to synapse engulfment by astrocytes to provide a more rigorous interpretation of our data. We also removed the section in discussion regarding distinct substrate preferences of microglia and astrocytes.

      Does blocking single or dual inactivation of synapse activity (using TeTxLC) increase microglial or astrocytic engulfment of synaptic materials (of one or both sides) in dLGN?

      We assume that by “blocking single or dual inactivation of synapse activity”, the reviewer refers to inactivating retinogeniculate synapses from one or both eyes.

      We showed that inactivating retinogeniculate synapses from one eye (single inactivation) increases engulfment of inactive synapses by microglia (Figure 6). We did not measure synapse engulfment by microglia while inactivating retinogeniculate synapses from both eyes (dual inactivation). However, based on the total active caspase-3 signal (Figure 2) in the dual inactivation scenario, we do not expect to see an increase in engulfment of synaptic material by microglia.

      We did not measure astrocyte-mediated engulfment with single or dual inactivation, as we did not see a robust caspase-3 dependent phenotype in synapse engulfment by astrocytes.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the Authors):

      (1) Figure 1 - It is not clear from this figure whether the authors are measuring caspase 3 in dendritic compartments or in dying relay neurons in the thalamus. The authors state that "either" whole cell death (1B) or smaller punctate signals (1F) were observed. When quantifying "photons" in Figure 1E, it appears most of the signal captured will be of dying relay neurons. What determined which signal was observed, and what is being quantified in Figure 1E? This also applies to the quantifications being reported in Figure 2.

      The quantification includes both types of signals: it is the sum of all active caspase-3 signal within the dLGN boundary. We note that there is a significant amount of punctate signal in the TeTxLC-inactivated dLGN. Unfortunately, due to file compression, these signals are not clearly visible in the submitted manuscript file. We have provided high-resolution figures in this revision.

      As argued above in the response to the public review, apoptotic relay neurons in TeTxLC-inactivated dLGN (not the general thalamus area) occur as a direct consequence of synapse inactivation. Therefore, active caspase-3 signals in these relay neurons should be included in the quantification.

      We believe it is the extent of synapse inactivation (i.e., the number of synapses that are inactivated) that determines whether dLGN relay neuron apoptosis occurs or not. Such apoptosis is expected considering the nature of the apoptosis signaling cascade. In the intrinsic apoptosis pathway, release of cytochrome-c from mitochondria induces cleavage of the initiator caspase, caspase-9, and caspase-9 in turn cleaves the executioner caspases, caspase-3/7, which causes apoptosis. Caspase-3 can cleave upstream factors in the apoptosis pathway, leading to explosive amplification of caspase-3 activity (McComb et al., DOI: 10.1126/sciadv.aau9433). When a relay neuron receives a few inactivated synapses, caspase-3 activation in the postsynaptic dendrite can remain local (as we observed in Figure 1), constrained by mechanisms such as proteasomal degradation of cleaved caspase-3 (Erturk et al., DOI: 10.1523/JNEUROSCI.3121-13.2014). However, when a relay neuron receives many inactivated synapses, the cumulative caspase-3 activity induced in the dendrite can overwhelm negative regulation and lead to significantly higher levels of caspase-3 activity in entire dendrites (Figure S4B) through positive feedback amplification, eventually leading to caspase-3 activation in entire relay neurons. Please see lines 127-140 and lines 403-413 for our discussion in the main text.

      (2) Figure 5 - Figures 5c-d and Fig 6 are confounded by pseudoreplication, whereby performing statistics on 50-60 microglia inflates statistical significance. Could the authors show all these data per mouse?

      If we understand the reviewer correctly, the reviewer is suggesting that reporting measurements from multiple microglia in one animal constitutes pseudoreplication. This is correct in a strict sense, as microglia in the same animal are more likely to be similar than microglia from different animals. In the revised version, we have plotted the data by animal in Figures S11 and S13. The observations remain valid. However, we would like to point out that averaging measurements from all microglia in each animal and reporting by mouse is very conservative, as measurements from microglia in the same animal still vary greatly due to cell-to-cell differences.
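
      The per-animal aggregation discussed above can be sketched as follows. This is a minimal illustration with hypothetical values and names (`per_animal_means`, `m1`, `m2`), not the analysis code used for the manuscript: per-cell measurements are averaged within each animal so that the animal, not the cell, becomes the unit of replication.

```python
from collections import defaultdict
from statistics import mean


def per_animal_means(values, animal_ids):
    """Average per-cell measurements within each animal."""
    grouped = defaultdict(list)
    for value, animal in zip(values, animal_ids):
        grouped[animal].append(value)
    return {animal: mean(cells) for animal, cells in grouped.items()}


# Hypothetical engulfment values (arbitrary units), three microglia per mouse.
values = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]
animals = ["m1", "m1", "m1", "m2", "m2", "m2"]

# Six per-cell data points collapse to one data point per animal.
animal_means = per_animal_means(values, animals)
```

      Any downstream test would then be run on `animal_means` (n = number of animals), which is why this approach is more conservative than treating each microglia as an independent sample.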

      (3) Although the authors are not the only ones to use this strategy, it is worth noting that performing all microglial experiments in Cx3cr1 heterozygotes could lead to alterations in microglial function that may not be reflective of their homeostatic roles.

      We acknowledge that Cx3cr1 heterozygosity could cause alterations in microglial physiology.

      While Cx3cr1 heterozygosity may impact microglia physiology, we note that the engulfment assay in Figure 5 is comparing microglia in Cx3cr1<sup>+/-</sup>; Casp3<sup>+/-</sup> and Cx3cr1<sup>+/-</sup>; Casp3<sup>-/-</sup> animals. Therefore, the impact of Cx3cr1 heterozygosity is controlled for in our experiment, and the observed difference in engulfed synaptic material in microglia is an effect specific to caspase-3 deficiency. However, we acknowledge that this difference could be quantitatively affected by Cx3cr1 heterozygosity.

      It is important to note that we did not perform all microglia engulfment analyses using Cx3cr1<sup>+/-</sup> mice. We have edited the manuscript to make this more clear. In the activity-dependent microglia engulfment analysis performed in Figure 6, we used Casp3<sup>+/+</sup> and Casp3<sup>-/-</sup> animals and detected microglia with anti-Iba1 immunostaining. Therefore, the impact of Cx3cr1 heterozygosity is not a problem for this experiment.

      Minor:

      (1) Figures are presented out of order, which makes the manuscript difficult to follow.

      We have revised text regarding the segregation analysis to align with the order of figures.

      (2) Figure S3 is very confusing- the terms "left" and "right" are used in three or four partly overlapping contexts (which eye, which injection, which panel or subpanel of the figure is being referred to). Would this not be more appropriately analyzed with a repeated measures ANOVA (multiple comparisons not necessary) rather than multiple separate T-tests?

      We have revised Figures S3 and S5 with better annotation and legends.

      Yes, it is possible to use a repeated-measures two-way ANOVA. The analysis reports a significant effect of genotype, with 1 degree of freedom, a sum of squares and mean square of 0.0001081, F(1,13) = 7.595, and p = 0.0164. We used multiple separate t-tests because we wanted to show how genotype effects change with increasing thresholds, whereas the two-way ANOVA provides only one overall p-value for genotype.
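
      The reported ANOVA values can be cross-checked with a short calculation. The sketch below assumes only that the F ratio is the effect mean square divided by the error mean square, and uses the identity that an F(1, v) statistic is the square of a t(v) statistic. The helper names are hypothetical, and the integration is a plain Simpson's rule rather than a statistics library call.

```python
import math


def t_pdf(x, dof):
    """Probability density of Student's t distribution."""
    coeff = math.gamma((dof + 1) / 2) / (
        math.sqrt(dof * math.pi) * math.gamma(dof / 2)
    )
    return coeff * (1 + x * x / dof) ** (-(dof + 1) / 2)


def f_test_p_value_df1(f_stat, dof_error, steps=2000):
    """P(F(1, dof_error) > f_stat), using F(1, v) = t(v) ** 2."""
    t = math.sqrt(f_stat)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof_error) + t_pdf(t, dof_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof_error)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t  # two-sided tail of the t distribution


f_stat, dof_error = 7.595, 13   # values quoted in the response
ms_effect = 0.0001081           # with 1 degree of freedom, SS equals MS
ms_error = ms_effect / f_stat   # implied error mean square
p_value = f_test_p_value_df1(f_stat, dof_error)
```

      With 1 degree of freedom the sum of squares and the mean square coincide, which is why a single value is quoted for both.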

      (3) Could the authors clarify why the percentage overlap (in the controls) is so different between Figure 3C and Figure S3C, and why different thresholds are applied?

      This difference is primarily due to a difference in age. The data in Figure 3 and Figure S5 were acquired at P10, while those in Figure S3 were acquired at P8. While the segregation process is largely complete by P8, segregation continues from P8 to P10. Therefore, overlap measured at P10 will be lower than that measured at P8. If we compare overlap at the same threshold (e.g., 10%) and at the same age in Figure 3 and Figure S5, the overlap is very similar.

      The choice of threshold is related to the method of labeling. In Figure 3, RGC terminals are labeled with Alexa Fluor-conjugated cholera toxin subunit beta (CTB). In Figures S3 and S5, RGC axons are labeled by expression of fluorescent proteins. CTB labels only membrane surfaces but yields stronger, and at fine scales slightly different, signals than fluorescent proteins, which fill the cell. For Figures S3 and S5 (which use fluorescent protein labeling), higher thresholds such as those used in Figure 3 (which uses CTB labeling) can be applied and the same trend still holds, but the data will be noisier. Regardless of the small difference in thresholds used, the important observation is that the defects in TeTxLC-injected or caspase-3 deficient animals are clear across multiple thresholds.
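
      To make the role of the threshold concrete, here is a minimal sketch of one common way to compute percentage overlap between the two eye-specific channels. The exact definition used in the manuscript is not given in this response, so expressing the threshold as a fraction of each channel's maximum intensity, the function name `percent_overlap`, and the toy 2 x 2 images are all assumptions made for illustration.

```python
import numpy as np


def percent_overlap(ipsi, contra, threshold, mask=None):
    """Percent of dLGN pixels where both channels exceed the threshold.

    `threshold` is a fraction of each channel's maximum intensity
    (an assumed convention), and `mask` marks pixels inside the dLGN.
    """
    ipsi = np.asarray(ipsi, dtype=float)
    contra = np.asarray(contra, dtype=float)
    if mask is None:
        mask = np.ones(ipsi.shape, dtype=bool)
    ipsi_on = ipsi >= threshold * ipsi[mask].max()
    contra_on = contra >= threshold * contra[mask].max()
    both = ipsi_on & contra_on & mask
    return 100.0 * both.sum() / mask.sum()


# Toy intensity images standing in for the two labeled projections.
ipsi = np.array([[10, 8], [2, 0]])
contra = np.array([[2, 9], [10, 10]])

# Overlap shrinks as the threshold is raised.
overlaps = {thr: percent_overlap(ipsi, contra, thr) for thr in (0.1, 0.5, 0.9)}
```

      Repeating the calculation over several thresholds, as in `overlaps`, mirrors the point made above: a genuine segregation defect should remain visible across a range of thresholds rather than at a single chosen value.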

      (4) Many describe the eye-specific segregation process as being complete "between P8-10". Other studies have quantified ESS at P10 (Stevens 2007). The authors state they did all quantifications at P8 (l. 82) and refer to Figure 3, but Figure 3 shows images from P10, whereas Figure S3 shows data from P8.

      We did not say we performed all quantification at P8. In line 85, we said “To validate the efficacy of our synapse inactivation method, we injected AAV-hSyn-TeTxLC into the right eye of wildtype E15 embryos and analyzed the segregation of eye-specific territories at postnatal day 8 (P8), when the segregation process is largely complete”. The age of postnatal day 8 in this context is specifically referring to the experiment shown in Figure S3. For the segregation analysis in Figure 3, we specifically stated that the experiment was conducted at P10 (line 277).

      Although the experiment in Figure S3 was conducted at P8, and Figure S5 and Figure 3 show results at P10, each dataset always included appropriate age-matched controls. P8 is generally considered an age at which segregation is mostly complete, and it is sufficient for us to assess the potency of AAV-delivered TeTxLC on eye-specific segregation. We do not think that performing the experiment shown in Figure S3 at P8 affects the interpretation of the data.

      (5) Is Figure 6 also using Cx3cr1 GFP to label microglia? This is not clarified.

      We apologize for this oversight. In Figure 6 microglia are labeled by anti-Iba1 immunostaining. We have clarified this in figure legends and text.

      Reviewer #2 (Recommendations for the Authors):

      (1) The authors quantified the caspase-3 activity using immunostaining and confocal microscopy (Figures 1B-E). They may need to verify the result (increased level of activated caspase-3 upon synapse inactivation) using alternative methods, such as western blotting.

      Both western blot and immunostaining are based on antibody-antigen interaction, so these two methods are unlikely to be sufficiently independent. Additionally, to perform a western blot, we would need to surgically collect the TeTxLC-inactivated dLGN to avoid sample contamination from other brain regions. Such collection at the age we are interested in (P5) is very challenging. We have tested the anti-cleaved caspase-3 antibody using caspase-3 deficient mice, and we can confirm that it is a highly specific antibody that does not generate signal in caspase-3 deficient tissue samples.

      (2) Does caspase-3 deficiency alter the density of microglia or astrocytes in dLGN?

      No. Neither the density of microglia nor that of astrocytes changed with caspase-3 deficiency. In the case of microglia, we find that the mean density of microglia per unit area of dLGN is virtually the same in wild type and caspase-3 deficient mice (two-tailed t test P = 0.8556, 6 wild type and 5 Casp3<sup>-/-</sup> mice). Some overviews showing microglia in dLGNs of wildtype and caspase-3 deficient mice can be found in Figure S10. Similarly for astrocytes, we did not observe overt changes in astrocyte density in the dLGN linked to caspase-3 deficiency.

      (3) During dLGN eye-specific segregation in normal developing animals, did the authors observe different levels of activated caspase-3 in different regions (territories)?

      For normal developing animals, the activated caspase-3 signal is generally sparse, and it is difficult to distinguish whether the signal is related to synapse elimination. For animals receiving TeTxLC-injection, we did notice that in the dLGN contralateral to the injection, where most inactivated synapses are located, the punctate caspase-3 signal tends to concentrate on the ventral-medial side of the dLGN (Figure 1B), which is the region preferentially innervated by the contralateral eye.

      (4) Recording of NMDAR-mediated synaptic currents may not be necessary for demonstrating that caspase 3 is essential for dLGN circuit refinement. In addition, the PPR may not necessarily reflect the number of innervations that a dLGN neuron receives. Instead, showing the changes in the frequency of mEPSCs (or synapse/spine density) may be more supportive.

      Thank you for the comment. We have performed the suggested mEPSC measurements and reported the results in revised Figure 4D-F.

      (5) Why is caspase 3 activation enhanced (compared to control) only at 4 months of age, when A-beta deposition has not formed yet, but not at later time points in AD mice (Figure S17)?

      A prevailing hypothesis in the field is that the form of A-beta that is most neurotoxic is the soluble oligomeric form, not the fibril form that leads to plaque deposition. As the oligomeric form appears before plaque deposition, the enhanced caspase-3 activation we observed at 4 months may reflect an increase in oligomeric A-beta, which occurs before any visible A-beta plaque formation.

      (6) The manuscript can be made more concise, and the figures more organized.

      We removed superfluous details and corrected text-figure mismatches in the revised manuscript to improve readability.

    1. eLife Assessment

      This valuable study reports on the characteristics of premotor cortical population activity during the execution and observation of a moderately complex reaching and grasping task. By using new variants of well-established techniques to analyse neural population activity, the authors provide solid evidence that while the geometry of neural population activity changes between execution and observation, their dynamics are largely preserved. Although these findings are novel and robust, pending additional controls and analyses, the authors should further clarify the functional implications of their findings.

    2. Reviewer #2 (Public review):

      The authors investigated the similarity (or lack thereof) of neural dynamics while monkeys reached to and manipulated one of 4 objects in each trial, compared to observing similar movements performed by experimenters. They focused on mirror neurons (MNs) and rather convincingly showed that MN dynamics are dissimilar during executing vs. observing actions. The manuscript has improved quite significantly compared to the previous version and I congratulate the authors for that. However, there are still a few points I would like to raise that I think will improve the manuscript scientifically and make it more pleasant to read.

      - I appreciate the nicely compiled literature review which provides the context for the manuscript.
      - Message: The takeaway message of the paper is inconsistent and changes throughout the paper. To me, the main takeaway is that observation and execution subspaces progress during the trial (Fig 4), and that they are distinct processes and rather dissimilar, as stated in #440-441, #634-635, etc. But the title of the paper implies the opposite. Some of the interpretations of the results (e.g., Fig 8) also imply similarity of dynamics.
      - Readability: I have many issues with the readability/organisation of the paper. Unfortunately, I still find the quality of data presentation low. Below I list a few points:
        (1) In 5 sessions out of 9, there are fewer than 20 neurons categorised as AE. This means this population is under-sampled in the data which makes applying any neural population techniques questionable. Moreover, the relevance of the AE analysis is also sometimes unclear: In Fig 4, the AE-related panels are just referred to once in the paper. Yet AE results are presented right next to the main results throughout the paper.
        (2) Figures are low resolution and pixelated. There are some faded horizontal and vertical lines in Fig1B that are barely visible. Moreover, it may be my personal preference, but I think Fig1 is more confusing than helpful. Although panel A shows some planes rotating, indicating time-varying dynamics, I couldn't understand what more panel B is trying to convey. The arrow of time is counterclockwise, but the planes progress clockwise (i > ii > iii). Similarly, panel C just seems to show some points being projected to orthogonal subspaces (even though later in the paper we'll see that observation and execution subspaces are not orthogonal), and the CCA subspace illustrated in the same high-d space, which mathematically may be inaccurate, as CCA projects the data to a new space.
        In Fig 2A, the objects are too small and pixelated as well. I suggest an overhaul of the figures to make the paper more accessible.
        (3) Clarity of the text: The manuscript text could be more concise, to the point, avoiding repetitions, self-consistent, and simply readable. To name a few issues: Single letter acronyms were used to refer to trial epochs (I/G/M/H). M alone has been re-defined 13 different times in the text as in: ...Movement (M)..., excluding every related figure. The acronym (I) refers to the instruction epoch, the high-d space in Fig 1, and panel I of some figures. The acronym MN for Mirror Neurons was defined 4 separate times in the text yet spelled out as Mirror Neuron more than 2 dozen times. CD is defined in the caption of Fig 3 and never used, despite condition-dependent being a common term in the text. Many sentences, e.g., "In contrast, throughout..." in #265-#269, and "To summarize,..." in #270-#275, are too long with difficult wording. To get the point from these sentences, I had to read them many times, and go back and forth between them and the figure. Rewriting such sentences makes the manuscript much more accessible.
      - Figure 3: It appears that the condition independent signal has been calculated by subtracting the average of the 4 neural trajectories in Fig 3A, corresponding to different objects. Whereas #133 suggests that it should be calculated by subtracting the average firing rate of different conditions. Assuming I got the methods right, dynamics being "knotted" (#234) after removing the condition independent signal could be because they are similar, so subtracting the condition independent signal leaves us with the noise component. This matters for the manuscript especially since this is the reason for performing the more sensitive instantaneous subspaces.
      - Decoding results: I appreciate that the authors improved the decoding results in this version of the manuscript. Now it is much more interesting. However oddly, it appears that only data from 1 monkey is shown. #370 says the results from the other 2 are similar. The decoding data from every monkey must be shown. If the results are similar, they must be at least in Supplements. Currently, only 1 session (out of 3) in the Observation condition seems to decode the object type. This effect, if consistent across animals and session, is very interesting on its own and challenges other claims in the paper.
      - Figure 8: I reiterate the issue #7 in my previous review. I appreciate the authors clearing some methods, but my concern persists. As per line #839, spiking activity has been smoothed with a 50ms kernel. Thus, unless trial data is concatenated, I suspect the 100ms window used for this analysis is too short (small sample size), thus the correlation values (CCs) might be spurious. References cited in this section use a smaller smoothing kernel (30ms) and a much longer window (~450ms).
        Moreover, I don't know why the authors chose to show correlation values in 3D space! Values of Fig8C-red are impossible to know. Furthermore, the manuscript insists on CC values of the Hold period being high, which is probably correct. But I wonder why the focus on the Hold period? I think the most relevant epoch for analysing the MNs is the Movement where the actual action happens. Interestingly, in the movement epoch, the CC values are visibly low. The reason why Hold results are more important and why the CCs in Movement are so low should be clarified in the text. Especially, statements like that in #661 seem particularly unjustified.
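
      The sample-size concern raised for Figure 8 can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the manuscript's data: the sample counts (6 vs. 5000) and the latent dimensionality (3) are hypothetical, the inputs are independent Gaussian noise, and canonical correlations are computed with a standard QR-plus-SVD formulation.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations between two (samples x features) matrices.

    Uses orthonormal bases of the mean-centered data (QR), then the
    singular values of their cross-product. Assumes full column rank.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


rng = np.random.default_rng(0)
n_dims = 3  # hypothetical latent dimensionality

# Two independent noise datasets: there is no shared structure to find.
few_samples = canonical_correlations(
    rng.standard_normal((6, n_dims)), rng.standard_normal((6, n_dims))
)
many_samples = canonical_correlations(
    rng.standard_normal((5000, n_dims)), rng.standard_normal((5000, n_dims))
)
```

      With only 6 samples, the two centered 3-dimensional datasets live in a 5-dimensional sample space and must share at least one direction, so the leading canonical correlation is essentially 1 even for pure noise. With 5000 samples it falls close to 0. Temporal smoothing reduces the number of effectively independent samples further, which is the core of the concern about short analysis windows.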

    3. Reviewer #3 (Public review):

      In their study, Zhao et al. investigated the population activity of mirror neurons (MNs) in the premotor cortex of monkeys either executing or observing a task consisting of reaching to, grasping, and manipulating various objects. The authors proposed an innovative method for analyzing the population activity of MNs during both execution and observation trials. This method made it possible to isolate the condition-dependent variance in neural data and to study its temporal evolution over the course of single trials. The method consists of building a time series of "instantaneous" subspaces with single time step resolution, rather than a single subspace spanning the entire task duration. As these subspaces are computed at each time step, projecting neural activity from a given task time into them results in latent trajectories that capture condition-dependent variance while minimizing the condition-independent one. The authors then analyzed the time evolution of these instantaneous subspaces and revealed that a progressive shift is present in subspaces of both execution and observation trials, with slower shifts during the grasping and manipulating phases compared to the initial preparation phase. Finally, they compared the instantaneous subspaces between execution and observation trials and observed that neural population activity did not traverse the same subspaces in these two conditions. However, they showed that these distinct neural representations can be aligned with Canonical Correlation Analysis, indicating dynamic similarities of neural data when executing and observing the task. The authors speculated that such similarities might facilitate the nervous system's ability to recognize actions performed by oneself or another individual.
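
      The idea of instantaneous subspaces summarized above can be sketched in a few lines. This is an illustrative reconstruction under assumptions, not the authors' implementation: it assumes the subspace at each time step is obtained by PCA across conditions after removing the across-condition mean, and it uses principal angles as one possible measure of how far the subspace has shifted. The array shapes and function names are hypothetical, and the data are random.

```python
import numpy as np


def instantaneous_subspace(rates_at_t, n_dims):
    """Orthonormal basis (neurons x n_dims) at a single time step.

    `rates_at_t` is (conditions x neurons). Subtracting the mean across
    conditions removes the condition-independent component, so the basis
    spans condition-dependent variance only.
    """
    centered = rates_at_t - rates_at_t.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dims].T


def principal_angles_deg(basis_a, basis_b):
    """Principal angles (degrees, ascending) between two subspaces."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))


rng = np.random.default_rng(1)
rates = rng.standard_normal((10, 4, 30))  # time steps x conditions x neurons

# One subspace per time step, then its smallest angle to the first subspace.
bases = [instantaneous_subspace(rates[t], n_dims=2) for t in range(rates.shape[0])]
angle_to_self = principal_angles_deg(bases[0], bases[0])
shift_over_time = [principal_angles_deg(bases[0], b)[0] for b in bases]
```

      With four conditions, the centered data at each time step has rank at most three, which bounds the dimensionality of any such subspace. Plotting `shift_over_time` against time would give a curve analogous in spirit to the progressive shift described in the review.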

      Unlike other areas of the brain, the analysis of neural population dynamics of premotor cortex MNs is not well established. Furthermore, analyzing population activity recorded during non-trivial motor actions, distinct from the commonly used reaching tasks, serves as a valuable contribution to computational neuroscience. This study holds particular significance as it bridges both domains, shedding light on the temporal evolution of the shift in neural states when executing and observing actions. The results are moderately robust, and the proposed analytical method could potentially be used in other neuroscience contexts.

    4. Reviewer #4 (Public review):

      Summary:

      In this study, the authors explore the neural dynamics of mirror neurons in the premotor cortex, focusing on the relationship between neural activity during action execution and observation. The study presents a rich dataset from three monkeys, with recordings from two regions per monkey. The authors use a method to analyze instantaneous neural subspaces and track their temporal evolution. Consistent with prior literature, they report that execution and observation subspaces remain largely distinct throughout the trial. However, after applying canonical correlation analysis, they observe a notable alignment between execution and observation activities, suggesting the presence of shared neural codes. The study is well-designed, and the analyses are thoroughly documented, occasionally overly so in the main text. While most findings are compelling, I find the conclusions drawn from Figure 8 less convincing. Specifically, I am skeptical about the application of CCA in this context and the subsequent interpretations regarding execution-observation similarity, which is a central claim of the manuscript.

      • The authors cite Safaie et al. 2023 as a precedent for applying CCA to align neural population dynamics. However, in that study, CCA was used to align neural dynamics across different animals, a justifiable approach given that neural trajectories exist in separate neural state spaces for each animal. Here, CCA is applied to align execution and observation activities within the same neural state space of the same MNs. I find this application of CCA less well-justified, as it may overestimate execution-observation similarity.
      • The control conditions presented in Figures 8C and 8D are somewhat reassuring, as they show that the similarity introduced by CCA is not universally high. However, these controls appear to be limited to the Hold epoch. It remains unclear whether the same holds true for the Go and Movement epochs.
      • In Figure 5, the authors display low-dimensional representations of four objects across task epochs during execution (A) and observation (B). The diagonals of the matrices reveal clear differences between execution and observation configurations across all four epochs. The authors suggest using CCA to align these configurations; however, this alignment seems to require time-specific application of CCA for each epoch (as demonstrated in Figure 8 for the Hold epoch). The need for time-specific adjustments likely depends on the fact that execution and observation subspaces are continuously shifting over time (as authors show in Figure 4), but this approach appears to be a strained attempt to demonstrate similarity between execution and observation codes.
      • The authors themselves offer an alternative hypothesis (line 730): that "PM MN population activity during action observation, rather than representing movements made by another individual similar to one's own movements, instead may represent different movements one might execute oneself in response to those made by another individual". This interpretation appears more congruent with the data presented.
      • In the end, I am left with a sense of ambiguity: which analysis should be considered more reliable, the negligible correspondence between execution and observation activity depicted in Figure 7, or the considerable similarity shown in Figure 8? The authors should address this apparent contradiction and provide a clearer discussion to reconcile these findings.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Major changes in the revised manuscript include:

      (1) The distinction between condition-dependent versus condition-independent variation in neural activity has been clarified. 

      (2) Principal angle calculations have been added. 

      (3) Neurons modulated during action execution but not during action observation have been analyzed to compare and contrast with mirror neurons. 

      (4) Canonical correlation analysis has been extended to three dimensions. 

      (5) Speculations have been moved to and modified in the Discussion. 

      (6) Computational details have been expanded in the Methods.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary and strengths. This paper starts with an exceptionally fair and balanced introduction to a topic, the mirror neuron literature, which is often debated and prone to controversies even in the choice of the terminology. In my opinion, the authors did an excellent job in this regard, and I really appreciated it. Then, they propose a novel method to look at population dynamics to compare neural selectivity and alignment between execution and observation of actions performed with different types of grip. 

      Thank you.

      Weakness.

      Unfortunately, the goal and findings within this well-described framework are less clear to me. The authors aimed to investigate, using a novel analytic approach, whether and to what extent a match exists between population codes and neural dynamics when a monkey performs an action or observes it performed by an experimenter. This motivation stems from the fact that the general evidence in the literature is that the match between visual and motor selectivity of mirror neuron responses is essentially at a chance level. While the approach devised by the authors is generally well-described and understandable, the main result obtained confirms this general finding of a lack of matching between the two contexts in two out of the three monkeys. Nevertheless, the authors claim that the patterns associated with execution and observation can be re-aligned with canonical correlation, indicating that these distinct neural representations show dynamical similarity that may enable the nervous system to recognize particular actions. This final conclusion is hardly acceptable to me, and constitutes my major concern, at least without a more explicit explanation: how do we know that this additional operation can be performed by the brain? 

      Point taken.  In the Discussion, we now have clarified that this is our speculation rather than a conclusion and we also offer an alternative interpretation (lines 724 to 744):

      “One classic interpretation of similar latent dynamics in the PM MN population during execution and observation would be that this similarity provides a means for the brain to recognize similar movements performed by the monkey during execution and by the experimenter during observation. Through some process akin to a communication subspace (Semedo et al., 2019), brain regions beyond PM might recognize the correspondence between the latent dynamics of the executed and observed actions.

      Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021). Though neurons active only during observation of others (AO units) have been hypothesized to drive observation activity in MNs, the present AO populations were too small to analyze with the approaches we applied here.  Nevertheless, the similar relative organization of the execution and observation population activity in PM MNs revealed here by alignment of their latent dynamics through CCA could constitute a correspondence between particular movements that might be made by the subject in response to particular movements made by the other individual, i.e. responsive movements which would not necessarily be motorically similar to the observed movements.”

      Is this a computational trick to artificially align something that is naturally non-aligned, or can it capture something real and useful? 

      We feel this is more than a trick.  In the Introduction, we now have clarified (lines 166 to 170):

      “Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.”

      In the Results we give the following example (lines 446 to 455):

      “Such alignment would indicate that neural representations of trials involving the four objects bore a similar relationship to one another in neural space during execution and observation, even though they occurred in different subspaces.  For example, the trajectories of PMd+M1 neuron populations recorded from two different monkeys during center-out reaching movements could be aligned well (Safaie et al., 2023).  CCA showed, for example, that in both brains the neural trajectory for the movement to the target at 0° was closer to the trajectory for movement to the target at 45° than to the trajectory for the movement to the target at 180°. Relationships among these latent dynamic representations of the eight movements thus were similar even though the neural populations were recorded from two different monkeys.”
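      As a minimal sketch of what such an alignment measures (not the authors' implementation), the canonical correlations between two sets of latent trajectories can be computed from a QR decomposition followed by an SVD. The data below are synthetic, and the function names, the three latent dimensions, the noise level, and the number of time samples are all illustrative assumptions. Two trajectory sets that share the same underlying structure but are expressed in differently oriented coordinates yield canonical correlations near 1, whereas unrelated trajectories do not.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between X (T x p) and Y (T x q).

    Rows are time samples (e.g. trajectory segments for all objects
    concatenated in time); columns are latent dimensions.
    Assumes both matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)          # orthonormal basis for the columns of Xc
    Qy, _ = np.linalg.qr(Yc)          # orthonormal basis for the columns of Yc
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)       # sorted from largest to smallest

def random_orthogonal(rng, dim):
    """Random orthogonal matrix (hypothetical helper for the synthetic demo)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

rng = np.random.default_rng(0)
T, dim = 400, 3
shared = rng.standard_normal((T, dim))            # common latent structure

# Synthetic "execution" and "observation" trajectories: the same structure
# seen through two different orthogonal mixings, plus a little noise.
exe = shared @ random_orthogonal(rng, dim)
obs = shared @ random_orthogonal(rng, dim) + 0.1 * rng.standard_normal((T, dim))

ccs_aligned = canonical_correlations(exe, obs)
ccs_unrelated = canonical_correlations(exe, rng.standard_normal((T, dim)))
print("shared structure:", np.round(ccs_aligned, 3))
print("unrelated:       ", np.round(ccs_unrelated, 3))
```

      Note that high canonical correlations only indicate that a linear re-mapping exists between the two sets; they say nothing about whether the two sets occupy the same subspace of the original neural space.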

      And in the Discussion we now compare (lines 677 to 686):

      “Corresponding neural representations of action execution and observation during task epochs with higher neural firing rates have been described previously in PMd MNs and in PMv MNs using representational similarity analysis (RSA) (Papadourakis and Raos, 2019).  And during force production in eight different directions, neural trajectories of PMd neurons draw similar “clocks” during execution, cooperative execution, and passive observation (Pezzulo et al., 2022).  Likewise in the present study, despite execution and observation trajectories progressing through largely distinct subspaces, in all three monkeys execution and observation trajectory segments showed some degree of alignment, particularly the Movement and Hold segments (Figure 8C), indicating similar relationships among the latent dynamic representations of the four RGM movements during execution and observation.”

      Based on the accumulated evidence on space-constrained coding of others' actions by mirror neurons (e.g., Caggiano et al. 2009; Maranesi et al. 2017), recent evidence also cited by the authors (Pomper et al. 2023), and the most recent views supported even by the first author of the original discovery (i.e., Vittorio Gallese, see Bonini et al. 2022 on TICS), it seems that one of the main functions of these cells, especially in monkeys, might be to prepare actions and motor responses during social interaction rather than recognizing the actions of others - something that visual brain areas could easily do better than motor ones in most situations. In this perspective, and given the absence of causal evidence so far, the lack of visuo-motor congruence is a potentially relevant feature of the mechanism rather than something to be computationally cracked at all costs. 

      We agree that this perspective provides a valuable interpretation of our findings.  In the Discussion, we have added the following paragraph (lines 730 to 744):

      “Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021). Though neurons active only during observation of others (AO units) have been hypothesized to drive observation activity in MNs, the present AO populations were too small to analyze with the approaches we applied here.  Nevertheless, the similar relative organization of the execution and observation population activity in PM MNs revealed here by alignment of their latent dynamics through CCA could constitute a correspondence between particular movements that might be made by the subject in response to particular movements made by the other individual, i.e. responsive movements which would not necessarily be motorically similar to the observed movements.”

      Specific comments on Results/Methods: 

      I can understand, based on the authors' hypothesis, that they employed an ANOVA to preliminarily test whether and which of the recorded neurons fit their definition of "mirror neurons". However, given the emphasis on the population level, and the consolidated finding of highly different execution and observation responses, I think it could be interesting to apply the same analysis on (at least also) the whole recorded neuronal population, without any preselection-based on a single neuron statistic. Such preselection of mirror neurons could influence the results of EXE-OBS comparisons since all the neurons activated only during EXE or OBS are excluded. Related to this point, the authors could report the total number of recorded neurons per monkey/session, so that also the fraction of neurons fitting their definition of mirror neuron is explicit. 

      We are aware that a number of recent studies from other laboratories already have analyzed the entire population of neurons during execution versus observation, without selectively analyzing neurons active during both execution and observation (Jiang et al., 2020; Albertini et al., 2021). However, our focus lies not in how the entire PM neural population encodes execution versus observation, but in the differential activity of the mirror neuron subpopulation in these two contexts.  Our new Table 2 presents the numbers of mirror neurons (MN), action execution only neurons (AE), action observation only neurons (AO), and neurons not significantly task-related during either execution or observation (NS).  Although we often recorded substantial numbers of AE neurons, very few AO neurons were found in our recordings.  In analyzing the AE subpopulation, we found unexpected differences in canonical correlation alignment between and within the MN and AE neuron populations. In view of the editors’ comments that “…the reviewers provided several specific recommendations of new analyses to include. However, now the paper feels extremely long…”, we have chosen to focus on comparing AE neurons with MNs.  

      Furthermore, the comparison of the dynamics of the classification accuracy in figures 4 and 5, and therefore the underlying assumption of subspaces shift in execution and observation, respectively, reveal substantial similarities between monkeys despite the different contexts, which are clearly greater than the similarities among neural subspaces shifts across task epochs: to me, this suggests that the main result is driven by the selected neural populations in different monkeys/implants rather than by an essential property of the neuronal dynamics valid across animals. Could the author comment on this issue? This could easily explain the "strange" result reported in figure 6 for monkey T. 

      We have taken the general approach of emphasizing findings common across individual animals, but also reporting individual differences.  We have added the following in the Discussion (lines 645 to 654):

      “We did not attempt to classify neurons in our PM MN populations as strictly congruent, broadly congruent, or non-congruent.  Nevertheless, the minimal overlap we found in instantaneous execution and observation subspaces would be consistent with a low degree of congruence in our PM MN populations.  Particularly during one session monkey T was an exception in this regard, showing a considerable degree of overlap between execution and observation subspaces, not unlike the shared subspace found in other studies that identified orthogonal execution and observation subspaces as well (Jiang et al., 2020).  Although our microelectrode arrays were placed in similar cortical locations in the three monkeys, by chance monkey T’s PM MN population may have included a substantial proportion of congruent neurons.”

      Reviewer #2 (Public Review): 

      In this work, the authors set out to identify time-varying subspaces in the premotor cortical activity of monkeys as they executed/observed a reach-grasp-hold movement of 4 different objects. Then, they projected the neural activity to these subspaces and found evidence of shifting subspaces in the time course of a trial in both conditions, executing and observing. These shifting subspaces appear to be distinct in execution and observation trials. However, correlation analysis of neural dynamics reveals the similarity of dynamics in these distinct subspaces. Taken together, Zhao and Schieber speculate that the condition-dependent activity studied here provides a representation of movement that relies on the actor. 

      This work addresses an interesting question. The authors developed a novel approach to identify instantaneous subspaces and decoded the object type from the projected neural dynamics within these subspaces. As interesting as these results might be, I have a few suggestions and questions to improve the manuscript: 

      (1) Repeating the analyses in the paper, e.g., in Fig5, using non-MN units only or the entire population, and demonstrating that the results are specific to MNs would make the whole study much more compelling. 

      We have added analyses of those non-MNs modulated significantly during action execution but not during observation, which we refer to as AE neurons.  The additional findings from these analyses are spread throughout the manuscript:

      Lines 284-293:

      “We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.” 

      Lines 411-419:

      “During execution trials, classification accuracy for AE populations (Figure 6I-L) showed a time course quite similar to that for MN populations, though amplitudes were lower overall, most likely because of the smaller population sizes. During observation, AE populations showed only low-amplitude, short-lived peaks of classification accuracy around times I, G, M, and H (Figure 6 – figure supplement 1).  Given that individual AE neurons showed no statistically significant modulation during observation trials, even these small peaks might not have been expected.  Previous studies have indicated, however, that neurons not individually related to task events nevertheless may contribute to a population response (Shenoy et al., 2013; Cunningham and Yu, 2014; Gallego et al., 2017; Jiang et al., 2020).”

      Lines 495-508:

      “Although MNs are known to be present in considerable numbers in both the primary motor cortex and premotor cortex (see Introduction), most studies of movement-related cortical activity in these areas make no distinction between neurons with activity only during action execution (AE neurons) and those with activity during both execution and observation (MNs).  This reflects an underlying assumption that during action execution, mirror neurons function in parallel with AE neurons, differing only during observation.  We therefore tested the hypothesis that MN and AE neuron execution trajectory segments from the same session would align well.  Figure 8C (blue) shows the mean CCs between MN and AE execution trajectory segments across 8 alignments (MN/AE; 2 R, 3 T, 3 F), which reached the highest values for the Hold segments .  All three of these coefficients were substantially lower than those for the MN execution vs. observation alignments given above.  Surprisingly, the alignment of AE neuron execution trajectory segments with those of the simultaneously recorded MN population was weaker than the alignment of MN trajectories during execution vs. observation.

      Did these differences in MN:1/2, MN:E/O, and MN/AE alignment result from consistent differences in their respective patterns of co-modulation, or from greater trial-by-trial variability in the patterns of co-modulation among MNs during observation than during execution, and still greater variability among AE neurons during execution?  The bootstrapping approach we used for CCA (see Methods) enabled us to evaluate the consistency of relationships among trajectory segments across repeated samplings of trials recorded from the same neuron population in the same session and in the same context (execution or observation).  We therefore performed 500 iterations of CCA between two different random samples of MN execution (MN:E/E), MN observation (MN:O/O), or AE execution (AE:E/E) trajectory segments from a given session (2 R, 3 T, 3 F). This within-group alignment of MN execution trajectory segments from the same session (Figure 8D, MN:E/E, gray, Hold: () was as strong as between-session alignment (Figure 8C, MN:1/2, black).  But within-group alignment of MN observation trajectory segments (Figure 8D, MN:O/O, orange, Hold: () was lower than that found with MN execution segments (Figure 8C, MN:E/O, red, .  Likewise, within-group alignment of AE neuron trajectory segments (Figure 8D, AE:E/E, light blue, Hold: () was lower than their alignment with MN execution segments (Figure 8C, MN/AE, blue, Hold: ().  Whereas MN execution trajectories were relatively consistent within sessions, MN observation trajectories and AE execution trajectories were less so.”

      And in the Discussion we now suggest (lines 682 to 698):

      “Based on the assumption that AE neurons and MNs function as a homogenous neuron population during action execution, we had expected AE and MN execution trajectory segments to align closely.  During execution trials, the progression of instantaneous condition-dependent subspaces and of classification accuracy in AE populations was quite similar to that in MN populations.  We were surprised to find, therefore, that alignment between execution trajectory segments from AE populations and from the simultaneously recorded MN populations was even lower than alignment between MN execution and observation segments (Figure 8C, blue versus red).  Moreover, whereas within-group alignment of MN execution trajectory segments was high, within-group alignment of AE neuron execution trajectory segments was low (Figure 8D, gray versus light blue).  These findings indicate that the predominant patterns of co-modulation among MNs during execution are quite consistent within sessions, but the patterns of comodulation among AE neurons are considerably more variable.  Together with our previous finding that modulation of MNs leads that of non-mirror neurons in time, both at the single neuron level and at the population level (Mazurek and Schieber, 2019), this difference in consistency versus variability leads us to speculate that during action execution, while MNs carry a consistent forward model of the intended movement, AE neurons carry more variable feedback information.”

      (2) The method presented here is similar and perhaps related to principal angles (https://doi.org/10.2307/2005662). It would be interesting to confirm these results with principal angles. For instance, instead of using the decoding performance as a proxy for shifting subspaces, principal angles could directly quantify the 'shift' (similar to Gallego et al, Nat Comm, 2018). 

      Point taken.  We now have calculated the principal angles as a function of time and present them as a new section of the Results including new figure 4 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
      To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”

      The related Methods are now described in subsection “Subspace Comparisons—Principal Angles”
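      The computation described in the quoted Results (PCA on the four condition-averaged neural states at a single time step, followed by principal angles between two such subspaces) can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the code used in the manuscript: the function names, the number of neurons, and the synthetic data are hypothetical, and the principal angles are obtained from the singular values of the product of two orthonormal bases.

```python
import numpy as np

def instantaneous_subspace(states):
    """Orthonormal basis of the condition-dependent subspace at one time step.

    states: (n_conditions, n_neurons) array of condition-averaged firing
    rates at a single time step (here 4 objects). Subtracting the mean
    across conditions removes the condition-independent component, so the
    remaining variation spans at most n_conditions - 1 dimensions.
    """
    centered = states - states.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[: states.shape[0] - 1].T        # (n_neurons, 3) for 4 objects

def principal_angles_deg(U, V):
    """Principal angles (degrees, smallest first) between two orthonormal bases."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

rng = np.random.default_rng(0)
n_objects, n_neurons = 4, 50                  # assumed sizes, for illustration

states_t1 = rng.standard_normal((n_objects, n_neurons))
sub_t1 = instantaneous_subspace(states_t1)

# Identical subspaces: all principal angles are (numerically) 0 degrees.
angles_same = principal_angles_deg(sub_t1, sub_t1)

# Orthogonal subspaces: condition-dependent activity confined to
# non-overlapping groups of neurons gives angles of 90 degrees.
states_a = np.zeros((n_objects, n_neurons))
states_b = np.zeros((n_objects, n_neurons))
states_a[:, :10] = rng.standard_normal((n_objects, 10))
states_b[:, -10:] = rng.standard_normal((n_objects, 10))
angles_orth = principal_angles_deg(instantaneous_subspace(states_a),
                                   instantaneous_subspace(states_b))
print(np.round(angles_same, 4), np.round(angles_orth, 4))
```

      Repeating the angle calculation between the subspace at every time step and a fixed reference subspace (for example, the one at time I, G, M, or H) would give a time course of the kind plotted in Figure 4.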

      Relatedly, why is the decoding of the 'object type' used to establish the progressive shifting of the subspaces? I would be interested to see the authors' argument. 

      We have clarified the reason for our decoding analysis as follows (lines 295 to 297):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.”

      And… (lines 332 to 348):

      “Decodable information changes progressively during both execution and observation 

      As RGM trials proceeded in time, the condition-dependent neural activity of the PM MN population thus changed in two ways.  First, the instantaneous condition-dependent subspace shifted, indicating that the patterns of firing-rate co-modulation among neurons representing the four different RGM movements changed progressively, both during execution and during observation.  Second, as firing rates generally increased, the neural trajectories representing the four RGM movements became progressively more separated, more so during execution than during observation. 

      To evaluate the combined effects of these two progressive changes, we clipped 100 ms single-trial trajectory segments beginning at times I, G, M, or H, and projected these trajectory segments from individual trials into the instantaneous 3D subspaces at 50 ms time steps.  At each of these time steps, we trained a separate LSTM decoder to classify individual trials according to which of the four objects was involved in that trial.  We expected that the trajectory segments would be classified most accurately when projected into instantaneous subspaces near the time at which the trajectory segments were clipped.  At other times we reasoned that classification accuracy would depend both on the similarity of the current instantaneous subspace to that found at the clip time as evaluated by the principal angle (Figure 4), and on the separation of the four trajectories at the clip time (Figure 5).”

      The object type should be much more decodable during movement or hold, than instruction, which is probably why the chance-level decoding performance (horizontal lines) is twice the instruction segment for the movement segment. 

      Indeed, the object type is more decodable during the movement and hold than during instruction or delay epochs.

      (3) Why aren't execution and observation subspaces compared together directly? Especially given that there are both types of trials in the same session with the same recorded population of neurons. Using instantaneous subspaces, or the principal angles between manifolds during exec trials vs obs trials.

      Point taken.  We now have added comparison of the execution and observation subspaces using the principal angles between instantaneous subspaces (lines 421 to 436):

      “Do PM mirror neurons progress through the same subspaces during execution and observation?

      Having found that PM mirror neuron populations show similar progressive shifts in their instantaneous neural subspace during execution and observation of RGM trials, as well as similar changes in decodable information, we then asked whether this progression passes through similar subspaces during execution and observation.  To address this question, we first calculated the principal angles between the instantaneous mirror-neuron execution subspace at selected times I, G, M, or H and the entire time series of instantaneous mirror-neuron observation subspaces (Figure 7A-D).  Conversely, we calculated the principal angles between the instantaneous observation subspaces at selected times I, G, M, or H and the entire time series of instantaneous execution subspaces (Figure 7E-H).  Although the principal angles were slightly smaller than might be expected from chance alone, indicating some minimal overlap of execution and observation instantaneous subspaces, the instantaneous observation subspaces did not show any progressive shift toward the I, G, M, or H execution subspace (Figure 7A-D), nor did the instantaneous execution subspaces shift toward the I, G, M, or H observation subspace (Figure 7E-H).”

      (4) The definition of the instantaneous subspaces is a critical point in the manuscript. I think it is slightly unclear: based on the Methods section #715-722 and the main text #173-#181, I gather that the subspaces are based on trial averaged neural activity for each of the 4 objects, separately. So for each object and per timepoint, a vector of size (1, n) -n neurons- is reduced to a vector of (1, 2 or 3 -the main text says 2, methods say 3-) which would be a single point in the low-d space. Is this description accurate? This should be clarified in the manuscript.  

      In the Methods, we now have clarified (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, W, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, W_i, forming a time series of filters (Figure 1B).”
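      The subspace identification described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `instantaneous_subspace`, the random stand-in firing rates, and the neuron count of 50 are all hypothetical. It shows why four trial-averaged points yield exactly three informative principal components.

```python
import numpy as np

def instantaneous_subspace(mean_rates):
    """Return an (N, 3) orthonormal basis W for one time step.

    mean_rates: (4, N) array holding the trial-averaged firing rates of
    N neurons for each of the four objects at a single time step.
    """
    # Centering four points leaves at most three non-zero dimensions.
    centered = mean_rates - mean_rates.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:3].T, s

# Hypothetical stand-in data: 4 objects x 50 neurons.
rng = np.random.default_rng(0)
mean_rates = rng.normal(size=(4, 50))
W, s = instantaneous_subspace(mean_rates)

# Projecting the centered points through the filter W gives 4 points in 3D.
centered = mean_rates - mean_rates.mean(axis=0, keepdims=True)
low_d = centered @ W
```

      Repeating this at every 1 ms step would give the time series of filters W_i; the fourth singular value is numerically zero, which is the sense in which three dimensions capture all the variance.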

      (5) Isn't the process of projecting segments of neural dynamics and comparing the results equivalent to comparing the projection matrices in the first place? If so, that might have been a more intuitive avenue to follow. 

      As described in more detail in our responses to item 2, above, we have added analyses of principal angles to compare the projection matrices directly.  However, “the process of projecting segments of neural dynamics and comparing the results” incorporates the progressively increasing separation of the trajectory segments and hence is not simply equivalent to comparing the subspaces with principal angles.

      (6) Lines #385-#389: This process seems unnecessarily complicated. Also, given the number of trials available, this sometimes doesn't make sense. E.g. Monkey R exec has only 8 trials of one of the objects, so bootstrapping 20 trials 500 times would be spurious. Why not, as per Gallego et al, Nat Neurosci 2020 and Safaie et al, Nat 2023 which are cited, concatenate the trials? 

      In the Methods we now clarify that (lines 953 to 969):

      “To provide an estimate of variability, we used a bootstrapping approach to CCA.  From each of two data sets we randomly selected 20 trials involving each target object (totaling 80 trials) with replacement, clipped trajectory segments from each of those trials for 100 ms (100 points at 1 ms intervals) after the instruction onset, go cue, movement onset, or beginning of the final hold, and performed CCA as described above. (Note that because session 1 from monkey R included only 8 button trials (Table 1), we excluded this session from CCA analyses.)  With 500 iterations, we obtained a distribution of the correlation coefficients (CCs) between the two data sets in each of the three dimensions of the aligned subspace, which permitted statistical comparisons. We then used this approach to evaluate alignment of latent dynamics between different sessions (e.g. execution trials on two different days), between different contexts (e.g. execution and observation), and between different neural populations (e.g. MNs and AE neurons). This bootstrapping approach further enabled us to assess the consistency of relationships among neural trajectories within a given group, i.e. the same neural population during the same context (execution or observation) in the same session, by drawing two separate random samples of 80 trials from the same population, context, and session (Figure 8D), which would not have been possible had we concatenated trajectory segments from all trials in the session (Gallego et al., 2020; Safaie et al., 2023).”

      And we report results that could not have been obtained by concatenating all the trials (lines 522 to 541):

      “Did these differences in MN:1/2, MN:E/O, and MN/AE alignment result from consistent differences in their respective patterns of co-modulation, or from greater trial-by-trial variability in the patterns of co-modulation among MNs during observation than during execution, and still greater variability among AE neurons during execution?  The bootstrapping approach we used for CCA (see Methods) enabled us to evaluate the consistency of relationships among trajectory segments across repeated samplings of trials recorded from the same neuron population in the same session and in the same context (execution or observation).  We therefore performed 500 iterations of CCA between two different random samples of MN execution (MN:E/E), MN observation (MN:O/O), or AE execution (AE:E/E) trajectory segments from a given session (2 R, 3 T, 3 F). This within-group alignment of MN execution trajectory segments from the same session (Figure 8D, MN:E/E, gray, Hold: () was as strong as between-session alignment (Figure 8C, MN:1/2, black).  But within-group alignment of MN observation trajectory segments (Figure 8D, MN:O/O, orange, Hold: () was lower than that found with MN execution segments (Figure 8C, MN:E/O, red, .  Likewise, within-group alignment of AE neuron trajectory segments (Figure 8D, AE:E/E, light blue, Hold: () was lower than their alignment with MN execution segments (Figure 8C, MN/AE, blue, Hold: ().  Whereas MN execution trajectories were relatively consistent within sessions, MN observation trajectories and AE execution trajectories were less so.”

      Because only 8 button trials were available in Session 1 from Monkey R, we excluded this session from the CCA analyses.  Sessions 2 and 3 from monkey R provide valid results, however.  For example, we now state explicitly (lines 468 to 472):

      “As a positive control, we first aligned MN execution trajectory segments from two different sessions in the same monkey (which we abbreviate as MN:1/2).  The 2 sessions in monkey R provided only 1 possible comparison, but the 3 sessions in monkeys T and F each provided 3 comparisons.  For each of these 7 comparisons, we found the bootstrapped average of CC1, of CC2, and of CC3.”

      (7) Related to the CCA analysis, what behavioural epoch has been used here, the same as the previous analyses, i.e. 100ms? How many datapoints is that in time? Given that CCA is essentially a correlation value, too few datapoints make it rather meaningless. If that's the case, I encourage using, let's say, one window combined of I and G until movement, and one window of movement and hold, such that they are both easier to interpret. Indeed low values of exec-exec in CC2 compared to Gallego et al, Nat Neurosci, 2020 might be a sign of a methodological error. 

      In the Methods described for CCA, we now have clarified that (lines 953 to 961):

      “To provide an estimate of variability, we used a bootstrapping approach to CCA.  From each of two data sets we randomly selected 20 trials involving each target object (totaling 80 trials) with replacement, clipped trajectory segments from each of those trials for 100 ms (100 points at 1 ms intervals) after the instruction onset, go cue, movement onset, or beginning of the final hold, and performed CCA as described above. (Note that because session 1 from monkey R included only 8 button trials (Table 1), we excluded this session from CCA analyses.)  With 500 iterations, we obtained a distribution of the correlation coefficients (CCs) between the two data sets in each of the three dimensions of the aligned subspace, which permitted statistical comparisons.”
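      A rough sketch of the bootstrapped CCA loop follows. Several details are assumptions made only for illustration and are not stated in the response: the sampled trials are averaged per object, the four object segments are concatenated in time before CCA, and the names `canonical_correlations` and `bootstrap_cca` as well as the synthetic data are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two (samples, dims) arrays."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Q_x, _ = np.linalg.qr(X)
    Q_y, _ = np.linalg.qr(Y)
    # Singular values of Q_x^T Q_y are the correlations, largest first.
    return np.linalg.svd(Q_x.T @ Q_y, compute_uv=False)

def bootstrap_cca(data_a, data_b, n_per_object=20, n_iter=500, seed=0):
    """data_a, data_b: lists (one entry per object) of (n_trials, T, 3) arrays."""
    rng = np.random.default_rng(seed)
    ccs = np.empty((n_iter, 3))
    for it in range(n_iter):
        sampled = []
        for data in (data_a, data_b):
            segments = []
            for trials in data:
                # Draw trials with replacement, as described in the text.
                idx = rng.integers(0, trials.shape[0], size=n_per_object)
                # Assumption: average the sampled trials for each object.
                segments.append(trials[idx].mean(axis=0))
            # Assumption: concatenate the four object segments in time.
            sampled.append(np.concatenate(segments, axis=0))
        ccs[it] = canonical_correlations(sampled[0], sampled[1])
    return ccs

# Synthetic stand-in: 4 objects, 30 trials each, 100 time points, 3 dims.
rng = np.random.default_rng(1)
data_a = [rng.normal(size=(30, 100, 3)) for _ in range(4)]
data_b = [rng.normal(size=(30, 100, 3)) for _ in range(4)]
ccs = bootstrap_cca(data_a, data_b, n_iter=20)
```

      Each row of `ccs` holds CC1 to CC3 for one iteration, so the mean and spread across rows give the kind of distribution used for statistical comparison.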

      And in the Results we report that (lines 475 to 480):

      “The highest values for MN:1/2 correlations were obtained for the Movement trajectory segments.  These values indicate consistent relationships among the Movement neural trajectory segments representing the four different RGM movements from session to session, as would have been expected from previous studies (Gallego et al., 2018; Gallego et al., 2020; Safaie et al., 2023).”

      Reviewer #3 (Public Review): 

      Summary: 

      In their study, Zhao et al. investigated the population activity of mirror neurons (MNs) in the premotor cortex of monkeys either executing or observing a task consisting of reaching to, grasping, and manipulating various objects. The authors proposed an innovative method for analyzing the population activity of MNs during both execution and observation trials. This method made it possible to isolate the condition-dependent variance in neural data and to study its temporal evolution over the course of single trials. The method proposed by the authors consists of building a time series of "instantaneous" subspaces with single time step resolution, rather than a single subspace spanning the entire task duration. As these subspaces are computed on an instant time basis, projecting neural activity from a given task time into them results in latent trajectories that capture condition-dependent variance while minimizing the condition-independent one. The authors then analyzed the time evolution of these instantaneous subspaces and revealed that a progressive shift is present in subspaces of both execution and observation trials, with slower shifts during the grasping and manipulating phases compared to the initial preparation phase. Finally, they compared the instantaneous subspaces between execution and observation trials and observed that neural population activity did not traverse the same subspaces in these two conditions. However, they showed that these distinct neural representations can be aligned with Canonical Correlation Analysis, indicating dynamic similarities of neural data when executing and observing the task. The authors speculated that such similarities might facilitate the nervous system's ability to recognize actions performed by oneself or another individual. 

      Strengths: 

      Unlike other areas of the brain, the analysis of neural population dynamics of premotor cortex MNs is not well established. Furthermore, analyzing population activity recorded during non-trivial motor actions, distinct from the commonly used reaching tasks, serves as a valuable contribution to computational neuroscience. This study holds particular significance as it bridges both domains, shedding light on the temporal evolution of the shift in neural states when executing and observing actions. The results are moderately robust, and the proposed analytical method could potentially be used in other neuroscience contexts. 

      Weaknesses: 

      While the overall clarity is satisfactory, the paper falls short in providing a clear description of the mathematical formulas for the different methods used in the study. 

      We have added the various mathematical formulas in the Methods.

      For Cumulative Separation (lines 864 to 871): 

      “To quantify the separation between the four trial-averaged trajectory segments involving the different objects in a given instantaneous subspace, we then calculated their cumulative separation (𝐶𝑆) as: 

      $$CS = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{3} \sum_{j=i+1}^{4} d_{ij}(t)$$

      where d<sub>ij</sub>(t) is the 3-dimensional Euclidean distance between the i<sup>th</sup> and j<sup>th</sup> trajectories at time point 𝑡. We summed the 6 pairwise distances between the 4 trajectory segments across time points and normalized by the number of time points, 𝑇 = 100.  The larger the 𝐶𝑆, the greater the separation of the trajectory segments.”
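      The cumulative separation described here can be computed directly from the verbal definition: sum the 6 pairwise Euclidean distances across time points and divide by the number of time points. The sketch below assumes the four projected segments are stacked in one array; the function name and the toy trajectories are hypothetical.

```python
import numpy as np
from itertools import combinations

def cumulative_separation(segments):
    """segments: (4, T, 3) array of trajectory segments in one 3D subspace."""
    n_objects, n_time, _ = segments.shape
    total = 0.0
    # Loop over the 6 unordered pairs of the 4 trajectories.
    for i, j in combinations(range(n_objects), 2):
        # Euclidean distance between trajectories i and j at each time point.
        d_ij = np.linalg.norm(segments[i] - segments[j], axis=1)
        total += d_ij.sum()
    # Normalize by the number of time points.
    return total / n_time

# Toy example: object k sits at (k, 0, 0) for all T = 100 time points.
segments = np.zeros((4, 100, 3))
for k in range(4):
    segments[k, :, 0] = k
cs = cumulative_separation(segments)
```

      In the toy example the pairwise distances are 1, 2, 3, 1, 2, and 1 at every time point, so the cumulative separation is their sum, 10.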

      For principal angles (lines 877 to 884): 

      For example, given the 3-dimensional instantaneous subspace at the time of movement onset, W<sub>M</sub>, and at any other time, W<sub>i</sub>, we calculated their 3x3 inner product matrix and performed singular value decomposition to obtain:

      $$W_M^{\top} W_i = P_M \, C \, P_i^{\top}$$

      where 3x3 matrices P<sub>M</sub> and P<sub>i</sub> define new manifold directions which successively minimize the 3 principal angles specific to the two subspaces being compared. The elements of diagonal matrix 𝐶 then are the ranked cosines of the principal angles, 𝜃𝑖 , ordered from smallest to largest: 

      $$C = \mathrm{diag}\left(\cos\theta_1, \cos\theta_2, \cos\theta_3\right)$$
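      As a minimal sketch of this computation, principal angles between two subspaces can be read off the singular values of the inner product of their orthonormal bases. The function name `principal_angles` and the small 4-dimensional example are illustrative assumptions, not taken from the manuscript.

```python
import numpy as np

def principal_angles(W_a, W_b):
    """Principal angles in degrees, smallest first.

    W_a, W_b: (N, 3) matrices whose columns are orthonormal bases.
    """
    # Singular values of the 3x3 inner product matrix are the cosines.
    cosines = np.linalg.svd(W_a.T @ W_b, compute_uv=False)
    # Guard against round-off slightly outside [-1, 1].
    cosines = np.clip(cosines, -1.0, 1.0)
    return np.degrees(np.arccos(cosines))

# Example in 4 dimensions: two shared axes, third axis rotated by 30 degrees.
W_a = np.eye(4)[:, :3]
W_b = W_a.copy()
W_b[:, 2] = [0.0, 0.0, np.cos(np.radians(30)), np.sin(np.radians(30))]
angles = principal_angles(W_a, W_b)
```

      Because singular values are returned in descending order, the first angle is the smallest one, which is the quantity plotted over time in the principal-angle analyses.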

      For CCA (lines 945 to 952): 

      “CCA was performed as follows: The original latent dynamics, L<sub>A</sub> and L<sub>B</sub>, first were transformed and decomposed as $L_A^{\top} = Q_A R_A$ and $L_B^{\top} = Q_B R_B$.  The first m = 3 column vectors of each 𝑄𝑖 provide an orthonormal basis for the column vectors of $L_i^{\top}$ (where 𝑖 = 𝐴, 𝐵).  Singular value decomposition on the inner product matrix of  𝑄𝐴 and 𝑄𝐵 then gives $Q_A^{\top} Q_B = U S V^{\top}$, and new manifold directions that maximize pairwise correlations are provided by $M_A = R_A^{-1} U$ and $M_B = R_B^{-1} V$.  We then projected the original latent dynamics into the new, common subspace: $\tilde{L}_A^{\top} = L_A^{\top} M_A$; $\tilde{L}_B^{\top} = L_B^{\top} M_B$.  Pairwise correlation coefficients between the aligned latent dynamics sorted from largest to smallest then are given by the elements of the diagonal matrix $S$.”
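      The alignment procedure can be sketched with the standard QR-plus-SVD formulation of CCA. The symbol names (`Q`, `R`, `U`, `S`, `M`), the function name `cca_align`, and the explicit mean-centering step are assumptions made for this illustration; centering is included so that the singular values equal Pearson correlations.

```python
import numpy as np

def cca_align(L_a, L_b):
    """Align two sets of latent dynamics, each of shape (m, T).

    Returns the aligned (T, m) dynamics and the canonical correlations.
    """
    # Transpose to (T, m) and center each latent dimension over time.
    X = (L_a - L_a.mean(axis=1, keepdims=True)).T
    Y = (L_b - L_b.mean(axis=1, keepdims=True)).T
    # QR decomposition gives orthonormal bases for the column spaces.
    Q_a, R_a = np.linalg.qr(X)
    Q_b, R_b = np.linalg.qr(Y)
    # SVD of the inner product matrix of the two bases.
    U, S, Vt = np.linalg.svd(Q_a.T @ Q_b)
    # New directions: M_a = R_a^{-1} U and M_b = R_b^{-1} V.
    M_a = np.linalg.solve(R_a, U)
    M_b = np.linalg.solve(R_b, Vt.T)
    return X @ M_a, Y @ M_b, S

# Synthetic example: L_b is a noisy linear mixture of L_a.
rng = np.random.default_rng(0)
L_a = rng.normal(size=(3, 100))
A = rng.normal(size=(3, 3))
L_b = A @ L_a + 0.5 * rng.normal(size=(3, 100))
aligned_a, aligned_b, S = cca_align(L_a, L_b)
```

      With 100 time points per segment, each aligned dimension is a 100-sample time series, and the entries of `S` match the Pearson correlation between corresponding aligned dimensions.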

      Moreover, it was not immediately clear why the authors did not consider a (relatively) straightforward metric to quantify the progressive shift of the instantaneous subspaces, such as computing the angle between consecutive subspaces, rather than choosing a (in my opinion) more cumbersome metric based on classification of trajectory segments representing different movements. 

      Point taken.  We now have calculated the principal angles as a function of time and present them as a new section of the Results including new figure 4 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”

      The related Methods are now described in subsection “Subspace Comparisons—Principal Angles”

      Specific comments: 

      In the methods, it is stated that instantaneous subspaces are found with 3 PCs. Why does it say 2 here?  

      We now have clarified (lines 295 to 310):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.  To illustrate this increasing separation, we clipped 100 ms segments of high-dimensional MN population trial-averaged trajectories beginning at times I, G, M, and H, for trials involving each of the four objects.  We then projected the set of four object-specific trajectory segments clipped at each time into each of the four instantaneous 3D subspaces at times I, G, M, and H.  This process was repeated separately for execution trials and for observation trials.  

      For visualization, we projected these trial-averaged trajectory segments from an example session into the PC1 vs PC2 planes (which consistently captured > 70% of the variance) of the I, G, M, or H instantaneous 3D subspaces.  In Figure 5, the trajectory segments for each of the four objects (sphere – purple, button – cyan, coaxial cylinder – magenta, perpendicular cylinder – yellow) sampled at different times (rows) have been projected into each of the four instantaneous subspaces defined at different times (columns).  Rather than appearing knotted as in Figure 3, these short trajectory segments are distinct when projected into each instantaneous subspace.”

      And in the legend for Figure 5 we now clarify that:

      “Each set of these four segments then was projected into the PC1 vs PC2 plane of the instantaneous 3D subspace present at four different times (columns: I, G, M, H).”

      Another doubt on how instantaneous subspaces are computed: in the methods you state that you apply PCA on trial-averaged activity at each 50ms time step. From the next sentence, I gather that you apply PCA on an Nx4 data matrix (N being the number of neurons, and 4 being the trial-averaged activity of the four objects) every 50 ms. Is this right? It would help to explicitly specify the dimensions of the data matrix that goes into PCA computation. 

      We apologize for this confusion.  Although the LSTM decoding was performed in 50 ms time steps, the instantaneous subspaces were calculated at 1 ms intervals. In the Methods we now have clarified (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, W, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, W_i, forming a time series of filters (Figure 1B).”

      It would help to include some equations in the methods section related to the LSTM decoding. Just to make sure I understood correctly: after having identified the instantaneous subspaces (every 50 ms), you projected the Instruction, Go, Movement, and Holding segments from individual trials (each containing 100 samples, since they are sampled from a 100ms window) onto each instantaneous subspace. So you have four trajectories for each subspace. In the methods, it is stated that a single LSTM classifier is trained for each subspace. Do you also have a separate classifier for each trajectory segment? What is used as input to the classifier? Each trajectory segment should be a 100x3 matrix once projected in an instantaneous subspace. Is that what (each of) the LSTMs take as input? And lastly, what is the LSTM trained to predict exactly? Just a label indicating the type of object that was manipulated in that trial? I apologize if I overlooked any detail, but I believe a clearer explanation of the LSTM, preferably with mathematical formulas, would greatly help readers understand this section. 

      LSTM decoding is not readily described with a set of equations.  However, we have expanded our description to provide the information requested (lines 910 to 937):

      “Decodable information—LSTM

      As illustrated schematically in Figure 1B, the same segment of high-dimensional neural activity projected into different instantaneous subspaces can generate low-dimensional trajectories of varying separation.  The degree of separation among the projected trajectory segments will depend, not only on their separation at the time when the segments were clipped, but also on the similarity of the subspaces into which the trajectory segments are projected.  To quantify the combined effects of trajectory separation and projection into different subspaces, we projected high-dimensional neural trajectory segments (each including 100 points at 1 ms intervals) from successful trials involving each of the four different target objects into time series of 3-dimensional instantaneous subspaces at 50 ms intervals. In each of these instantaneous subspaces, the neural trajectory segment from each trial thus became a 100 point x 3 dimensional matrix.  For each instantaneous subspace in the time series, we then trained a separate long short-term memory (LSTM, (Hochreiter and Schmidhuber, 1997)) classifier to attribute each of the neural trajectories from individual trials to one of the four target object labels: sphere, button, coaxial cylinder, or perpendicular cylinder. Using MATLAB’s Deep Learning Toolbox, each LSTM classifier had 3 inputs (instantaneous subspace dimensions), 20 hidden units in the bidirectional LSTM layer, and a softmax layer preceding the classification layer which had 4 output classes (target objects). The total number of successful trials available in each session for each object is given in Table 1.  To avoid bias based on the total number of successful trials, we used the minimum number of successful trials across the four objects in each session, selecting that number from the total available randomly with replacement. 
Each LSTM classifier was trained with MATLAB’s adaptive moment estimation (Adam) optimizer on 40% of the selected trials, and the remaining 60% were decoded by the trained classifier.  The success of this decoding was used as an estimate of classification accuracy from 0 (no correct classifications) to 1 (100% correct classifications). This process was repeated 10 times and the mean ± standard deviation across the 10 folds was reported as the classification accuracy at that time.  Classification accuracy of trials projected into each instantaneous subspace at 50 ms intervals was plotted as a function of trial time.”
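      The trial selection and train/test split described above can be sketched independently of the classifier itself. The LSTM is omitted here; the trial counts, object keys, and the function name `balanced_split` are hypothetical, and splitting per object (rather than over the pooled trials) is an assumption not stated in the text.

```python
import numpy as np

def balanced_split(trial_counts, train_frac=0.4, rng=None):
    """Pick equal numbers of trials per object and split into train/test.

    trial_counts: dict mapping object name to number of successful trials.
    Returns a dict mapping object name to (train_indices, test_indices).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Use the minimum number of successful trials across the four objects.
    n_select = min(trial_counts.values())
    n_train = int(round(train_frac * n_select))
    split = {}
    for obj, count in trial_counts.items():
        # Select trials randomly with replacement, as described in the text.
        idx = rng.choice(count, size=n_select, replace=True)
        split[obj] = (idx[:n_train], idx[n_train:])
    return split

# Hypothetical trial counts for one session.
counts = {"sphere": 40, "button": 25, "coaxial": 38, "perpendicular": 30}
rng = np.random.default_rng(0)
# Repeat the selection 10 times, once per fold.
folds = [balanced_split(counts, rng=rng) for _ in range(10)]
```

      One consequence of sampling with replacement before splitting is that the same trial can land in both the training and test portions of a fold; whether that occurred in the original analysis is not stated, so the sketch simply follows the description as written.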

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Here are some more specific comments. 

      Abstract. Line 41. "same action" is not justified, there is plenty of evidence showing that the action does not need to be the same (or it has not even to be an action), rephrasing or substituting with "similar" is necessary, especially in the light of the subsequent sentence (which is totally correct). 

      Thank you for pointing this out.  As recommended, we have changed “same” to “similar” (lines 40 to 41):  

      “Many neurons in the premotor cortex show firing rate modulation whether the subject performs an action or observes another individual performing a similar action.”

      Introduction. A relevant, missing reference in the otherwise exhaustive introduction is Albertini et al. 2021 J Neurophysiol, showing that neural dynamics and similarities between biological and nonbiological movements in premotor areas are greater than those between the same executed and observed movements. 

      Thank you for pointing out this important finding.  After revision, we felt it was now cited most appropriately in the revised Discussion as follows (lines 730 to 736):

      “Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021).”

      In Line 85, the sentence about Papadourakis and Raos 2019 has to be generalized to PMv, as they show that the proportion of congruent MNs is at chance in both PMd and PMv. 

      Point taken.  We have rephrased this sentence as follows (lines 88 to 89): 

      “And in both PMv and PMd, the proportion of congruent neurons may not be different from that expected by chance alone (Papadourakis and Raos, 2019).”

      Lines 122-132. The initial sentence was unclear to me at first glance. I was wondering how subspaces could be "at other times over the course of the trial" if they are instantaneous. I could imagine that the subspaces referred to corresponding behavioral intervals of execution and observation conditions (and this may be what they will later call "condition dependent" activity), but nevertheless, they could hardly be understood as "instantaneous". I grasped the author's idea only when reading the results, with the statement "no-time dependent variance is captured". The idea is to take a static snapshot of the evolution of population activity at each checkpoint (i.e. I, G, M, and H): I suggest clarifying this point immediately in the introduction to improve readability. 

      We have clarified this point by adding two paragraphs to the Introduction, first defining condition-independent versus condition-dependent variance and then explaining the use of instantaneous subspaces (lines 125 to 153):

      “A relevant but often overlooked aspect of such dynamics in neuron populations active during both execution and observation has to do with the distinction between condition-independent and condition-dependent variation in neuronal activity (Kaufman et al., 2016; Rouse and Schieber, 2018).  The variance in neural activity averaged across all the conditions in a given task context is condition-independent.  For example, in an 8-direction center-out reaching task, averaging a unit’s firing rate as a function of time across all 8 directions may show an initially low firing rate that increases prior to movement onset, peaks during the movement, and then declines during the final hold, irrespective of the movement direction.  Subtracting this condition-independent activity from the unit’s firing rate during each trial gives the remaining variance, and averaging separately across trials in each of the 8 directions then averages out noise variance, leaving the condition-dependent variance that represents the unit’s modulation among the 8 directions (conditions). Alternatively, condition-independent, condition-dependent, and noise variance can be partitioned through demixed principal component analysis (Kobak et al., 2016; Gallego et al., 2018).  The extent to which neural dynamics occur in a subspace shared by execution and observation versus subspaces unique to execution or observation may differ for the condition-independent versus condition-dependent partitions of neural activity.  Here, we tested the hypothesis that the condition-dependent activity of PM mirror neuron populations progresses through distinct subspaces during execution versus observation, which would indicate distinct patterns of co-modulation amongst mirror neurons during execution versus observation.

      Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.”

      Results. 

      Regarding the execution-observation alignment, as explained in my initial comment, it does not sound convincing. Applying a CCA to align EXE and OBS activities (which the authors had just shown being essentially not aligned), even separately for each epoch segment (line 396), seems to be a trick to show that they nonetheless share some similarities. Couldn't this be applied to any pairs of differently encoded conditions to create some sort of artificial link between them? Is the similarity in the neural data or rather in the method used to realign them? 

      CCA would not align arbitrary sets of neural data.  The similarity is in the data, not in the method.  For example, in an 8-direction center-out task, the neural representation of movement to the 45° target is between the neural representations of the 0° and the 90° targets.  If the same is true in a second data set, then CCA will give high correlation coefficients.  But if in the second data set the neural representation of the 45° target is between the 135° and 180° targets, CCA will give low correlation coefficients. 
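
      As a minimal sketch of this point (an illustration only, not the analysis code used in the study), the toy example below computes canonical correlations with a QR-plus-SVD approach. The eight 2D "representations", the mixing matrix, and the scrambled arrangement are all hypothetical stand-ins chosen for clarity.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two (samples x features) matrices."""
    # Center each data set, then orthonormalize its columns with QR.
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    # Singular values of Qx^T Qy are the canonical correlation coefficients.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Hypothetical 2D representations of the 8 center-out targets (0, 45, ..., 315 deg).
theta = np.deg2rad(np.arange(0, 360, 45))
set_a = np.column_stack([np.cos(theta), np.sin(theta)])

# Second data set with the same neighbor relationships: a linear remix of set_a.
mix = np.array([[0.5, -1.2], [2.0, 0.3]])  # assumed, any invertible matrix works
same_geometry = set_a @ mix

# Second data set with scrambled relationships: each target's representation
# is placed at three times its angle, so neighbors are no longer neighbors.
scrambled = np.column_stack([np.cos(3 * theta), np.sin(3 * theta)])

cc_same = canonical_correlations(set_a, same_geometry)
cc_scrambled = canonical_correlations(set_a, scrambled)
print("same geometry:", cc_same)
print("scrambled:    ", cc_scrambled)
```

      In this toy case the correlations are high only when the relationships among the conditions are preserved, which is the sense in which the similarity lies in the data rather than in the method.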

      In the end, what does this tell us about the brain? 

      In the Introduction we now clarify that (lines 166 to 170):

      “Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.”

      And in the Results (lines 449 to 455):

      “For example, the trajectories of PMd+M1 neuron populations recorded from two different monkeys during center-out reaching movements could be aligned well (Safaie et al., 2023).  CCA showed, for example, that in both brains the neural trajectory for the movement to the target at 0° was closer to the trajectory for movement to the target at 45° than to the trajectory for the movement to the target at 180°. Relationships among these latent dynamic representations of the eight movements thus were similar even though the neural populations were recorded from two different monkeys.”

      And in relation to Figure 8 (lines 461 to 467):

      “But when both sets of trajectory segments are projected into another common subspace identified with CCA, as shown in Figure 8B, a similar relationship among the neural representations of the four movements during execution and observation is revealed.  In both behavioral contexts the neural representation of movements involving the sphere (purple) is now closest to the representation of movements involving the coaxial cylinder (magenta) and farthest from that of movements involving the button (cyan). The two sets of trajectory segments are more or less “aligned.””

      And in the Discussion (lines 665 to 674):

      “Corresponding neural representations of action execution and observation during task epochs with higher neural firing rates have been described previously in PMd MNs and in PMv MNs using representational similarity analysis (RSA) (Papadourakis and Raos, 2019).  And during force production in eight different directions, neural trajectories of PMd neurons draw similar “clocks” during execution, cooperative execution, and passive observation (Pezzulo et al., 2022).  Likewise in the present study, despite execution and observation trajectories progressing through largely distinct subspaces, in all three monkeys execution and observation trajectory segments showed some degree of alignment, particularly the Movement and Hold segments (Figure 12A), indicating similar relationships among the latent dynamic representations of the four RGM movements during execution and observation.”

      Concerning the discussion, I would like to reconsider it after having seen the authors' response to the comments above and to my general concern about the relevance of the findings from the neurophysiological point of view. 

      Certainly, please do.

      Reviewer #2 (Recommendations For The Authors): 

      Here are a few issues that I want to bring to the authors' attention (in no particular order): 

      • I am not clear on what is meant by "condition-dependent". Is the condition exec vs obs, or the object types? 

      In the Introduction, we now clarify (lines 125 to 144): 

      “A relevant but often overlooked aspect of such dynamics in neuron populations active during both execution and observation has to do with the distinction between condition-independent and condition-dependent variation in neuronal activity (Kaufman et al., 2016; Rouse and Schieber, 2018).  The variance in neural activity averaged across all the conditions in a given task context is condition-independent.  For example, in an 8-direction center-out reaching task, averaging a unit’s firing rate as a function of time across all 8 directions may show an initially low firing rate that increases prior to movement onset, peaks during the movement, and then declines during the final hold, irrespective of the movement direction.  Subtracting this condition-independent activity from the unit’s firing rate during each trial gives the remaining variance, and averaging separately across trials in each of the 8 directions then averages out noise variance, leaving the condition-dependent variance that represents the unit’s modulation among the 8 directions (conditions). Alternatively, condition-independent, condition-dependent, and noise variance can be partitioned through demixed principal component analysis (Kobak et al., 2016; Gallego et al., 2018).  The extent to which neural dynamics occur in a subspace shared by execution and observation versus subspaces unique to execution or observation may differ for the condition-independent versus condition-dependent partitions of neural activity.  Here, we tested the hypothesis that the condition-dependent activity of PM mirror neuron populations progresses through distinct subspaces during execution versus observation, which would indicate distinct patterns of co-modulation amongst mirror neurons during execution versus observation.”
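
      The partition described in this passage can be sketched for a single simulated unit as follows. This is an illustrative example under assumed sizes and a made-up firing-rate model, not the code used for the analyses; the array names and the simulated time course are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cond, n_trials, n_time = 8, 20, 100  # hypothetical: 8 directions, 20 trials each

# Simulated firing rates: shared time course + direction tuning + trial noise.
t = np.linspace(0, 1, n_time)
shared = 10 + 20 * np.exp(-((t - 0.5) ** 2) / 0.02)            # (n_time,)
directions = np.deg2rad(np.arange(0, 360, 45))
tuning = 5 * np.cos(directions)[:, None] * (t > 0.3)           # (n_cond, n_time)
noise_sd = 2.0
rates = (shared + tuning[:, None, :]
         + rng.normal(0, noise_sd, (n_cond, n_trials, n_time)))  # (cond, trial, time)

# Condition-independent activity: average over all conditions and trials.
cond_independent = rates.mean(axis=(0, 1))                     # (n_time,)

# Subtract it from every trial, then average within each condition so that
# trial-to-trial noise averages out, leaving condition-dependent activity.
residual = rates - cond_independent
cond_dependent = residual.mean(axis=1)                         # (n_cond, n_time)

# What remains on each trial is treated as noise.
noise = residual - cond_dependent[:, None, :]
```

      By construction, the condition-dependent traces sum to zero across conditions at every time point, so they describe only how the unit's activity differs among conditions, here the 8 directions.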

      And in the Results, we have added a new Figure 3 to illustrate condition-independent versus condition-dependent activity using an example from the present data sets (lines 208 to 236): 

      “Condition-dependent versus condition-independent neural activity in PM MNs

      Whereas a large fraction of condition-dependent neural variance during reaching movements without grasping can be captured in a two-dimensional subspace (Churchland et al., 2012; Ames et al., 2014), condition-dependent activity in movements that involve grasping is more complex (Suresh et al., 2020). In part, this may reflect the greater complexity of controlling the 24 degrees of freedom in the hand and wrist as compared to the 4 degrees of freedom in the elbow and shoulder (Sobinov and Bensmaia, 2021).  Figure 3 illustrates this complexity in a PM MN population during the present RGM movements.  Here, PCA was performed on the activity of a PM MN population across the entire time course of execution trials involving all four objects.  The colored traces in Figure 3A show neural trajectories averaged separately across trials involving each of the four objects and then projected into the PC1 vs PC2 plane of the total neural space.  Most of the variance in these four trajectories is comprised of a shared rotational component.  The black trajectory, obtained by averaging trajectories from trials involving all four objects together, represents this condition-independent (i.e. independent of the object involved) activity.  The condition-dependent (i.e. dependent on which object was involved) variation in activity is reflected by the variation in the colored trajectories around the black trajectory.  The condition-dependent portions can be isolated by subtracting the black trajectory from each of the colored trajectories. The resulting four condition-dependent trajectories have been projected into the PC1 vs PC2 plane of their own common subspace in Figure 3B.  Rather than exhibiting a simple rotational motif, these trajectories appear knotted. To better understand how these complex, condition-dependent trajectories progress over the time course of RGM trials, we chose to examine time series of instantaneous subspaces.”

      While there is an emphasis on the higher complexity of manipulating objects compared to just reaching movements in the Abstract, the majority of the analysis relates to the instruction, movement initiation, and grasp, and there are no specific analyses looking at manipulation and how those presumably more complex dynamics compare to the reaching dynamics, and how they differ from reaching in the mirror neurons. 

      We have clarified that (lines 178 to 187):

      “Because we chose to study relatively naturalistic movements, the reach, grasp, and manipulation components were not performed separately, but rather in a continuous fluid motion during the movement epoch of the task sequence (Figure 2B).  In previous studies involving a version of this task without separate instruction and delay epochs, we have shown that joint kinematics, EMG activity, and neuron activity in the primary motor cortex, all vary throughout the movement epoch in relation to both reach location and object grasped, with location predominating early in the movement epoch and object predominating later (Rouse and Schieber, 2015, 2016a, b).  The present task, however, did not dissociate the reach, the hand shape used to grasp the object, and the manipulation performed on the object.”

      • The analysis in Fig3C,D is interesting; however, in my opinion, it requires a control. For instance, what would these values look like if you projected the segments to a subspace defined by the activity during the entire length of the trial, or if you projected the activity during intertrials, just to get a sense of how meaningful these values are? 

      This material is now presented in Figure 5 – figure supplement 1.  In the legend to this figure supplement, we have clarified that (lines 327 to 328):

      “CS values, which we use only to characterize the phenomenon of trajectory separation,….”

      • MN is used (#85) before definition (#91). Similarly for RGM, I believe. 

      Thanks for catching this problem.  We have now defined these abbreviations at first use as follows:

      In lines 89 to 92:

      “Though many authors apply the term mirror neurons strictly to highly congruent neurons, here we will refer to all neurons modulated during both contexts—execution and observation—as mirror neurons (MNs).”

      And in lines 148 to 150:

      “We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.”

      • I believe in the Intro when presenting the three hypotheses, there is a First, and a Third, but no Second. 

      We have revised this part of the Introduction without numbering our hypotheses as follows (lines 145 to 173):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.

      We then tested the hypothesis that the condition-dependent subspace shifts progressively over the time course of behavioral trials (Figure 1A) by calculating the principal angles between four selected instantaneous subspaces that occurred at times easily defined in each behavioral trial—instruction onset (I), go cue (G), movement onset (M), and the beginning of the final hold (H)—and every other instantaneous subspace in the time series.  Initial analyses showed that condition-dependent neural trajectories for the four RGM movements tended to separate increasingly over the course of behavioral trials.  We therefore additionally examined the combined effects of i) the progressively shifting subspaces and ii) the increasing trajectory separation, by decoding neural trajectory segments sampled for 100 msec after times I, G, M, and H and projected into the time series of instantaneous subspaces (Figure 1B).

      Finally, we used canonical correlation to ask whether the prevalent patterns of mirror neuron co-modulation showed similar relationships among the four RGM movements during execution and observation (Figure 1C).  Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.  And finally, because we previously have found that during action execution the activity of PM mirror neurons tends to lead that of non-mirror neurons which are active only during action execution (AE neurons) (Mazurek and Schieber, 2019), we performed parallel analyses of the instantaneous state space of PM AE neurons.”

      • The use of the term 'instantaneous subspaces' in the abstract confused me initially, as I wasn't sure what it meant. It might be a good idea to define or rephrase it. 

      In the Abstract we now state (lines 51 to 52):

      “Rather than following neural trajectories in subspaces that contain their entire time course, we identified time series of instantaneous subspaces …”

      And in the Introduction, we have clarified (lines 145 to 153):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.”

      And in the Methods (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, 𝑊, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, 𝑊𝑖, forming a time series of filters (Figure 1B).”

      Reviewer #3 (Recommendations For The Authors): 

      (1) Page 4, lines 127-131. In the introduction, it was not immediately clear to me what you meant by 'separation' and 'decoding' of the projected neural activity. You do mention that you are separating/decoding trajectory segments representing different movements at the end of this paragraph, but at this point of the paper it was not very clear to me what those different movements were (I only understood that after reading the results section). I suggest briefly expanding on these concepts here. 

      To clarify these points in the Introduction, we have expanded exposition of these concepts (lines 145 to 163):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.

      We then tested the hypothesis that the condition-dependent subspace shifts progressively over the time course of behavioral trials (Figure 1A) by calculating the principal angles between four selected instantaneous subspaces that occurred at times easily defined in each behavioral trial—instruction onset (I), go cue (G), movement onset (M), and the beginning of the final hold (H)—and every other instantaneous subspace in the time series.  Initial analyses showed that condition-dependent neural trajectories for the four RGM movements tended to separate increasingly over the course of behavioral trials.  We therefore additionally examined the combined effects of i) the progressively shifting subspaces and ii) the increasing trajectory separation, by decoding neural trajectory segments sampled for 100 msec after times I, G, M, and H and projected into the time series of instantaneous subspaces (Figure 1B).”
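
      For readers unfamiliar with principal angles, the following minimal sketch shows one standard way to compute them between two subspaces, each given by a basis matrix. It is offered as an illustration under assumed inputs, not as the code used in the study; the function name and the toy 10-dimensional bases are hypothetical.

```python
import numpy as np

def principal_angles_deg(W1, W2):
    """Principal angles in degrees between the column spaces of W1 and W2."""
    # Orthonormalize each basis so that singular values are cosines of angles.
    Q1, _ = np.linalg.qr(W1)
    Q2, _ = np.linalg.qr(W2)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Toy 3D subspaces of a 10-dimensional "neural" space (axes are hypothetical).
eye = np.eye(10)
W_a = eye[:, [0, 1, 2]]   # spanned by axes 0, 1, 2
W_b = eye[:, [0, 1, 5]]   # shares two axes with W_a
W_c = eye[:, [3, 4, 5]]   # shares no axes with W_a

angles_same = principal_angles_deg(W_a, W_a)      # identical subspaces
angles_partial = principal_angles_deg(W_a, W_b)   # partially overlapping
angles_orth = principal_angles_deg(W_a, W_c)      # orthogonal subspaces
```

      Identical subspaces give angles near 0°, orthogonal subspaces give angles near 90°, and partially overlapping subspaces give a mixture of the two.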

      (2) Page 6, line 175. In the methods, it is stated that instantaneous subspaces are found with 3 PCs. Why does it say 2 here? 

      Thank you for noticing this discrepancy.  In the Methods, we have clarified that the instantaneous subspaces are 3-dimensional (see our reply to the next comment), but in Figure 5 (previously Figure 3), for purposes of visualization, we are projecting trajectory segments into the PC1-PC2 plane (lines 295 to 308):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.  To illustrate this increasing separation, we clipped 100 ms segments of high-dimensional MN population trial-averaged trajectories beginning at times I, G, M, and H, for trials involving each of the four objects.  We then projected the set of four object-specific trajectory segments clipped at each time into each of the four instantaneous 3D subspaces at times I, G, M, and H.  This process was repeated separately for execution trials and for observation trials.  

      For visualization, we projected these trial-averaged trajectory segments from an example session into the PC1 vs PC2 planes (which consistently captured > 70% of the variance) of the I, G, M, or H instantaneous 3D subspaces.  In Figure 5, the trajectory segments for each of the four objects (sphere – purple, button – cyan, coaxial cylinder – magenta, perpendicular cylinder – yellow) sampled at different times (rows) have been projected into each of the four instantaneous subspaces defined at different times (columns).”

      And in the legend for Figure 5 we now clarify that:

      “Each set of these four segments then was projected into the PC1 vs PC2 plane of the instantaneous 3D subspace present at four different times (columns: I, G, M, H).”

      Another doubt on how instantaneous subspaces are computed: in the methods you state that you apply PCA on trial-averaged activity at each 50ms time step. From the next sentence, I gather that you apply PCA on an Nx4 data matrix (N being the number of neurons, and 4 being the trial-averaged activity of the four objects) every 50 ms. Is this right? It would help to explicitly specify the dimensions of the data matrix that goes into PCA computation. 

      Thank you for catching an error: The instantaneous subspaces were computed at 1 ms intervals. (It is the LSTM decoding that was done in 50 ms time steps).  We have clarified how the instantaneous subspaces were computed in the Methods (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, 𝑊, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, 𝑊𝑖, forming a time series of filters (Figure 1B).”
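
      To make the dimensions explicit, the sketch below applies PCA to the four condition-averaged points at one time step, here arranged as a 4 x N matrix (objects in rows, neurons in columns). It is a hedged illustration using random stand-in data; the function name, the number of neurons, and the number of time steps are assumptions, not values from the study.

```python
import numpy as np

def instantaneous_subspace(cond_means):
    """PCA on condition-averaged neural states at a single time step.

    cond_means: array of shape (n_conditions, n_neurons), one row per object.
    Returns W of shape (n_neurons, n_conditions - 1) and all singular values.
    """
    centered = cond_means - cond_means.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # After centering, n_conditions points span at most n_conditions - 1 dims.
    k = cond_means.shape[0] - 1
    return vt[:k].T, s

rng = np.random.default_rng(1)
n_steps, n_objects, n_neurons = 5, 4, 50          # hypothetical sizes
mean_rates = rng.normal(size=(n_steps, n_objects, n_neurons))

# One filter W_i per time step gives a time series of instantaneous subspaces.
filters = []
for i in range(n_steps):
    W_i, sing_vals = instantaneous_subspace(mean_rates[i])
    filters.append(W_i)

# Projecting the four N-dimensional points through W_i gives 3D coordinates.
projected = (mean_rates[-1] - mean_rates[-1].mean(axis=0)) @ filters[-1]
```

      Because the fourth singular value is zero up to numerical precision, three principal components capture all of the variance among the four points, as stated in the Methods.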

      (3) Page 7, line 210-212. I am not sure if I missed it in the discussion, but have you speculated on why the greatest separation in observation trials was observed during the holding phase while in execution trials during the movement phase? 

      This was a consistent finding, and we therefore point it out as a difference between execution and observation.  Of course, this reflects greater condition-dependent variance in the PM MN population in the movement epoch than in the hold epoch during execution, whereas the reverse is true during observation.  We have no clear speculation as to why this occurs, however.

      (4) Figure 3. Add a legend with color scheme for each object in panels A and B. Also, please specify what metric is represented by the colorbar of panels C, D, E, F (write it down next to the colorbar itself and not just in the caption). 

      This is now Figure 5.  We have added a color legend for A and B.  Panels C, D, E, and F now have been moved to Figure 5 – figure supplement 1, where we have indicated that the colorbar represents cumulative separation.

      (5) Page 9, line 228. I found the description of this decoding analysis a bit confusing initially (and perhaps still do), this should be clarified. 

      We have clarified our decoding analysis in the Methods (lines 910 to 937):

      “Decodable information—LSTM

      As illustrated schematically in Figure 1B, the same segment of high-dimensional neural activity projected into different instantaneous subspaces can generate low-dimensional trajectories of varying separation.  The degree of separation among the projected trajectory segments will depend, not only on their separation at the time when the segments were clipped, but also on the similarity of the subspaces into which the trajectory segments are projected.  To quantify the combined effects of trajectory separation and projection into different subspaces, we projected high-dimensional neural trajectory segments (each including 100 points at 1 ms intervals) from successful trials involving each of the four different target objects into time series of 3-dimensional instantaneous subspaces at 50 ms intervals. In each of these instantaneous subspaces, the neural trajectory segment from each trial thus became a 100 point x 3 dimensional matrix.  For each instantaneous subspace in the time series, we then trained a separate long short-term memory (LSTM, (Hochreiter and Schmidhuber, 1997)) classifier to attribute each of the neural trajectories from individual trials to one of the four target object labels: sphere, button, coaxial cylinder, or perpendicular cylinder. Using MATLAB’s Deep Learning Toolbox, each LSTM classifier had 3 inputs (instantaneous subspace dimensions), 20 hidden units in the bidirectional LSTM layer, and a softmax layer preceding the classification layer which had 4 output classes (target objects). The total number of successful trials available in each session for each object is given in Table 1.  To avoid bias based on the total number of successful trials, we used the minimum number of successful trials across the four objects in each session, selecting that number from the total available randomly with replacement. 

      Each LSTM classifier was trained with MATLAB’s adaptive moment estimation (Adam) optimizer on 40% of the selected trials, and the remaining 60% were decoded by the trained classifier.  The success of this decoding was used as an estimate of classification accuracy from 0 (no correct classifications) to 1 (100% correct classifications). This process was repeated 10 times and the mean ± standard deviation across the 10 folds was reported as the classification accuracy at that time.  Classification accuracy of trials projected into each instantaneous subspace at 50 ms intervals was plotted as a function of trial time.”
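
      The data preparation that precedes the classifier can be sketched as follows. This Python illustration covers only the projection of single-trial segments into one instantaneous subspace, the balanced resampling with replacement, and the 40%/60% split; the LSTM itself (trained in MATLAB in the study) is omitted. The trial counts, the random filter, and the random segments are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, seg_len = 50, 100                       # 100 points at 1 ms intervals
# Hypothetical numbers of successful trials per object in one session.
trials_per_object = {"sphere": 23, "button": 20,
                     "coaxial cylinder": 25, "perpendicular cylinder": 21}

# Stand-in orthonormal filter W for one instantaneous 3D subspace.
W, _ = np.linalg.qr(rng.normal(size=(n_neurons, 3)))

# Use the minimum trial count across objects to avoid bias.
n_min = min(trials_per_object.values())
X, y = [], []
for label, (name, n_trials) in enumerate(trials_per_object.items()):
    segments = rng.normal(size=(n_trials, seg_len, n_neurons))  # stand-in data
    picked = rng.choice(n_trials, size=n_min, replace=True)     # with replacement
    X.append(segments[picked] @ W)                 # each trial: 100 x 3 matrix
    y.append(np.full(n_min, label))
X = np.concatenate(X)                              # (4 * n_min, 100, 3)
y = np.concatenate(y)

# Train on 40% of the selected trials and decode the remaining 60%.
order = rng.permutation(len(y))
n_train = int(round(0.4 * len(y)))
train_idx, test_idx = order[:n_train], order[n_train:]
```

      Each classifier would then be trained on `X[train_idx]` and evaluated on `X[test_idx]`, with the whole procedure repeated for every instantaneous subspace in the time series.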

      (6) Page 9, line 268. This might be trivial, but can you speculate on why the accuracy for Instruction segments had a lower peak compared to the rest of the segments? Is it because there is less 'distinct' information embedded in neural data about the type of object manipulated until you are actually reaching toward it or holding it? The latter seems straightforward, but the former not so much. 

      Thank you for asking this question.  We have added the following speculations (lines 592 to 604): 

      “Short bursts of “signal” related discharge are known to occur in a substantial fraction of PMd neurons beginning at latencies of ~60 ms following an instructional stimulus (Weinrich et al., 1984; Cisek and Kalaska, 2004).  Here we found that the instantaneous subspace shifted briefly toward the subspace present at the time of instruction onset (I), similarly during execution and observation.  This brief trough in principal angle (Figure 4A) and the corresponding peak in classification accuracy (Figure 7A) in part may reflect smoothing of firing rates with a 50 ms Gaussian kernel.  We speculate, however, that the early rise of this peak at the time of instruction onset also reflects the anticipatory activity often seen in PMd neurons in expectation of an instruction, which may not be entirely non-specific, but rather may position the neural population to receive one of a limited set of potential instructions (Mauritz and Wise, 1986). We attribute the relatively low amplitude of peak classification accuracy for Instruction trajectory segments to the likely possibility that only the last 40 ms of our 100 ms Instruction segments captured signal related discharge.”

      (7) Figure 8. Shouldn't the plots in panel A resemble those in Figure 3? Here you are projecting the hold trajectory segments into the subspace at time H, which should be the same as in Fig. 3A/B bottom right panel. 

      The previous Figure 8 is now Figure 8 panels A and B, and the previous Figure 3 is now Figure 5.  The data used in these two figures come from two different recording sessions in two different monkeys. The current Figure 8A,B uses data from monkey F, session 2; whereas Figure 5 uses data from monkey T, session 3, which we now state in the legend to each figure, respectively.  Consequently, the relative arrangement of the trajectory segments in the instantaneous subspace at time H differs.  The session used in Figure 8A,B, which we now show in three dimensions, better illustrates how CCA identifies a common subspace in which execution versus observation segments show alignment (Figure 8B) that was not evident in their original subspaces (Figure 8A).

      (8) Page 14, line 369. Are you computing CCA using only 2 components? I thought the subspaces were 3 dimensional. Why not align all three dimensions? 

      We have expanded this analysis to use all three dimensions, as illustrated in Figure 8 above.

      (9) Page 14, line 407. Does this mean that instantaneous subspaces between execution and observation trials are more similar to each other during the Movement and Holding phase? Is this related to the fact that in those moments there is a smaller progressive shift of the subspaces within execution and observation trials? 

      Our new analyses of principal angles (see our reply to your comment 11, below) show that the progressive shifting of the instantaneous subspace continues through the movement and hold epochs.  We now discuss this better alignment of the Movement and Hold trajectory segments as follows (lines 656 to 664):

      “Given the complexity of condition-dependent neural trajectories across the entire time course of RGM trials (Figure 3B), rather than attempting to align entire neural trajectories, we applied canonical correlation to trajectory segments clipped for 100 ms following four well defined behavioral events: Instruction onset, Go cue, Movement onset, and the beginning of the final Hold.  In all cases, alignment was poorest for Instruction segments, somewhat higher for Go segments, and strongest for Movement and Hold segments.  This progressive increase in alignment likely reflects a progressive increase in the difference between average neuron firing rates for trials involving different objects (Figure 6) relative to the trial-by-trial variance in firing rate for a given object.”
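
      The alignment step can be sketched as follows. This is a minimal illustration of canonical correlation analysis computed with QR decompositions and an SVD, not the authors' implementation. The two synthetic "segments" share a latent 3-dimensional signal but are expressed in different, randomly rotated coordinates. The function name, sample counts, and noise level are assumptions chosen for illustration.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlation analysis between two (samples x dims) matrices.

    Returns the canonical correlations (largest first) and the canonical
    variates of X and Y, i.e. both data sets expressed in the aligned space.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(Xc)          # orthonormal basis for the columns of Xc
    qy, _ = np.linalg.qr(Yc)
    u, s, vt = np.linalg.svd(qx.T @ qy)
    return np.clip(s, 0.0, 1.0), qx @ u, qy @ vt.T

rng = np.random.default_rng(0)
n_samples = 4 * 100                   # hypothetical: 4 objects x 100 time points
latent = rng.normal(size=(n_samples, 3))

# Each context sees the same latent signal in its own rotated coordinates.
rot_exec, _ = np.linalg.qr(rng.normal(size=(3, 3)))
rot_obs, _ = np.linalg.qr(rng.normal(size=(3, 3)))
exec_seg = latent @ rot_exec + 0.05 * rng.normal(size=(n_samples, 3))
obs_seg = latent @ rot_obs + 0.05 * rng.normal(size=(n_samples, 3))

raw_corr = [np.corrcoef(exec_seg[:, i], obs_seg[:, i])[0, 1] for i in range(3)]
corrs, exec_cv, obs_cv = cca(exec_seg, obs_seg)
print("correlation per raw axis:      ", np.round(raw_corr, 2))
print("canonical correlations (3 dims):", np.round(corrs, 3))
```

      In this toy case the correlations along the original axes depend on the arbitrary rotations, whereas the canonical correlations are high in all three dimensions because the shared signal dominates the noise.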

      (10) page 15, line 431. Typo, it should be Table 3. 

      We have removed Table 3 which no longer applies.

      (11) A more general observation: did you try to compute another metric to assess the progressive shift of subspaces over time? I am thinking of something like computing the principal angles between consecutive subspaces. If it is true that the shifts happen over time, but it slows down during movement and hold, you should be able to conclude it from principal angles as well. Am I missing something? Is there any reason you went with classification accuracy instead of a metric like this?  

      Point taken.  We now have calculated the principal angles as a function of time and have presented them as a new section of the Results including new Figure 4 and Figure 4 – figure supplement 3 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
      To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”

      The related Methods are now described in the subsection “Subspace Comparisons—Principal Angles”.
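
      The computation described in the quoted passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the population states are random synthetic data, the population size is arbitrary, and the function names are hypothetical. PCA on 4 condition-averaged states yields a 3-dimensional subspace, and the principal angles are obtained from the singular values of the product of two orthonormal bases.

```python
import numpy as np

def instantaneous_subspace(states, n_dims=3):
    """Return an orthonormal PCA basis (neurons x n_dims) for one time step.

    `states` holds one condition-averaged population vector per object
    (objects x neurons). Centering across objects removes the
    condition-independent component before PCA.
    """
    centered = states - states.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dims].T

def principal_angles_deg(basis_a, basis_b):
    """Principal angles in degrees, smallest first, between two subspaces."""
    qa, _ = np.linalg.qr(basis_a)
    qb, _ = np.linalg.qr(basis_b)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))

rng = np.random.default_rng(0)
n_objects, n_neurons = 4, 50                            # hypothetical population size
states_ref = rng.normal(size=(n_objects, n_neurons))    # e.g. states at a reference time
states_other = rng.normal(size=(n_objects, n_neurons))  # states at an unrelated time

ref = instantaneous_subspace(states_ref)
same = principal_angles_deg(ref, instantaneous_subspace(states_ref))
other = principal_angles_deg(ref, instantaneous_subspace(states_other))
print("angles to itself (deg):          ", np.round(same, 3))
print("angles to unrelated states (deg):", np.round(other, 1))
```

      A subspace compared with itself gives angles of essentially 0°, while unrelated random states in a high-dimensional population give large angles, which is the kind of chance-level reference against which the troughs in principal angle are read. SciPy's `scipy.linalg.subspace_angles` computes the same quantity in radians.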

      Is there any reason you went with classification accuracy instead of a metric like this? 

      We now point out that (lines 295 to 297):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.”

      And we further clarify this as follows (lines 331 to 348):

      “Decodable information changes progressively during both execution and observation 

      As RGM trials proceeded in time, the condition-dependent neural activity of the PM MN population thus changed in two ways.  First, the instantaneous condition-dependent subspace shifted, indicating that the patterns of firing-rate co-modulation among neurons representing the four different RGM movements changed progressively, both during execution and during observation.  Second, as firing rates generally increased, the neural trajectories representing the four RGM movements became progressively more separated, more so during execution than during observation. 

      To evaluate the combined effects of these two progressive changes, we clipped 100 ms single-trial trajectory segments beginning at times I, G, M, or H, and projected these trajectory segments from individual trials into the instantaneous 3D subspaces at 50 ms time steps.  At each of these time steps, we trained a separate LSTM decoder to classify individual trials according to which of the four objects was involved in that trial.  We expected that the trajectory segments would be classified most accurately when projected into instantaneous subspaces near the time at which the trajectory segments were clipped.  At other times we reasoned that classification accuracy would depend both on the similarity of the current instantaneous subspace to that found at the clip time as evaluated by the principal angle (Figure 4), and on the separation of the four trajectories at the clip time (Figure 5).”
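
      The dependence of decoding on the projection subspace can be sketched with a deliberately simplified example. A leave-one-out nearest-centroid classifier is swapped in for the LSTM decoder used in the study, and the single-trial segments are synthetic. All names, sizes, and noise levels are assumptions. The sketch compares the two extremes: projection into the subspace that contains the condition-dependent signal (principal angles of 0°) and projection into a subspace orthogonal to it (principal angles of 90°).

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_objects, n_trials, n_time = 40, 4, 20, 100

# Hypothetical condition means (objects x neurons) and noisy single-trial segments
# (trials x time x neurons); the signal is constant within each 100-sample segment.
means = rng.normal(size=(n_objects, n_neurons))
labels = np.repeat(np.arange(n_objects), n_trials)
noise = rng.normal(scale=5.0, size=(labels.size, n_time, n_neurons))
segments = means[labels][:, None, :] + noise

def pca_basis(states, n_dims=3):
    """Orthonormal basis (neurons x n_dims) of the condition-dependent subspace."""
    centered = states - states.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dims].T

def decode_accuracy(segments, labels, basis):
    """Leave-one-out nearest-centroid accuracy after projection into a subspace."""
    features = (segments @ basis).reshape(len(labels), -1)  # trials x (time * dims)
    classes = np.unique(labels)
    correct = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i                  # hold out trial i
        centroids = np.stack(
            [features[keep & (labels == k)].mean(axis=0) for k in classes]
        )
        nearest = np.argmin(np.linalg.norm(centroids - features[i], axis=1))
        correct += int(classes[nearest] == labels[i])
    return correct / len(labels)

matched = pca_basis(means)
# Build a 3D basis orthogonal to the matched subspace by projecting it out.
random_dirs = rng.normal(size=(n_neurons, 3))
orthogonal, _ = np.linalg.qr(random_dirs - matched @ (matched.T @ random_dirs))

acc_matched = decode_accuracy(segments, labels, matched)
acc_orthogonal = decode_accuracy(segments, labels, orthogonal)
print(f"accuracy in matched subspace:    {acc_matched:.2f}")
print(f"accuracy in orthogonal subspace: {acc_orthogonal:.2f}")
```

      In the orthogonal subspace the projected condition means coincide, so accuracy falls to around the chance level of 1 in 4. Intermediate principal angles, together with the separation of the trajectories, would be expected to give intermediate accuracies.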

    1. eLife Assessment

      This important study provides solid evidence that glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, plays a role in the differentiation of intestinal cells. Mutations in GlcT compromise Notch signaling in the Drosophila intestinal stem cell lineage resulting in the formation of enteroendocrine tumors, and preliminary data suggests that a homolog of glucosylceramide synthase also influences Notch signaling in the mammalian intestine. While the outstanding strengths of the initial genetic and downstream pathway analyses are noted, there are weaknesses in the data regarding the potential role of this pathway in Delta trafficking. Nevertheless, this study opens the way for future mechanistic studies addressing how specific lipids modulate Notch signalling activity.

    2. Reviewer #1 (Public review):

      Summary:

      From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to failure of conversion of ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect the endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.

      Strengths:

      Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. I do, however, have a few comments below.

      Weaknesses:

      (1) The authors bring up the intriguing idea that GlcT could be a way to link diet to cell fate choice. Unfortunately, there are no experiments to test this hypothesis.

      (2) Why do the authors think that UGCG knockout results in goblet cell excess and not in the other secretory cell types?

      (3) The authors should cite other EMS mutagenesis screens done in the fly intestine.

      (4) The absence of a phenotype using NRE-Gal4 is not convincing. This is because the delay in its expression could be after the requirement for the affected gene in the process being studied. In other words, sufficient knockdown of GlcT by RNAi would not be achieved until after the relevant signaling between the EB and the ISC occurred. Dl-Gal4 is problematic as an ISC driver because Dl is expressed in the EEP.

      (5) The difference in Rab5 between control and GlcT-IR was not that significant. Furthermore, any changes could be secondary to increases in proliferation.

    3. Reviewer #2 (Public review):

      Summary:

      This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.

      Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady-state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.

      Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.

      Strengths:

      The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.

      The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.

      Weaknesses:

      This study is not, however, without caveats and several specific conclusions are not fully convincing.

      First, the conclusion that GlcT is specifically required in Intestinal Stem Cells (ISCs) is not fully convincing for technical reasons: NRE-Gal4 may be less active in GlcT mutant cells, and the knock-down of GlcT using Dl-Gal4ts may not be restricted to ISCs given the perdurance of Gal4 and of its downstream RNAi.

      Second, the results from the antibody uptake assays are not clear: i) the levels of internalized Delta were not quantified in these experiments; ii) additionally, live guts were incubated with anti-Delta for 3 hr. This long period of incubation implies that the observed results may not necessarily reflect the dynamics of endocytosis of antibody-bound Delta, but might also inform about the distribution of intracellular Delta following the internalization of unbound anti-Delta. It would thus be interesting to examine the level of internalized Delta in experiments with shorter incubation time. Overall, the proposed working model needs to be solidified as important questions remain open, including: Is the endo-lysosomal system, i.e. steady-state distribution of endo-lysosomal markers, affected by the Mac-Cer deficiency? Is the trafficking of Notch also affected by the Mac-Cer deficiency? Is the rate of Delta endocytosis also affected by the Mac-Cer deficiency? Are the levels of cell-surface Delta reduced upon the loss of Mac-Cer?

      Third, while the mouse results are potentially interesting, they seem to be relatively preliminary, and future studies are needed to test whether the level of Notch receptor activation is reduced in this model.

    4. Reviewer #3 (Public review):

      Summary:

      In this paper, Tang et al report the discovery of a glucosylceramide synthase gene, GlcT, which they found in a genetic screen for mutations that generate tumorous growth of stem cells in the gut of Drosophila. The screen was expertly done using a classic mutagenesis/mosaic method. Their initial characterization of the GlcT alleles, which generate endocrine tumors much like mutations in the Notch signaling pathway, is also very nice. Tang et al checked other enzymes in the glycosylceramide pathway and found that the loss of one gene just downstream of GlcT (Egh) gives similar phenotypes to GlcT, whereas three genes further downstream do not replicate the phenotype. Remarkably, dietary supplementation with a predicted GlcT/Egh product, Lactosyl-ceramide, was able to substantially rescue the GlcT mutant phenotype. Based on the phenotypic similarity of the GlcT and Notch phenotypes, the authors show that activated Notch is epistatic to GlcT mutations, suppressing the endocrine tumor phenotype and that GlcT mutant clones have reduced Notch signaling activity. Up to this point, the results are all clear, interesting, and significant. Tang et al then go on to investigate how GlcT mutations might affect Notch signaling, and present results suggesting that GlcT mutation might impair the normal endocytic trafficking of Delta, the Notch ligand. These results (Fig X-XX), unfortunately, are less than convincing; either more conclusive data should be brought to support the Delta trafficking model, or the authors should limit their conclusions regarding how GlcT loss impairs Notch signaling. Given the results shown, it's clear that GlcT affects EE cell differentiation, but whether this is via directly altering Dl/N signaling is not so clear, and other mechanisms could be involved. Overall the paper is an interesting, novel study, but it lacks somewhat in providing mechanistic insight. With conscientious revisions, this could be addressed. We list below specific points that Tang et al should consider as they revise their paper.

      Strengths:

      The genetic screen is excellent.

      The basic characterization of GlcT phenotypes is excellent, as is the downstream pathway analysis.

      Weaknesses:

      (1) Lines 147-149, Figure 2E: here, the study would benefit from quantitations of the effects of loss of brn, B4GalNAcTA, and a4GT1, even though they appear negative.

      (2) In Figure 3, it would be useful to quantify the effects of LacCer on proliferation. The suppression result is very nice, but only effects on Pros+ cell numbers are shown.

      (3) In Figure 4A/B we see less NRE-LacZ in GlcT mutant clones. Are the data points in Figure 4B per cell or per clone? Please note. Also, there are clearly a few NRE-LacZ+ cells in the mutant clone. How does this happen if GlcT is required for Dl/N signaling?

      (4) Lines 222-225, Figure 5AB: The authors use the NRE-Gal4ts driver to show that GlcT depletion in EBs has no effect. However, this driver is not activated until well into the process of EB commitment, and RNAi takes several days to work, and so the authors' conclusion that GlcT is "specifically required in ISCs" and not at all in EBs may be erroneous.

      (5) Figure 5C-F: These results relating to Delta endocytosis are not convincing. The data in Fig 5C are not clear and not quantitated, and the data in Figure 5F are so widely scattered that it seems these co-localizations are difficult to measure. The authors should either remove these data, improve them, or soften the conclusions taken from them. Moreover, it is unclear how the experiments tracing Delta internalization (Fig 5C) could actually work. This is because for this method to work, the anti-Dl antibody would have to pass through the visceral muscle before binding Dl on the ISC cell surface. To my knowledge, antibody transcytosis is not a common phenomenon.

      (6) It is unclear whether MacCer regulates Dl-Notch signaling by modifying Dl directly or by influencing the general endocytic recycling pathway. The authors say they observe increased Dl accumulation in Rab5+ early endosomes but not in Rab7+ late endosomes upon GlcT depletion, suggesting that the recycling endosome pathway, which retrieves Dl back to the cell surface, may be impaired by GlcT loss. To test this, the authors could examine whether recycling endosomes (marked by Rab4 and Rab11) are disrupted in GlcT mutants. Rab11 has been shown to be essential for recycling endosome function in fly ISCs.

      (7) It remains unclear whether Dl undergoes post-translational modification by MacCer in the fly gut. At a minimum, the authors should provide biochemical evidence (e.g., Western blot) to determine whether GlcT depletion alters the protein size of Dl.

      (8) It is unfortunate that GlcT doesn't affect Notch signaling in other organs of the fly. This brings into question the Delta trafficking model and the authors should note this. Also, the clonal marker in Figure 6C is not clear.

      (9) The authors state that loss of UGCG in the mouse small intestine results in a reduced ISC count. However, in Supplementary Figure C3, Ki67, a marker of ISC proliferation, is significantly increased in UGCG-CKO mice. This contradiction should be clarified. The authors might repeat this experiment using an alternative ISC marker, such as Lgr5.

    5. Author response:

      We would like to express our gratitude to all three reviewers for their time and valuable feedback on the manuscript. Below, we provide our point-by-point responses to their comments. Additionally, we summarize here the experiments we plan to conduct in accordance with the reviewers' suggestions:

      Revision plan 1. To include live imaging of Dl/Notch trafficking in normal and GlcT mutant ISCs.

      We agree that the effect of GlcT mutation on Dl trafficking was not convincingly demonstrated in our previous work. Although we attempted live imaging of the intestine using Dl tagged with GFP at its C-terminus, the fluorescent signal was regrettably too weak for reliable capture. In this revision, we will optimize the imaging conditions to determine if this issue can be resolved. Alternatively, we will transiently express GFP/RFP-tagged Dl in both normal and mutant ISCs to investigate the trafficking dynamics through live imaging.

      Revision plan 2. To update and improve the presentation of the data regarding the features of early/late/recycling endosomes in GlcT mutant ISCs.

      Our analysis of Rab5 and Rab7 endosomes in both normal and GlcT mutant ISCs revealed that Dl tends to accumulate in Rab5 endosomes in GlcT mutant ISCs. To strengthen our findings, we will include additional quantitative data and conduct further analysis on recycling endosomes labeled with Rab11-GFP. We acknowledge that this portion of the data is not entirely convincing, and in accordance with the reviewers' suggestions, we will revise our conclusions to present a more tempered interpretation.

      Revision plan 3. To include western blot analysis of Dl in normal and GlcT mutant ISCs.

      While we propose that MacCer may function as a component of lipid rafts, facilitating the anchorage of Dl on the membrane and its proper endocytosis, it is also possible that it acts as a substrate for the modification of Dl, which is essential for its functionality. To investigate this further, we will conduct Western blot analysis to determine whether the depletion of GlcT alters the protein size of Dl.

      Please find our detailed point-by-point responses below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to failure of conversion of ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect the endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.

      Strengths:

      Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. I do, however, have a few comments below.

      Weaknesses:

      (1) The authors bring up the intriguing idea that GlcT could be a way to link diet to cell fate choice. Unfortunately, there are no experiments to test this hypothesis.

      We indeed attempted to establish an assay to investigate the impact of various diets (such as high-fat, high-sugar, or high-protein diets) on the fate choice of ISCs. Subsequently, we intended to examine the potential involvement of GlcT in this process. However, we observed that the number or percentage of EEs varies significantly among individuals, even among flies with identical phenotypes subjected to the same nutritional regimen. We suspect that the proliferative status of ISCs and the turnover rate of EEs may significantly influence the number of EEs present in the intestinal epithelium, complicating the interpretation of our results. Consequently, we are unable to conduct this experiment at this time. The hypothesis suggesting that GlcT may link diet to cell fate choice remains an avenue for future experimental exploration.

      (2) Why do the authors think that UGCG knockout results in goblet cell excess and not in the other secretory cell types?

      This is indeed an interesting point. In the mouse intestine, it is well-documented that the knockout of Notch receptors or Delta-like ligands results in a classic phenotype characterized by goblet cell hyperplasia, with little impact on the other secretory cell types. This finding aligns very well with our experimental results, as we noted that the numbers of Paneth cells and enteroendocrine cells appear to be largely normal in UGCG knockout mice. By contrast, increases in other secretory cell types are typically observed under conditions of pharmacological inhibition of the Notch pathway.

      (3) The authors should cite other EMS mutagenesis screens done in the fly intestine.

      To our knowledge, the EMS screen on chromosome 2L conducted in Allison Bardin’s lab is the only one prior to this work; it led to two publications (Perdigoto et al., 2011; Gervais et al., 2019). We will include citations for both papers in the revised manuscript.

      (4) The absence of a phenotype using NRE-Gal4 is not convincing. This is because the delay in its expression could be after the requirement for the affected gene in the process being studied. In other words, sufficient knockdown of GlcT by RNAi would not be achieved until after the relevant signaling between the EB and the ISC occurred. Dl-Gal4 is problematic as an ISC driver because Dl is expressed in the EEP.

      We agree that the lack of an observable phenotype using NRE-Gal4 might be attributed to a delay in its expression, which could result in missing the critical window necessary for effective GlcT knockdown. Consequently, we cannot rule out the possibility that GlcT may also play a role in early EBs or EEPs. We will revise our manuscript to present a more cautious conclusion on this issue.

      (5) The difference in Rab5 between control and GlcT-IR was not that significant. Furthermore, any changes could be secondary to increases in proliferation.

      We agree that it is possible that the observed increase in proliferation could influence the number of Rab5+ endosomes, and we will temper our conclusions on this aspect accordingly. However, it is important to note that, although the difference in Rab5+ endosomes between the control and GlcT-IR conditions appeared mild, it was statistically significant and reproducible. As we have indicated earlier, we plan to further analyze Rab11+ endosomes, as this additional analysis may provide further support for our previous conclusions.

      Reviewer #2 (Public review):

      Summary:

      This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.

      Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady-state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.

      Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.

      Strengths:

      The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.

      The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.

      Weaknesses:

      This study is not, however, without caveats and several specific conclusions are not fully convincing.

      First, the conclusion that GlcT is specifically required in Intestinal Stem Cells (ISCs) is not fully convincing for technical reasons: NRE-Gal4 may be less active in GlcT mutant cells, and the knock-down of GlcT using Dl-Gal4ts may not be restricted to ISCs given the perdurance of Gal4 and of its downstream RNAi.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We will revise our manuscript to present a more cautious conclusion and explicitly describe this possibility in the updated version.

      Second, the results from the antibody uptake assays are not clear: i) the levels of internalized Delta were not quantified in these experiments; ii) additionally, live guts were incubated with anti-Delta for 3 hr. This long period of incubation implies that the observed results may not necessarily reflect the dynamics of endocytosis of antibody-bound Delta, but might also inform about the distribution of intracellular Delta following the internalization of unbound anti-Delta. It would thus be interesting to examine the level of internalized Delta in experiments with shorter incubation time.

      We thank the reviewer for these excellent questions. In our antibody uptake experiments, we noted that Dl reached its peak accumulation after a 3-hour incubation period. We recognize that quantifying internalized Dl would enhance our analysis, and we will include the corresponding statistical graphs in the revised version of the manuscript. In addition, we agree that during the 3-hour incubation, the potential internalization of unbound anti-Dl cannot be ruled out, as it may influence the observed distribution of intracellular Dl. To address this concern, we plan to supplement our findings with live imaging experiments to capture the dynamics of Dl endocytosis in GlcT mutant ISCs.

      Overall, the proposed working model needs to be solidified as important questions remain open, including: Is the endo-lysosomal system, i.e. steady-state distribution of endo-lysosomal markers, affected by the Mac-Cer deficiency? Is the trafficking of Notch also affected by the Mac-Cer deficiency? Is the rate of Delta endocytosis also affected by the Mac-Cer deficiency? Are the levels of cell-surface Delta reduced upon the loss of Mac-Cer?

      Regarding the impact on the endo-lysosomal system, this is indeed an important aspect to explore. While we did not conduct experiments specifically designed to evaluate the steady-state distribution of endo-lysosomal markers, our analyses utilizing Rab5-GFP overexpression and Rab7 staining did not indicate any significant differences in endosome distribution in MacCer deficient conditions. Moreover, we still observed high expression of the NRE-LacZ reporter specifically at the boundaries of clones in GlcT mutant cells (Fig. 4A), indicating that GlcT mutant EBs remain responsive to Dl produced by normal ISCs located right at the clone boundary. Therefore, we propose that MacCer deficiency may specifically affect Dl trafficking without impacting Notch trafficking.

      In our 3-hour antibody uptake experiments, we observed a notable decrease in cell-surface Dl, which was accompanied by an increase in intracellular accumulation. These findings collectively suggest that Dl may be unstable on the cell surface, leading to its accumulation in early endosomes.

      Third, while the mouse results are potentially interesting, they seem to be relatively preliminary, and future studies are needed to test whether the level of Notch receptor activation is reduced in this model.

      In the mouse small intestine, Olfm4 is a well-established target gene of the Notch signaling pathway, and its staining provides a reliable indication of Notch pathway activation. While we attempted to evaluate Notch activation using additional markers, such as Hes1 and NICD, we encountered difficulties, as the corresponding antibody reagents did not perform well in our hands. Despite these challenges, we believe that our findings with Olfm4 provide an important starting point for further investigation in the future.

      Reviewer #3 (Public review):

      Summary:

      In this paper, Tang et al report the discovery of a glycosylceramide synthase gene, GlcT, which they found in a genetic screen for mutations that generate tumorous growth of stem cells in the gut of Drosophila. The screen was expertly done using a classic mutagenesis/mosaic method. Their initial characterization of the GlcT alleles, which generate endocrine tumors much like mutations in the Notch signaling pathway, is also very nice. Tang et al checked other enzymes in the glycosylceramide pathway and found that the loss of one gene just downstream of GlcT (Egh) gives similar phenotypes to GlcT, whereas three genes further downstream do not replicate the phenotype. Remarkably, dietary supplementation with a predicted GlcT/Egh product, Lactosyl-ceramide, was able to substantially rescue the GlcT mutant phenotype. Based on the phenotypic similarity of the GlcT and Notch phenotypes, the authors show that activated Notch is epistatic to GlcT mutations, suppressing the endocrine tumor phenotype, and that GlcT mutant clones have reduced Notch signaling activity. Up to this point, the results are all clear, interesting, and significant. Tang et al then go on to investigate how GlcT mutations might affect Notch signaling, and present results suggesting that GlcT mutation might impair the normal endocytic trafficking of Delta, the Notch ligand. These results (Fig X-XX), unfortunately, are less than convincing; either more conclusive data should be brought to support the Delta trafficking model, or the authors should limit their conclusions regarding how GlcT loss impairs Notch signaling. Given the results shown, it's clear that GlcT affects EE cell differentiation, but whether this is via directly altering Dl/N signaling is not so clear, and other mechanisms could be involved. Overall the paper is an interesting, novel study, but it lacks somewhat in providing mechanistic insight. With conscientious revisions, this could be addressed.

      We list below specific points that Tang et al should consider as they revise their paper.

      Strengths:

      The genetic screen is excellent.

      The basic characterization of GlcT phenotypes is excellent, as is the downstream pathway analysis.

      Weaknesses:

      (1) Lines 147-149, Figure 2E: here, the study would benefit from quantitations of the effects of loss of brn, B4GalNAcTA, and a4GT1, even though they appear negative.

      We will incorporate the quantifications for the effects of the loss of brn, B4GalNAcTA, and a4GT1 in the updated Figure 2.

      (2) In Figure 3, it would be useful to quantify the effects of LacCer on proliferation. The suppression result is very nice, but only effects on Pros+ cell numbers are shown.

      We will add quantifications of the number of EEs per clone to the updated Figure 3.

      (3) In Figure 4A/B we see less NRE-LacZ in GlcT mutant clones. Are the data points in Figure 4B per cell or per clone? Please note. Also, there are clearly a few NRE-LacZ+ cells in the mutant clone. How does this happen if GlcT is required for Dl/N signaling?

      In Figure 4B, the data points represent the fluorescence intensity per single cell within each clone. It is true that a few NRE-LacZ+ cells can still be observed within the mutant clone; however, this does not contradict our conclusion. As noted, high expression of the NRE-LacZ reporter was specifically observed around the clone boundaries in MacCer deficient cells (Fig. 4A), indicating that the mutant EBs can normally receive Dl signal from the normal ISCs located at the clone boundary and activate the Notch signaling pathway. Therefore, we believe that, although affecting Dl trafficking, MacCer deficiency does not significantly affect Notch trafficking.

      (4) Lines 222-225, Figure 5AB: The authors use the NRE-Gal4ts driver to show that GlcT depletion in EBs has no effect. However, this driver is not activated until well into the process of EB commitment, and RNAi's take several days to work, so the authors' conclusion that GlcT is "specifically required in ISCs" and not at all in EBs may be erroneous.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We will revise our manuscript to present a more cautious conclusion and describe this possibility in the updated version.

      (5) Figure 5C-F: These results relating to Delta endocytosis are not convincing. The data in Fig 5C are not clear and not quantitated, and the data in Figure 5F are so widely scattered that it seems these co-localizations are difficult to measure. The authors should either remove these data, improve them, or soften the conclusions taken from them. Moreover, it is unclear how the experiments tracing Delta internalization (Fig 5C) could actually work. This is because for this method to work, the anti-Dl antibody would have to pass through the visceral muscle before binding Dl on the ISC cell surface. To my knowledge, antibody transcytosis is not a common phenomenon.

      We thank the reviewer for these insightful comments and suggestions. In our in vivo experiments, we observed increased co-localization of Rab5 and Dl in GlcT mutant ISCs, indicating that Dl trafficking is delayed at the transition to Rab7⁺ late endosomes, a finding that is further supported by our antibody uptake experiments. We acknowledge that the data presented in Fig. 5C are not fully quantified and that the co-localization data in Fig. 5F may appear somewhat scattered; therefore, we will include additional quantification and enhance the data presentation in the revised manuscript.

      Regarding the concern about antibody internalization, we appreciate this point. We currently do not know if the antibody reaches the cell surface of ISCs by passing through the visceral muscle or via other routes. Given that the experiment was conducted with fragmented gut, it is possible that the antibody may penetrate into the tissue through mechanisms independent of transcytosis.

      As mentioned earlier, we plan to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. Nevertheless, given the technical challenges and potential pitfalls associated with the assays, we agree that this part of the data is not fully convincing, and we will provide a more cautious conclusion in the revised manuscript.

      (6) It is unclear whether MacCer regulates Dl-Notch signaling by modifying Dl directly or by influencing the general endocytic recycling pathway. The authors say they observe increased Dl accumulation in Rab5+ early endosomes but not in Rab7+ late endosomes upon GlcT depletion, suggesting that the recycling endosome pathway, which retrieves Dl back to the cell surface, may be impaired by GlcT loss. To test this, the authors could examine whether recycling endosomes (marked by Rab4 and Rab11) are disrupted in GlcT mutants. Rab11 has been shown to be essential for recycling endosome function in fly ISCs.

      We agree that assessing the state of recycling endosomes, especially by using markers such as Rab11, would be valuable in determining whether MacCer regulates Dl-Notch signaling by directly modifying Dl or by influencing the broader endocytic recycling pathway. We will incorporate these experiments into our future experimental plans to further characterize Dl trafficking in GlcT mutant ISCs.

      (7) It remains unclear whether Dl undergoes post-translational modification by MacCer in the fly gut. At a minimum, the authors should provide biochemical evidence (e.g., Western blot) to determine whether GlcT depletion alters the protein size of Dl.

      While we propose that MacCer may function as a component of lipid rafts, facilitating Dl membrane anchorage and endocytosis, we also acknowledge the possibility that MacCer could serve as a substrate for protein modifications of Dl necessary for its proper function. Conducting biochemical analyses to investigate potential post-translational modifications of Dl by MacCer would indeed provide valuable insights. To address this, we will incorporate Western blot analysis into our experimental plan to determine whether GlcT depletion affects the protein size of Dl.

      (8) It is unfortunate that GlcT doesn't affect Notch signaling in other organs on the fly. This brings into question the Delta trafficking model and the authors should note this. Also, the clonal marker in Figure 6C is not clear.

      In the revised working model, we will explicitly specify that the events occur in intestinal stem cells. Regarding Figure 6C, we will delineate the clone with a white dashed line to enhance its clarity and visual comprehension.

      (9) The authors state that loss of UGCG in the mouse small intestine results in a reduced ISC count. However, in Supplementary Figure C3, Ki67, a marker of ISC proliferation, is significantly increased in UGCG-CKO mice. This contradiction should be clarified. The authors might repeat this experiment using an alternative ISC marker, such as Lgr5.

      Previous studies have indicated that dysregulation of the Notch signaling pathway can result in a reduction in the number of ISCs. While we did not perform a direct quantification of ISC numbers in our experiments, our Olfm4 staining, which serves as a reliable marker for ISCs, demonstrates a clear reduction in the number of positive cells in UGCG-CKO mice.

      The increased Ki67 signal we observed reflects enhanced proliferation in the transit-amplifying region, and it does not directly indicate an increase in ISC number. Therefore, in UGCG-CKO mice, we observe a decrease in the number of ISCs, while there is an increase in transit-amplifying (TA) cells (progenitor cells). This increase in TA cells is probably a secondary consequence of the loss of barrier function associated with the UGCG knockout.

    1. eLife Assessment

      TrASPr is an important contribution that leverages transformer models focused on regulatory regions to enhance predictions of tissue-specific splicing events. The evidence supporting the authors' claims is convincing, with rigorous analyses demonstrating improved performance relative to existing models, although some aspects of the evaluation would benefit from further clarification. This work will be of particular interest to researchers in computational genomics and RNA biology, as it offers both a refined predictive model and a new tool for designing RNA sequences for targeted splicing outcomes.

    2. Reviewer #1 (Public review):

      Summary:

      The authors propose a transformer-based model for the prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in the design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Another difference compared to Pangolin and SpliceAI which are focused on modeling individual splice junctions is the focus on modeling a complete alternative splicing event.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions.

      (4) The authors use their model for sequence design to optimize splicing outcomes, which is a novel application.

      Weaknesses:

      No weaknesses were identified by this reviewer, but I have the following comments:

      (1) I would be curious to see evidence that the model is learning position-specific representations.

      (2) The transformer encoders in TrASPr model sequences with a rather limited sequence size of 200 bp; therefore, for long introns, the model will not have good coverage of the intronic sequence. This is not expected to be an issue for exons.

      (3) In the context of sequence design, creating a desired tissue- or condition-specific effect would likely require disrupting or creating motifs for splicing regulatory proteins. In your experiments for neuronal-specific Daam1 exon 16, have you seen evidence for that? Most of the edits are close to splice junctions, but a few are further away.

      (4) For sequence design of a tissue- or condition-specific effect in neuronal-specific Daam1 exon 16, the upstream exonic splice junction had the most sequence edits. Is that a general observation? How about the relative importance of the four transformer regions in TrASPr prediction performance?

      (5) The idea of lightweight transformer models is compelling, and is widely applicable. It has been used elsewhere. One paper that came to mind in the protein realm:

      Singh, Rohit, et al. "Learning the language of antibody hypervariability." Proceedings of the National Academy of Sciences 122.1 (2025): e2418918121.

    3. Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of the evidence. Additionally, state-of-the-art models (SpliceAI and Pangolin) are reported to perform extremely poorly in some tasks, which is surprising in light of previous reports of their overall good prediction accuracy; the reasoning for this lack of performance compared to TrASPr is not explored.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational autoencoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks.

      Strengths:

      (1) A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      (2) Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      (1) Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      (2) As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models. For instance, Pangolin was apparently trained on a different dataset (Cardoso-Moreira et al. 2019), and using a different processing pipeline (based on SpliSER) than the ones used in this submission. As a result, the inferior performance of Pangolin reported here could potentially be due to subtle distribution shifts. The authors should add a discussion of the differences in the training set, and whether they affect your comparisons (e.g., in Figure 2). They should also consider adding a table summarizing the various datasets used in their previous work for training and testing. Publishing their training and testing datasets in an easy-to-use format would be a fantastic contribution to the community, establishing a common benchmark to be used by others.

      (3) Related to the previous point, as discussed in the manuscript, SpliceAI and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also, consider fine-tuning Pangolin on cassette exons only (as you do for your model).

      (4) L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases - thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      (5) L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.

      (6) L214, ablations of individual features are missing.

      (7) L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.

      (8) L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly, the complete failure of SpliceAI and Pangolin is shown in Figure 4d.

      (9) BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      (10) The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.

    4. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors propose a transformer-based model for the prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in the design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Another difference compared to Pangolin and SpliceAI which are focused on modeling individual splice junctions is the focus on modeling a complete alternative splicing event.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions.    

      (4) The authors use their model for sequence design to optimize splicing outcomes, which is a novel application.

      We wholeheartedly thank Reviewer #1 for these positive comments regarding the modeling approach we took to this task and the evaluations we performed. We have put a lot of work and thought into this and it is gratifying to see the results of that work acknowledged like this.

      Weaknesses:

      No weaknesses were identified by this reviewer, but I have the following comments:

      (1) I would be curious to see evidence that the model is learning position-specific representations.

      This is an excellent suggestion to further assess what the model is learning. We have several ideas on how to test this, which we plan to report in the revised version.

      (2) The transformer encoders in TrASPr model sequences with a rather limited sequence size of 200 bp; therefore, for long introns, the model will not have good coverage of the intronic sequence. This is not expected to be an issue for exons.

      Yes, we can stratify predictions by intron length; that’s a good suggestion. We will report on that in the revision.
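
      As a rough illustration of the stratified analysis discussed here, the following sketch groups cassette exon events by intron length and reports the correlation between observed and predicted PSI within each group. It is not taken from the manuscript: the bin edges, the `(intron_len, observed_psi, predicted_psi)` tuple layout, and all function names are assumptions chosen for the example, and Pearson correlation stands in for whichever metric the authors use.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def length_bin(intron_len, edges=(1_000, 10_000, 100_000)):
    """Index of the first bin whose upper edge exceeds intron_len (hypothetical edges)."""
    for i, edge in enumerate(edges):
        if intron_len < edge:
            return i
    return len(edges)


def correlation_by_intron_length(events, edges=(1_000, 10_000, 100_000)):
    """events: iterable of (intron_len, observed_psi, predicted_psi) tuples.

    Returns {bin_index: Pearson r} for every bin holding at least two events.
    """
    groups = defaultdict(lambda: ([], []))
    for intron_len, observed, predicted in events:
        obs_list, pred_list = groups[length_bin(intron_len, edges)]
        obs_list.append(observed)
        pred_list.append(predicted)
    return {
        b: pearson(obs, pred)
        for b, (obs, pred) in sorted(groups.items())
        if len(obs) >= 2
    }
```

      Running the same function on the predictions of each model being compared would give a per-bin view of where performance diverges as introns get longer.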

      (3) In the context of sequence design, creating a desired tissue- or condition-specific effect would likely require disrupting or creating motifs for splicing regulatory proteins. In your experiments for neuronal-specific Daam1 exon 16, have you seen evidence for that? Most of the edits are close to splice junctions, but a few are further away.

      That is another good question and suggestion. In the original paper describing the mutation locations, some motif similarities to PTB (CU) and CUG/Mbnl-like elements were noted (Barash et al., Nature 2010). We could revisit this now with an RBP motif database such as http://rbpdb.ccbr.utoronto.ca/. We note that ENCODE uses human cell lines and so cannot be used for this, but we will also look for mouse CLIP and KD data supporting such regulatory findings.

      (4) For sequence design of a tissue- or condition-specific effect in neuronal-specific Daam1 exon 16, the upstream exonic splice junction had the most sequence edits. Is that a general observation? How about the relative importance of the four transformer regions in TrASPr prediction performance?

      This is another excellent question that we plan to follow up with matching analysis in the revision.

      (5) The idea of lightweight transformer models is compelling, and is widely applicable. It has been used elsewhere. One paper that came to mind in the protein realm:

      Singh, Rohit, et al. "Learning the language of antibody hypervariability." Proceedings of the National Academy of Sciences 122.1 (2025): e2418918121.

      Yes, we are certainly not the only or the first group to advocate for such an approach. We will be sure to make that point clear in the revision and thank the reviewer for the example from a different domain.

      Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of the evidence. Additionally, state-of-the-art models (SpliceAI and Pangolin) are reported to perform extremely poorly in some tasks, which is surprising in light of previous reports of their overall good prediction accuracy; the reasoning for this lack of performance compared to TrASPr is not explored.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational autoencoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks.

      We thank Reviewer #2 for this detailed summary and positive view of our work. It seems the main issue raised in this summary regards the evaluations: the reviewer finds details of the evaluations missing and finds it surprising that SpliceAI and Pangolin perform poorly on some of the tasks. In general, we made a concerted effort to include the required details, including code and data tables, but will be sure to include more details based on the specific questions/comments listed below. As for the perceived performance issues for Pangolin/SpliceAI, we believe this may be the result of our not making clear which tasks they perform well on and which they do not. We give more details below.

      Strengths:

      (1) A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      (2) Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      (1) Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      We made an effort to make the tasks specific and detailed, including making the code and data for them available. Still, it is evident from the above comment that Reviewer #2 found this to be lacking. We will review the description and make an effort to improve it given the clarifications we include below.

      (2) As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models. For instance, Pangolin was apparently trained on a different dataset (Cardoso-Moreira et al. 2019), and using a different processing pipeline (based on SpliSER) than the ones used in this submission. As a result, the inferior performance of Pangolin reported here could potentially be due to subtle distribution shifts. The authors should add a discussion of the differences in the training set, and whether they affect your comparisons (e.g., in Figure 2). They should also consider adding a table summarizing the various datasets used in their previous work for training and testing. Publishing their training and testing datasets in an easy-to-use format would be a fantastic contribution to the community, establishing a common benchmark to be used by others.

      There are several good points to unpack here. First, we agree that a standard benchmark will be useful to include. We will work to create and include one for the revision. That said, we note that unlike the example given by Reviewer #2 (ImageNet) there are no standards for the splicing prediction tasks. There are actually different task definitions with different input/outputs as we tried to cover briefly in the introduction section. 

      Second, regarding the use of different data and distribution shifts as potential reasons for the Pangolin performance differences: we originally evaluated Pangolin after retraining it with MAJIQ-based quantifications and found no significant changes. We will include a more detailed analysis of Pangolin retrained like this in the revision. We also note that Pangolin's original training involved significantly more data, as it was trained on four species with four tissues each, and we only evaluated it on three of those tissues (for human), on exons the authors deemed test data. That said, we very much agree that retraining Pangolin as mentioned above is warranted, as well as clearly listing what data was used for training as suggested by the reviewer.

      (3) Related to the previous point, as discussed in the manuscript, SpliceAI and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also, consider fine-tuning Pangolin on cassette exons only (as you do for your model).

      As mentioned above, we originally did try to retrain Pangolin with MAJIQ PSI values without observing much difference, but we will repeat this and include the results in the revision. Combining four different SpliceAI models as proposed by the reviewer seems to amount to a new kind of model, one that takes four large ResNets and combines them with annotation. Related to that, we did try to replace the transformers in our ablation study. The reviewer’s suggestion seems like another interesting architecture to try, but since this model does not currently exist, it would likely require some adjustments. Given that, we view adding such a new model architecture as beyond the scope of this work.
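
      To make the two conversion strategies in this exchange concrete, here is a minimal sketch. The first function assumes the baseline is a plain average of the acceptor and donor splice-site probabilities of the cassette exon, which the reviewer flags as uncertain. The second is a single logistic unit over four splice-site scores, used only as a simplified stand-in for the MLP the reviewer proposes. The function names, the ordering of the four scores, and the weights are all hypothetical; real weights would have to be fit to PSI data.

```python
from math import exp


def psi_by_averaging(p_acceptor, p_donor):
    """Baseline: average the 3' (acceptor) and 5' (donor) probabilities of the cassette exon."""
    for p in (p_acceptor, p_donor):
        if not 0.0 <= p <= 1.0:
            raise ValueError("splice-site probabilities must lie in [0, 1]")
    return (p_acceptor + p_donor) / 2.0


def psi_by_learned_combination(site_scores, weights, bias):
    """Logistic combination of four splice-site scores (a one-unit stand-in for an MLP).

    site_scores: scores for the four splice sites of the event, in any fixed order.
    weights, bias: hypothetical parameters that would need to be learned from data.
    """
    if len(site_scores) != 4 or len(weights) != 4:
        raise ValueError("expected exactly four scores and four weights")
    z = bias + sum(w * s for w, s in zip(weights, site_scores))
    return 1.0 / (1.0 + exp(-z))
```

      Comparing the two on held-out cassette exons would be one way to probe whether gains come from using all four splice sites rather than from the underlying sequence model.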

      (4) L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases - thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      Yes, that is a good suggestion, similar to one made by Reviewer #1 as well. We plan to include such analysis in the revision. 

      (5) L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.

      Previous models were not trained exclusively on constitutive exons, and Pangolin specifically was trained with its authors' version of junction usage across tissues. That said, the reviewer's point about the need for matched training and testing is valid (and similar to ones made above). As noted above, we plan to include Pangolin trained on our PSI values for comparison.

      (6) L214, ablations of individual features are missing.

      OK

      (7) L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.

      The task here was to assess predictions in very different conditions, hence we tested on completely different data from human cell lines rather than on similar tissue samples. We can also assess performance on unseen GTEx tissues.

      (8) L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly, the complete failure of SpliceAI and Pangolin is shown in Figure 4d.

      Line 239 refers to predicting relative inclusion levels between competing 3’ and 5’ splice sites. We admit we too expected this to be better for SpliceAI and Pangolin and will be sure to recheck for bugs, but to be fair we are not aware of a similar assessment being done for either of those algorithms (i.e. relative inclusion for 3’ and 5’ alternative splice site events).

      One issue we ran into, reflected in Reviewer #2's comments, is the mix between tasks that SpliceAI and Pangolin excel at and other tasks where they should not necessarily be expected to excel. Both algorithms focus on cryptic splice site creation/disruption. This has been the focus of those papers and subsequent applications. While Pangolin added tissue specificity to SpliceAI training, the authors themselves admit “...predicting differential splicing across tissues from sequence alone is possible but remains a considerable challenge and requires further investigation”. The actual performance on this task is not included in Pangolin’s main text, but we refer Reviewer #2 to supplementary figure S4 in that manuscript to get a sense of Pangolin’s reported performance on this task. Similarly, Figure 4d is for predicting *tissue specific* regulators. We do not think it is surprising that SpliceAI (tissue agnostic) and Pangolin (a slight improvement over SpliceAI in tissue specific predictions) do not perform well on this task. Nor do we find the results in Figure 4C surprising. These are for mutations that slightly alter the inclusion level of an exon, not something SpliceAI was trained on, as it was simply trained on splice site yes/no predictions. As noted, and as we will stress in the revision, training Pangolin on this dataset like TrASPr gives similar performance. That is to be expected as well: Pangolin is constructed to capture changes in PSI, those changes are not even tissue specific for the CD19 data, and the model has no lack of capacity to generalize from the training set, just as TrASPr does. In fact, if one uses only combinations of known mutations seen during training, a simple regression model gives a correlation of ~92-95% (Cortés-López et al 2022).

      In summary, we believe that a better understanding of what one can realistically expect from models such as SpliceAI, Pangolin, and TrASPr will go a long way toward their effective use. We will try to improve on that in the revision.

      (9) BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      We thank the reviewer for the suggestion. We agree those are two distinct contributions and we indeed considered having them as two separate papers. However, there is strong coupling between the design algorithm (BOS) and the predictor that enables it (TrASPr). This coupling is both conceptual (TrASPr as a “teacher”) and practical in terms of evaluations. While we use experimental data (experiments done involving Daam1 exon 16, CD19 exon 2) we still rely heavily on evaluations by TrASPr itself. A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper. For those reasons we eventually decided to make it into what we hope is a more compelling combined story about generative models for prediction and design of RNA splicing. 

      (10) The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.

      We can definitely see the logic behind trying BOS with different predictors. That said, as we note above most of BOS evaluations are based on the “teacher”. As such, it is unclear what value replacing the teacher would bring. We also note that given this limitation we focus mostly on evaluations in comparison to existing approaches (genetic algorithm or random mutations as a strawman).

    1. eLife Assessment

      Fleming et al sought to better understand DNAJC7's function in motor neurons as mutations in this gene have been associated with amyotrophic lateral sclerosis (ALS). Using iPSC-derived motor neurons, interactome, and transcriptomic data, they provide solid evidence that loss-of-function mutations in DNAJC7 disrupt RNA binding proteins and resistance to proteasomal stress. These important findings advance our understanding of DNAJC7 in motor neurons while providing clues to how its loss may be causal for ALS; nonetheless, the experiments were performed with a single iPSC line, while at least 3 are deemed to be required to validate the results. Furthermore, the mechanistic evidence is still incomplete with respect to how DNAJC7 mutations lead to HSF1 impaired activity, and whether it is direct or not.

    2. Reviewer #1 (Public review):

      Summary

      Fleming et al. present the first, proteomics-based attempt to identify the possible mechanism of action of ALS-linked DNAJC7 molecular chaperone in pathology. Impressively, it is the first report of DNAJC7 interactome studies, using a suitable iPSC-derived lower motor neuron model. Using a co-immunoprecipitation approach the authors identified that the interactome of DNAJC7 is predominantly composed of proteins engaged in response to stress, but also that this interactome is enriched in RNA-binding proteins. The authors also created a DNAJC7 haploinsufficiency cellular model and show the resulting increased insolubility of HNRNPU protein which causes disruptions in its functionality as shown by analysis of its transcriptional targets. Finally, this study uses pharmacological agents to test the effect of decreased DNAJC7 expression on cell response to proteotoxic stress and finds evidence that DNAJC7 regulates the activation of Heat shock factor 1 (HSF1) protein upon stress conditions.

      Strengths

      (1) This study uses the best so far model to study the interactome and possible mechanism of action of DNAJC7 molecular chaperone in an iPSC-derived cellular model of motor neurons. Furthermore, the authors also looked into available transcriptome databases of ALS patient samples to further test whether their findings may yield relevance to pathology.

      (2) The extent to which the authors are explicit about the sample sizes, protocols, and statistical tests used throughout this manuscript, should be applauded. This will help the whole field in their efforts to reliably replicate the results in this study.

      Weaknesses

      (1) The most significant caveat of interactome experiments inherently comes from the method of choice. It is possible that by using the co-purification approach of DNAJC7 IP the resulting pool of binding partners is depleted in proteins that interact with DNAJC7 weakly or transiently. An alternative approach presumably more sensitive towards weaker binders could use the TurboID-based proximity-labeling method.

      (2) The authors mention in Results (and Figure 2D) that HNRNPA1 was identified as DNAJC7-interacting protein in their co-IP experiments, however, an identifier for this protein cannot be found in Figure 1C and Table S1 listing the proteomics results. Could the authors appropriately update Figure 1C and Table S1, or if HNRNPA1 wasn't really a hit then remove it from listed HNRNPs?

      (3) No further validation of DNAJC7-interacting proteins from the heat-shock protein (HSP) family. Current validation of mass spectrometry-identified proteins comes from IP-western blots with antibodies against HSPs. It would be interesting to further inspect possible interactions of these proteins by inspecting co-localization with immunocytochemistry.

      (4) Similarly, the observation of DNAJC7 haploinsufficiency causing an increase in HNRNPU insolubility could be also easily further confirmed by checking for the emergence of "puncta" under a fluorescence microscope, in addition to provided WB experiments from MN lysates.

      (5) I would like to recommend the authors to also provide with this manuscript a complete dataset (possibly in the form of a table, presented similarly as Table S1) resulting from experiments presented in Figures 2F and S2D. The information on upregulated and downregulated targets in their DNAJC7 haploinsufficiency model would be a valuable resource for the field and enable further investigations.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript titled "The ALS-associated co-chaperone DNAJC7 mediates neuroprotection against proteotoxic stress by modulating HSF1 activity" describes experiments carried out in iPS cells re-differentiated into motor neurons (iNeurons, MNs) seeking to assess the functions of the J protein DnaJC7 in proteostasis. This study also investigates how an ALS-associated mutant variant (R156X) alters DnaJC7 function.

      The proteomic studies identify proteins interacting with DnaJC7. Using mRNA profiling in haplo-insufficient cells (+/R156X) compared to wild-type cells, the study seeks to identify pathways modulated by partial loss of DnaJC7 function. Studies in the DnaJC7 haplo-insufficient cells also indicate changes in the properties of ALS-associated proteins, such as HNRNPU and Matrin3, both of which are involved in the regulation of gene expression. The study also shows data indicating that DnaJC7 haploinsufficiency sensitizes cells to proteostatic stress induced by proteasome inhibition by MG132 and Hsp90 inhibition by Ganetespib. Lastly, the study investigates how DnaJC7 modulates the activity of the heat shock transcription factor (Hsf1) and thus the heat shock response.

      Strengths:

      The manuscript is well presented and most of the data is of high quality and convincing. The figures and supplementary figures are clear and easy to follow.

      This study overall provides important new insights into a mostly underexplored molecular co-chaperone and its role in proteostasis. The proteomic and transcriptomic experiments certainly advance our understanding of DnaJC7. The MN model is well-suited for these studies addressing the role of DnaJC7, particularly regarding ALS. The haplo-insufficient MNs are also a suitable model to study a potential loss of function mechanism caused by (some) fALS-associated mutants in ALS, such as the R156X mutation used here.

      Since so little is known about DnaJC7 function, the exploratory approaches applied here are particularly useful.

      Weaknesses:

      Without follow-up studies, however, e.g., with select interacting proteins, the study provides merely a descriptive list of possible interactions without mechanistic insights. Also, most interactions have not been extensively (only a few examples) validated by other methods or individual experiments.

      A major limitation of the study in its current form is that none of the experimental approaches allow for assessing the specific functions of JC7. In the absence of specificity controls, e.g., other J proteins or HOP, which, like DnaJC7, contains TPR domains and can interact with Hsp70 and Hsp90, it remains unclear if the proposed functions of DnaJC7 are specific/unique or shared by other J proteins or molecular chaperones. Accordingly, it would be highly informative to add experiments to assess if some of the reported DnaJC7 protein-protein interactions and the transcriptional alterations in haplo-insufficient cells are DnaJC7-specific or also occur with other J proteins or molecular chaperones. This seems particularly important to discern specific DnaJC7 functions from general effects caused by impaired proteostasis.

      It would be informative to explore how cellular stress (e.g., MG132 treatment) alters DnaJC7 interactions with other proteins (J proteins, HOP), ideally in additional/comparative proteomic studies.

      The mechanism underlying the proposed regulation of Hsf1 by DnaJC7 is not quite clear to me (Figures 4 A-I). There is no evidence of a direct physical interaction between DnaJC7 and Hsf1 in the proteomic data or elsewhere. It seems plausible that Hsf1/HSR dysregulation in the haplo-insufficient cells might be due to rather indirect effects, e.g., increased protein misfolding. Also, additional data showing differential activation of Hsf1 in +/+ versus +/- cells would strengthen this part, e.g. showing differences in Hsf1 trimerization, Hsp70 interactions, nuclear localization, etc.

      The manuscript might also benefit from considering the literature showing an unusually inactive HSR and Hsf1 activity in motor neurons (e.g. published by the Durham lab).

      The correlation with transcriptomic data from ALS patients compared to neurotypical controls (Figures 4 L, M) suggesting a direct role of Hsf1/HSR seems unlikely at this point. In my view, the transcriptional dysregulation in ALS patients could be unrelated to Hsf1 dysregulation and caused by rather non-specific effects of neuronal decay in ALS.

    4. Reviewer #3 (Public review):

      Summary:

      Fleming et al sought to better understand DNAJC7's function in motor neurons as mutations in this gene have been associated with amyotrophic lateral sclerosis (ALS). The research question is relevant and important. The authors use an induced pluripotent stem cell (iPSC) line to derive motor neurons (iMNs) finding that DNAJC7 interacts with RNA-binding proteins (RBP) in wild-type cells and a truncated mutant DNAJC7[R156*] disrupts the RBP, hnRNPU, by promoting its accumulation into insoluble fractions. Given that DNAJC7 is predicted to regulate stress responses, the authors then find that DNAJC7[R156*] expression sensitizes the iMNs to proteasomal stress by disrupting the expression of the key heat stress response regulator, HSF1. These findings support that loss-of-function mutations in DNAJC7 will indeed sensitize motor neurons to proteotoxic stress, potentially driving ALS. The association with RBPs, which routinely are found to be disrupted in ALS, is of interest and warrants further study.

      Strengths:

      (1) The research question is relevant and important. The authors provide interesting data that DNAJC7 mutations impact two important features in ALS, the dysregulation of RNA binding proteins and the sensitivity of motor neurons to proteotoxic stress.

      (2) The authors provide solid data to support their findings and the assays are appropriate.

      Weaknesses:

      (1) The authors rely on a single iPSC line throughout the text, using the same line to make the mutation-carrying cells. iPSCs are highly variable and at minimum 3 lines, typically 5 lines, should be used to define consistent findings. This work would be greatly strengthened if 3 or more lines were used to confirm consistent effects. This is particularly concerning given that iPSCs were differentiated using growth factors versus genetic induction. Growth-factor-based differentiations are more variable.

      (2) The authors argue that HSF1 and its targets are downregulated in sporadic ALS and mutant C9orf72 ALS. The first concern is that these transcriptomics data were derived from cortical tissue which does not contain motor neurons (Pineda et al. 2024 Cell 187: 1971-1989.e1916). The second concern is that the inclusion of C9orf72 mutant tissue is not well justified as (1) this mutation is associated with an upregulation of HSF1 and its targets in patients (Mordes et al, Acta Neuropathol Commun 2018 6(1):55; Lee et al Neuron 2023 111(9):1381-1390) and (2) the C9orf72 mutation is associated with an ALS/FTD spectrum disorder defined by TDP-43 pathology. Disease mechanisms associated with this spectrum disorder may not overlap with traditional ALS which is typically defined by SOD1 pathology.

      (3) As a whole, the findings are mechanistically disjointed, and additional experiments or discussion would help to connect the dots a bit more.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      Fleming et al. present the first, proteomics-based attempt to identify the possible mechanism of action of ALS-linked DNAJC7 molecular chaperone in pathology. Impressively, it is the first report of DNAJC7 interactome studies, using a suitable iPSC-derived lower motor neuron model. Using a co-immunoprecipitation approach the authors identified that the interactome of DNAJC7 is predominantly composed of proteins engaged in response to stress, but also that this interactome is enriched in RNA-binding proteins. The authors also created a DNAJC7 haploinsufficiency cellular model and show the resulting increased insolubility of HNRNPU protein which causes disruptions in its functionality as shown by analysis of its transcriptional targets. Finally, this study uses pharmacological agents to test the effect of decreased DNAJC7 expression on cell response to proteotoxic stress and finds evidence that DNAJC7 regulates the activation of Heat shock factor 1 (HSF1) protein upon stress conditions.

      Strengths

      (1) This study uses the best so far model to study the interactome and possible mechanism of action of DNAJC7 molecular chaperone in an iPSC-derived cellular model of motor neurons. Furthermore, the authors also looked into available transcriptome databases of ALS patient samples to further test whether their findings may yield relevance to pathology.

      (2) The extent to which the authors are explicit about the sample sizes, protocols, and statistical tests used throughout this manuscript, should be applauded. This will help the whole field in their efforts to reliably replicate the results in this study.

      We thank the reviewer for highlighting the strengths of our study.

      Weaknesses

      (1) The most significant caveat of interactome experiments inherently comes from the method of choice. It is possible that by using the co-purification approach of DNAJC7 IP the resulting pool of binding partners is depleted in proteins that interact with DNAJC7 weakly or transiently. An alternative approach presumably more sensitive towards weaker binders could use the TurboID-based proximity-labeling method.

      The reviewer raises a valid point that TurboID-based proximity biotinylation could be a more sensitive approach for identifying DNAJC7 protein-protein interactions compared to IP-MS. We agree that this strategy could be better suited to detect weak or transient interactions, and we have previously used it to characterize protein nanoenvironments and interactomes in vitro and in vivo (Wang et al. Mol Psychiatry 2024, Quan et al. mBio 2024). However, proximity biotinylation also has significant limitations, such as potential artifacts due to overexpression and high background levels. We selected the IP-MS approach to identify DNAJC7 binding partners in neurons without the need to genetically modify or overexpress DNAJC7.

      (2) The authors mention in Results (and Figure 2D) that HNRNPA1 was identified as DNAJC7-interacting protein in their co-IP experiments, however, an identifier for this protein cannot be found in Figure 1C and Table S1 listing the proteomics results. Could the authors appropriately update Figure 1C and Table S1, or if HNRNPA1 wasn't really a hit then remove it from listed HNRNPs?

      We apologize for the confusion. HNRNPA1 was pulled down exclusively with DNAJC7 in 2/3 independent experiments and was initially included in our list of targets. However, in our final and most stringent analysis we only considered proteins that appeared in 3/3 experiments and thus HNRNPA1 was filtered out of Figure 1C and Table S1. We will therefore remove it from Figure 2D in the revised manuscript.
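
      The stringency filter described above (retain only proteins recovered in 3/3 independent experiments) can be sketched as follows. This is a minimal illustration, assuming each experiment is represented as a set of protein identifiers; the function name `reproducible_hits` and the example hit lists are hypothetical placeholders, not the study's data or analysis code.

```python
from collections import Counter

def reproducible_hits(experiments, min_count):
    """Return identifiers seen in at least `min_count` of the experiments.

    experiments: iterable of collections of protein identifiers, one per
                 independent pull-down (hypothetical input format).
    """
    # Count each protein once per experiment, even if listed repeatedly.
    counts = Counter(p for exp in experiments for p in set(exp))
    return sorted(p for p, n in counts.items() if n >= min_count)

# Hypothetical hit lists for illustration only.
exp1 = {"HSPA1A", "HSP90AB1", "HNRNPU", "HNRNPA1"}
exp2 = {"HSPA1A", "HSP90AB1", "HNRNPU", "HNRNPA1"}
exp3 = {"HSPA1A", "HSP90AB1", "HNRNPU"}

strict = reproducible_hits([exp1, exp2, exp3], min_count=3)
relaxed = reproducible_hits([exp1, exp2, exp3], min_count=2)
```

      In this toy input, a protein present in only two of the three sets passes the relaxed threshold but is dropped by the strict one, which mirrors how a 2/3 hit can appear in an early list yet be absent from the final table.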

      (3) No further validation of DNAJC7-interacting proteins from the heat-shock protein (HSP) family. Current validation of mass spectrometry-identified proteins comes from IP-western blots with antibodies against HSPs. It would be interesting to further inspect possible interactions of these proteins by inspecting co-localization with immunocytochemistry.

      As the reviewer points out we did in fact validate the interaction of DNAJC7 with HSP90 and HSP70 (HSP90AB1 and HSPA1A) by IP-WB as shown in Fig 1F. We agree that examining co-localization of these proteins by immunocytochemistry (ICC) would be important to investigate. However, we have been unable to do this due to technical limitations. Specifically, we have tried to perform ICC using 6 commercially available DNAJC7 antibodies and have so far been unsuccessful. In our hands the DNAJC7 ICC signal appears to be non-specific as it is not reduced when using DNAJC7 knockout and knockdown cells as controls.

      (4) Similarly, the observation of DNAJC7 haploinsufficiency causing an increase in HNRNPU insolubility could be also easily further confirmed by checking for the emergence of "puncta" under a fluorescence microscope, in addition to provided WB experiments from MN lysates.

      This is a good suggestion, and we can assess the emergence of HNRNPU "puncta" by ICC in DNAJC7 mutant iPSC-derived neurons and/or postmortem sporadic ALS patient tissue.

      (5) I would like to recommend the authors to also provide with this manuscript a complete dataset (possibly in the form of a table, presented similarly as Table S1) resulting from experiments presented in Figures 2F and S2D. The information on upregulated and downregulated targets in their DNAJC7 haploinsufficiency model would be a valuable resource for the field and enable further investigations.

      This is a good suggestion and in the revised version we will provide in Table S2 the dataset presented in Figs. 2F and S2D.

      Reviewer #2 (Public review):

      Summary:

      The manuscript titled "The ALS-associated co-chaperone DNAJC7 mediates neuroprotection against proteotoxic stress by modulating HSF1 activity" describes experiments carried out in iPS cells re-differentiated into motor neurons (iNeurons, MNs) seeking to assess the functions of the J protein DnaJC7 in proteostasis. This study also investigates how an ALS-associated mutant variant (R156X) alters DnaJC7 function. The proteomic studies identify proteins interacting with DnaJC7. Using mRNA profiling in haplo-insufficient cells (+/R156X) compared to wild-type cells, the study seeks to identify pathways modulated by partial loss of DnaJC7 function. Studies in the DnaJC7 haplo-insufficient cells also indicate changes in the properties of ALS-associated proteins, such as HNRNPU and Matrin3, both of which are involved in the regulation of gene expression. The study also shows data indicating that DnaJC7 haploinsufficiency sensitizes cells to proteostatic stress induced by proteasome inhibition by MG132 and Hsp90 inhibition by Ganetespib. Lastly, the study investigates how DnaJC7 modulates the activity of the heat shock transcription factor (Hsf1) and thus the heat shock response.

      Strengths

      (1) The manuscript is well presented and most of the data is of high quality and convincing. The figures and supplementary figures are clear and easy to follow.

      (2) This study overall provides important new insights into a mostly underexplored molecular co-chaperone and its role in proteostasis. The proteomic and transcriptomic experiments certainly advance our understanding of DnaJC7. The MN model is well-suited for these studies addressing the role of DnaJC7, particularly regarding ALS. The haplo-insufficient MNs are also a suitable model to study a potential loss of function mechanism caused by (some) fALS-associated mutants in ALS, such as the R156X mutation used here.

      (3) Since so little is known about DnaJC7 function, the exploratory approaches applied here are particularly useful.

      We thank the reviewer for highlighting the strengths of our study.

      Weaknesses

      (1) Without follow-up studies, however, e.g., with select interacting proteins, the study provides merely a descriptive list of possible interactions without mechanistic insights. Also, most interactions have not been extensively (only a few examples) validated by other methods or individual experiments.

      We appreciate the reviewer's concern and agree that there are several intriguing DNAJC7 interactors worth studying further, which is why we wanted to share this resource with the broader community as quickly as possible. As this is the first study focused on DNAJC7 and its link to ALS, we could not investigate multiple potential interactors and focused on two: HNRNPU and HSP70/HSP90, associated with RNA metabolism and the stress response respectively, as these two pathways have previously been implicated in ALS pathogenesis. We do provide validation of these interactions and some mechanistic insight into how DNAJC7 haploinsufficiency impairs their function.

      A major limitation of the study in its current form is that none of the experimental approaches allow for assessing the specific functions of JC7. In the absence of specificity controls, e.g., other J proteins or HOP, which, like DnaJC7, contains TPR domains and can interact with Hsp70 and Hsp90, it remains unclear if the proposed functions of DnaJC7 are specific/unique or shared by other J proteins or molecular chaperones. Accordingly, it would be highly informative to add experiments to assess if some of the reported DnaJC7 protein-protein interactions and the transcriptional alterations in haplo-insufficient cells are DnaJC7-specific or also occur with other J proteins or molecular chaperones. This seems particularly important to discern specific DnaJC7 functions from general effects caused by impaired proteostasis.

      We agree with the reviewer that this is a very interesting question; for example, mutations in DNAJC6 can cause rare forms of Parkinson’s Disease. However, addressing the functional overlap of DNAJC7 with other J proteins such as DNAJC6 would require substantial time and resources and is beyond the scope of the current manuscript.

      It would be informative to explore how cellular stress (e.g., MG132 treatment) alters DnaJC7 interactions with other proteins (J proteins, HOP), ideally in additional/comparative proteomic studies. The mechanism underlying the proposed regulation of Hsf1 by DnaJC7 is not quite clear to me (Figures 4 A-I). There is no evidence of a direct physical interaction between DnaJC7 and Hsf1 in the proteomic data or elsewhere. It seems plausible that Hsf1/HSR dysregulation in the haplo-insufficient cells might be due to rather indirect effects, e.g., increased protein misfolding. Also, additional data showing differential activation of Hsf1 in +/+ versus +/- cells would strengthen this part, e.g. showing differences in Hsf1 trimerization, Hsp70 interactions, nuclear localization, etc.

      The reviewer makes two good points here. First, we agree that we should provide additional data to better understand the differential activation of HSF1 in DNAJC7 heterozygous neurons, and we will focus on this question during the revision. Second, we agree that the mechanism underlying the regulation of HSF1 by DNAJC7 is not well defined, and we acknowledge it could be indirect. Of note, HSF1 activation is regulated by HSP70, of which DNAJC7 is a co-chaperone. We will attempt to define this mechanism better during the revision.

      The manuscript might also benefit from considering the literature showing an unusually inactive HSR and Hsf1 activity in motor neurons (e.g. published by the Durham lab).

      Yes, we did in fact note this in our discussion: “At the same time, mouse MNs have previously been shown to maintain a high threshold of induction of the HSF1-mediated stress response relative to other cell types including glial cells, with the suggestion that this contributes to their vulnerability to stress signals such as insoluble proteins.” We will further consider how our findings align with those of Durham et al. in the revised discussion.

      The correlation with transcriptomic data from ALS patients compared to neurotypical controls (Figures 4 L, M) suggesting a direct role of Hsf1/HSR seems unlikely at this point. In my view, the transcriptional dysregulation in ALS patients could be unrelated to Hsf1 dysregulation and caused by rather non-specific effects of neuronal decay in ALS.

      This is a very reasonable concern.  We acknowledge that the HSF1 effects in patients could be driven by multiple other factors including C9-DPRs etc. However, the point of this analysis is not to claim that DNAJC7 is the cause; but rather to highlight the importance of the HSF1 pathway, which we identified as being mis-regulated in DNAJC7 mutant neurons, as broadly relevant in sporadic and other forms of genetic ALS. 

      Reviewer #3 (Public review):

      Summary:

      Fleming et al sought to better understand DNAJC7's function in motor neurons as mutations in this gene have been associated with amyotrophic lateral sclerosis (ALS). The research question is relevant and important. The authors use an induced pluripotent stem cell (iPSC) line to derive motor neurons (iMNs) finding that DNAJC7 interacts with RNA-binding proteins (RBP) in wild-type cells and a truncated mutant DNAJC7[R156*] disrupts the RBP, hnRNPU, by promoting its accumulation into insoluble fractions. Given that DNAJC7 is predicted to regulate stress responses, the authors then find that DNAJC7[R156*] expression sensitizes the iMNs to proteasomal stress by disrupting the expression of the key heat stress response regulator, HSF1. These findings support that loss-of-function mutations in DNAJC7 will indeed sensitize motor neurons to proteotoxic stress, potentially driving ALS. The association with RBPs, which routinely are found to be disrupted in ALS, is of interest and warrants further study.

      Strengths

      (1) The research question is relevant and important. The authors provide interesting data that DNAJC7 mutations impact two important features in ALS, the dysregulation of RNA binding proteins and the sensitivity of motor neurons to proteotoxic stress.

      (2) The authors provide solid data to support their findings and the assays are appropriate.

      We thank the reviewer for highlighting the strengths of our study.

      Weaknesses

      (1) The authors rely on a single iPSC line throughout the text, using the same line to make the mutation-carrying cells. iPSCs are highly variable and at minimum 3 lines, typically 5 lines, should be used to define consistent findings. This work would be greatly strengthened if 3 or more lines were used to confirm consistent effects. This is particularly concerning given that iPSCs were differentiated using growth factors versus genetic induction. Growth-factor-based differentiations are more variable.

      We will substantiate the major findings by the use of additional models and genetic backgrounds during the revision. However, our experiments utilize isogenic controls and extensive quality control assays (on-target and off-target analysis, whole genome sequencing, karyotype etc.) to ensure that our isogenic lines are genomically identical, other than the DNAJC7 mutation, and thus any phenotypes are likely caused by mutant DNAJC7 itself.

      (2) The authors argue that HSF1 and its targets are downregulated in sporadic ALS and mutant C9orf72 ALS. The first concern is that these transcriptomics data were derived from cortical tissue which does not contain motor neurons (Pineda et al. 2024 Cell 187: 1971-1989.e1916). The second concern is that the inclusion of C9orf72 mutant tissue is not well justified as (1) this mutation is associated with an upregulation of HSF1 and its targets in patients (Mordes et al, Acta Neuropathol Commun 2018 6(1):55; Lee et al Neuron 2023 111(9):1381-1390) and (2) the C9orf72 mutation is associated with an ALS/FTD spectrum disorder defined by TDP-43 pathology. Disease mechanisms associated with this spectrum disorder may not overlap with traditional ALS which is typically defined by SOD1 pathology.

      SOD1 pathology represents only a small fraction (<2%) of all ALS patients and is therefore not traditional ALS. The majority (approximately 97%) of sporadic and familial ALS cases (including C9orf72 but excluding SOD1 and FUS cases) are uniformly characterized by TDP-43 pathology. Nevertheless, we do agree that it would be better to assess spinal cord data, but unfortunately such single cell datasets from ALS patients do not currently exist. We acknowledge that the HSF1 effects in patients could be driven by multiple other factors, including C9-DPRs. However, the point of this analysis is not to claim that DNAJC7 is the cause, but rather to highlight the importance of the HSF1 pathway, which we identified as being mis-regulated in DNAJC7 mutant neurons, as being broadly relevant in sporadic and other forms of genetic ALS.

      (3) As a whole, the findings are mechanistically disjointed, and additional experiments or discussion would help to connect the dots a bit more.

      We will revise the manuscript with additional experiments and discussion to better connect the dots.

      Citations

      (1) Kurian, M. A. & Abela, L. in GeneReviews® (eds M. P. Adam et al.) (University of Washington, Seattle, 1993-2025).

    1. eLife assessment

      The authors made a useful finding that Zizyphi spinosi semen, a traditional Chinese medicine, has demonstrated excellent biological activity and potential therapeutic effects against Alzheimer's disease (AD). The researchers presented the effects, but the research evidence for the mechanism was incomplete. The main claims were only partially supported.

    2. Reviewer #1 (Public review):

      Summary:

      The study shows that Zizyphi spinosi semen (ZSS), particularly its non-extracted simple crush powder, has significant therapeutic effects on neurodegenerative diseases. It removes Aβ, tau, and α-synuclein oligomers, restores synaptophysin levels, enhances BDNF expression and neurogenesis, and improves cognitive and motor functions in mouse AD, FTD, DLB, and PD models. Additionally, ZSS powder reduces DNA oxidation and cellular senescence in normal-aged mice, increases synaptophysin, BDNF, and neurogenesis, and enhances cognition to levels comparable to young mice.

      Weaknesses:

      (1) While the study demonstrates that ZSS has protective effects across a wide range of animal models, including AD, FTD, DLB, PD, and both young and aged mice, it is broad and lacks a detailed investigation into the underlying mechanisms. This is the most significant concern.

      (2) The authors highlight that the non-extracted simple crush powder of ZSS shows more substantial effects than its hot water extract and extraction residue. However, the manuscript provides very limited data comparing the effects of these three extracts.

      (3) The authors have not provided a rationale for the dosing concentrations used, nor have they tested the effects of the treatment in normal mice to verify its impact under physiological conditions.

      (4) Regarding the assessment of cognitive function in mice, the authors only utilized the Morris Water Maze (MWM) test, which includes a five-day spatial learning training phase followed by a probe trial. The authors focused solely on the learning phase. However, it is relevant to note that data from the learning phase primarily reflects the learning ability of the mice, while the probe trial is more indicative of memory. Therefore, it is essential that probe trial data be included for a more comprehensive analysis. A justification should also be included to explain why the latency on the first day is about 50 s rather than 60 s.

      (5) The BDNF immunohistochemical staining in the manuscript appears to be non-specific.

      (6) The central pathological regions in PD are the substantia nigra and striatum. Please replace the staining results from the cortex and hippocampus with those from these regions in the PD model.

    3. Reviewer #2 (Public review):

      Summary:

      The authors studied the effects of hot water extract, extraction residue, and non-extracted simple crush powder of ZSS in diseased or aged mice. It was found that ZSS played an anti-neurodegenerative role by removing toxic proteins, repairing damaged neurons, and inhibiting cell senescence.

      Strengths:

      The authors studied the effects of ZSS in different transgenic mice and analyzed the different states of ZSS and the effects of different components.

      Weaknesses:

      The authors' study lacked an in-depth exploration of mechanisms, including changes in intracellular signal transduction, drug targets, and drug toxicity detection.

    4. Reviewer #3 (Public review):

      ZSS has been widely used in Traditional Chinese Medicine as a sleep-promoting herb. This study tests the effects of ZSS powder and extracts on AD, PD, and aging, and broad protective effects were revealed in mice.

      However, this work did not include a mechanistic study or target data on ZSS, and PK data were also not included. Studies of the mechanisms or targets and a PK study are suggested. A human PK study is preferred over one in mice or rats, e.g., to determine the main active ingredients and their concentrations in plasma and, in this context, to study the pharmacological mechanisms of ZSS.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      (1) While the study demonstrates that ZSS has protective effects across a wide range of animal models, including AD, FTD, DLB, PD, and both young and aged mice, it is broad and lacks a detailed investigation into the underlying mechanisms. This is the most significant concern.

      We appreciate this comment. We recognize that elucidating the mechanism is an important research topic, and we are currently working on it. The purpose of publishing this paper at this time is to inform the public as soon as possible about natural materials and methods that may be effective in preventing dementia and neurodegenerative diseases, and to encourage similar research.

      (2) The authors highlight that the non-extracted simple crush powder of ZSS shows more substantial effects than its hot water extract and extraction residue. However, the manuscript provides very limited data comparing the effects of these three extracts.

      Certainly, it would be better to compare them in several different models, but we believe that important results have already been obtained in tau Tg mice, and comparative data in other models are just additive and confirmatory.

      (3) The authors have not provided a rationale for the dosing concentrations used, nor have they tested the effects of the treatment in normal mice to verify its impact under physiological conditions.

      As described in the Materials and Methods section, the dosage was determined based on the results of preliminary experiments. The beneficial effects in normal mice are shown in Figure 5.

      (4) Regarding the assessment of cognitive function in mice, the authors only utilized the Morris Water Maze (MWM) test, which includes a five-day spatial learning training phase followed by a probe trial. The authors focused solely on the learning phase. However, it is relevant to note that data from the learning phase primarily reflects the learning ability of the mice, while the probe trial is more indicative of memory. Therefore, it is essential that probe trial data be included for a more comprehensive analysis. A justification should also be included to explain why the latency on the first day is about 50 s rather than 60 s.

      We agree that it is better to include the results of the probe test. We did not include them this time, but we would like to include them in the future. In the memory acquisition training, five trials were performed per day. Since the mice learned the location of the platform during the first five trials, the latency on the first day became around 50 seconds.

      (5) The BDNF immunohistochemical staining in the manuscript appears to be non-specific.

      We cannot understand the basis for saying it is non-specific.

      (6) The central pathological regions in PD are the substantia nigra and striatum. Please replace the staining results from the cortex and hippocampus with those from these regions in the PD model.

      We examined the substantia nigra and found that synuclein pathology appeared in Tg mice and was suppressed by ZSS administration. However, because we did not investigate the striatum, we decided not to show the results for the nigrostriatal system this time. Instead, we thought that we could demonstrate the inhibitory effect of ZSS on synuclein pathology by showing the results for the cortex and hippocampus, which showed early functional decline in these mice (Fig. 4E).

      Reviewer #2 (Public review):

      The authors' study lacked an in-depth exploration of mechanisms, including changes in intracellular signal transduction, drug targets, and drug toxicity detection.

      We appreciate this comment. We understand that the mechanism, targets, and toxicity are important issues to be considered in the future.

      Reviewer #3 (Public review):

      However, this work did not include a mechanistic study or target data on ZSS, and PK data were also not included. Studies of the mechanisms or targets and a PK study are suggested. A human PK study is preferred over one in mice or rats, e.g., to determine the main active ingredients and their concentrations in plasma and, in this context, to study the pharmacological mechanisms of ZSS.

      We appreciate this comment. We understand that the mechanism and target are important issues to consider in the future. As the reviewer pointed out, to conduct PK studies, we must first identify the active ingredients. Unfortunately, we have not been able to identify them yet.

      Reviewer #2 (Recommendations for the authors):

      The authors have proved that ZSS has neuroprotective effects through rigorous animal experiments. However, ZSS contains other active substances besides jujuboside A, jujuboside B, and spinosin, which is more concerning. More critical data may be obtained if experiments have been designed to search for active substances.

      We appreciate this suggestion. We recognize that identifying the true active ingredients is a very important issue. Future studies will be designed to identify them and elucidate their mechanism of action.

    1. eLife Assessment

      This useful study presents a possible solution for a significant problem - that of draining vein sensitivity in functional MRI, which complicates the interpretability of laminar-fMRI results. The addition of a low diffusion-weighted gradient is presented to remove the draining vein signal and obtain functional responses with higher spatial fidelity. However, the strength of the evidence is incomplete, and most tests appear to have been done only in a single subject. Significance thresholds in presented maps are very low and most cortical depth-dependent response profiles do not differ from baseline, even in the BOLD data shown as reference. Curiously, even BOLD group data fails to replicate the well-known pattern of draining towards the cortical surface.

    2. Reviewer #1 (Public review):

      Summary:

      This study aims to provide imaging methods for users in the field of human layer-fMRI. This is an emerging field with 240 papers published so far. Contrary to what the manuscript implies, 3T is well represented among those papers (e.g., see the papers listed below that are not cited in the manuscript). Thus, the claimed impact of developing 3T methodology for wider dissemination is not justified, specifically because some of the previous papers perform whole brain layer-fMRI (also at 3T) with more efficient and more established procedures.

      The authors implemented a sequence with lots of nice features, including their own SMS EPI, diffusion bipolar pulses, and eye-saturation bands, and they built their own reconstruction around it. This is not trivial. Only a few labs around the world have this level of engineering expertise. I applaud this technical achievement. However, I doubt that any of this is the right tool for layer-fMRI, nor does it represent an advancement for the field. In the thermal noise dominated regime of sub-millimeter fMRI (especially at 3T) it is established to use 3D readouts over 2D (SMS) readouts. While it is not trivial to implement SMS, the vendor implementations (as well as the CMRR and MGH implementations) are already the most widely applied across the majority of current fMRI studies. The authors' work on this does not address any previous shortcomings in the field.

      The mechanism to use bi-polar gradients to increase the localization specificity is doubtful to me. In my understanding, killing the intra-vascular BOLD should make it less specific. Also, the empirical data do not suggest a higher localization specificity to me.

      Embedding this work in the literature of previous methods is incomplete. Recent trends of vessel signal manipulation with ABC or VAPER are not mentioned. Comparisons with VASO are outdated and incorrect.

      The reproducibility of the methods and the result is doubtful (see below).

      I don't think that this manuscript is in the top 50% of the 240 layer-fMRI papers out there.

      3T layer-fMRI papers that are not cited:

      Taso, M., Munsch, F., Zhao, L., Alsop, D.C., 2021. Regional and depth-dependence of cortical blood-flow assessed with high-resolution Arterial Spin Labeling (ASL). Journal of Cerebral Blood Flow and Metabolism. https://doi.org/10.1177/0271678X20982382

      Wu, P.Y., Chu, Y.H., Lin, J.F.L., Kuo, W.J., Lin, F.H., 2018. Feature-dependent intrinsic functional connectivity across cortical depths in the human auditory cortex. Scientific Reports 8, 1-14. https://doi.org/10.1038/s41598-018-31292-x

      Lifshits, S., Tomer, O., Shamir, I., Barazany, D., Tsarfaty, G., Rosset, S., Assaf, Y., 2018. Resolution considerations in imaging of the cortical layers. NeuroImage 164, 112-120. https://doi.org/10.1016/j.neuroimage.2017.02.086

      Puckett, A.M., Aquino, K.M., Robinson, P.A., Breakspear, M., Schira, M.M., 2016. The spatiotemporal hemodynamic response function for depth-dependent functional imaging of human cortex. NeuroImage 139, 240-248. https://doi.org/10.1016/j.neuroimage.2016.06.019

      Olman, C.A., Inati, S., Heeger, D.J., 2007. The effect of large veins on spatial localization with GE BOLD at 3 T: Displacement, not blurring. NeuroImage 34, 1126-1135. https://doi.org/10.1016/j.neuroimage.2006.08.045

      Ress, D., Glover, G.H., Liu, J., Wandell, B., 2007. Laminar profiles of functional activity in the human brain. NeuroImage 34, 74-84. https://doi.org/10.1016/j.neuroimage.2006.08.020

      Huber, L., Kronbichler, L., Stirnberg, R., Ehses, P., Stocker, T., Fernández-Cabello, S., Poser, B.A., Kronbichler, M., 2023. Evaluating the capabilities and challenges of layer-fMRI VASO at 3T. Aperture Neuro 3. https://doi.org/10.52294/001c.85117

      Scheeringa, R., Bonnefond, M., van Mourik, T., Jensen, O., Norris, D.G., Koopmans, P.J., 2022. Relating neural oscillations to laminar fMRI connectivity in visual cortex. Cerebral Cortex. https://doi.org/10.1093/cercor/bhac154

      Strengths:

      See above. The authors developed their own SMS sequence with many features. This is important to the field, and it does not leave sequence development work to a few isolated monopoly labs. This work democratises SMS.

      The questions addressed here are of high relevance to the field: getting tools with good sensitivity, user-friendly applicability, and locally specific brain activity mapping is an important topic in the field of layer-fMRI.

      Weaknesses:

      (1) I feel the authors need to justify why flow-crushing helps localization specificity. There is an entire family of recent papers that aims to achieve higher localization specificity by doing the exact opposite. Namely, MT or ABC fMRI aims to increase the localization specificity by highlighting the intravascular BOLD by means of suppressing non-flowing tissue. To name a few:

      Priovoulos, N., de Oliveira, I.A.F., Poser, B.A., Norris, D.G., van der Zwaag, W., 2023. Combining arterial blood contrast with BOLD increases fMRI intracortical contrast. Human Brain Mapping hbm.26227. https://doi.org/10.1002/hbm.26227.

      Pfaffenrot, V., Koopmans, P.J., 2022. Magnetization Transfer weighted laminar fMRI with multi-echo FLASH. NeuroImage 119725. https://doi.org/10.1016/j.neuroimage.2022.119725

      Schulz, J., Fazal, Z., Metere, R., Marques, J.P., Norris, D.G., 2020. Arterial blood contrast ( ABC ) enabled by magnetization transfer ( MT ): a novel MRI technique for enhancing the measurement of brain activation changes. bioRxiv. https://doi.org/10.1101/2020.05.20.106666

      Based on this literature, it seems that the proposed method will make the vein problem worse, not better. The authors could make it clearer how they reason that making GE-BOLD signals more extra-vascular weighted should help to reduce large vein effects.

      The empirical evidence for the claim that flow crushing helps with the localization specificity should be made clearer. The response magnitude with and without flow crushing looks pretty much identical to me (see Fig. 6d).

      It's unclear to me what to look for in Fig. 5. I cannot discern any layer patterns in these maps. It's too noisy. The two maps of TE=43ms look like identical copies of each other. Maybe an editorial error?

      The authors discuss bipolar crushing with respect to SE-BOLD where it has been previously applied. For SE-BOLD at UHF, a substantial portion of the vein signal comes from the intravascular compartment. So I agree that for SE-BOLD, it makes sense to crush the intravascular signal. For GE-BOLD however, this reasoning does not hold. For GE-BOLD (even at 3T), most of the vein signal comes from extravascular dephasing around large unspecific veins and the bipolar crushing is not expected to help with this.

      (2) The bipolar crushing is limited to one single direction of flow. This introduces a lot of artificial variance across the cortical folding pattern. This is not mentioned in the manuscript. There is an entire family of papers that perform layer-fMRI with black-blood imaging that solves this with a 3D contrast preparation (VAPER) that is applied across a longer time period, thus killing the blood signal while it flows across all directions of the vascular tree. Here, the signal crushing is happening with a 2D readout as a "snap-shot" crushing. This does not allow the blood to flow in multiple directions.

      VAPER also accounts for BOLD contaminations of larger draining veins by means of a tag-control sampling. The proposed approach here does not account for this contamination.

      Chai, Y., Li, L., Huber, L., Poser, B.A., Bandettini, P.A., 2020. Integrated VASO and perfusion contrast: A new tool for laminar functional MRI. NeuroImage 207, 116358. https://doi.org/10.1016/j.neuroimage.2019.116358

      Chai, Y., Liu, T.T., Marrett, S., Li, L., Khojandi, A., Handwerker, D.A., Alink, A., Muckli, L., Bandettini, P.A., 2021. Topographical and laminar distribution of audiovisual processing within human planum temporale. Progress in Neurobiology 102121. https://doi.org/10.1016/j.pneurobio.2021.102121

      If I would recommend anyone to perform layer-fMRI with blood crushing, it seems that VAPER is the superior approach. The authors could make it clearer why users might want to use the unidirectional crushing instead.

      (3) The comparison with VASO is misleading.

      The authors claim that previous VASO approaches were limited by TRs of 8.2s. The authors might be advised to check the literature of recent years.

      Koiso et al. have performed whole brain layer-fMRI VASO at 0.8mm at 3.9 seconds (with reliable activation), 2.7 seconds (with an unconvincing activation pattern, though), and 2.3 seconds (without activation).

      Also, whole brain layer-fMRI BOLD at 0.5mm and 0.7mm has been previously performed by the Juelich group at TRs of 3.5s (their TR definition is 'fishy' though).

      Koiso, K., Müller, A.K., Akamatsu, K., Dresbach, S., Gulban, O.F., Goebel, R., Miyawaki, Y., Poser, B.A., Huber, L., 2023. Acquisition and processing methods of whole-brain layer-fMRI VASO and BOLD: The Kenshu dataset. Aperture Neuro 34. https://doi.org/10.1101/2022.08.19.504502

      Yun, S.D., Pais‐Roldán, P., Palomero‐Gallagher, N., Shah, N.J., 2022. Mapping of whole‐cerebrum resting‐state networks using ultra‐high resolution acquisition protocols. Human Brain Mapping. https://doi.org/10.1002/hbm.25855

      Pais-Roldan, P., Yun, S.D., Palomero-Gallagher, N., Shah, N.J., 2023. Cortical depth-dependent human fMRI of resting-state networks using EPIK. Front. Neurosci. 17, 1151544. https://doi.org/10.3389/fnins.2023.1151544

      The authors are correct that VASO is not advised as a turn-key method for lower brain areas, incl. Hippocampus and subcortex. However, the authors use this word of caution that is intended for inexperienced "users" as a statement that this cannot be performed. This statement is taken out of context. This statement is not from the academic literature. It's advice for the 40+ user base that want to perform layer-fMRI as a plug-and-play routine tool in neuroscience usage. In fact, sub-millimeter VASO is routinely being performed by MRI-physicists across all brain areas (including deep brain structures, hippocampus etc). E.g. see Koiso et al. and an overview lecture from a layer-fMRI workshop that I had recently attended: https://youtu.be/kzh-nWXd54s?si=hoIJjLLIxFUJ4g20&t=2401

      Thus, the authors could embed this phrasing into the context of their own method that they are proposing in the manuscript. E.g. the authors could state whether they think that their sequence has the potential to be disseminated across sites, considering that it requires slow offline reconstruction in Matlab?

      Do the authors think that the results shown in Fig. 6c are suggesting turn-key acquisition of a routine mapping tool? In my humble opinion it looks like random noise, with most of the activation outside the ROI (in white matter).

      (4) The repeatability of the results is questionable.

      The authors perform experiments about the robustness of the method (line 620). The corresponding results do not suggest any robustness to me. In fact, the layer profiles in Fig. 4c vs. Fig. 4d are completely opposite. Locations of peaks turn into locations of dips and vice versa.

      The methods are not described in enough detail to reproduce these results.

      The authors mention that their image reconstruction is done "using in-house MATLAB code" (line 634). They do not post a link to GitHub, nor do they say if they share this code.

      It is not trivial to get good phase data for fMRI. The authors do not mention how they perform the respective coil-combination.

      No data are shared for reproduction of the analysis.

      (5) The application of NORDIC is not validated.

      Previous applications of NORDIC at 3T layer-fMRI have resulted in mixed success. When not adjusted for the right SNR regime it can result in artifactual reductions of beta scores, depending on the SNR across layers. The authors could validate their application of NORDIC and confirm that the average layer-profiles are unaffected by the application of NORDIC. Also, the NORDIC version should be explicitly mentioned in the manuscript.

      Akbari, A., Gati, J.S., Zeman, P., Liem, B., Menon, R.S., 2023. Layer Dependence of Monocular and Binocular Responses in Human Ocular Dominance Columns at 7T using VASO and BOLD (preprint). Neuroscience. https://doi.org/10.1101/2023.04.06.535924

      Knudsen, L., Guo, F., Huang, J., Blicher, J.U., Lund, T.E., Zhou, Y., Zhang, P., Yang, Y., 2023. The laminar pattern of proprioceptive activation in human primary motor cortex. bioRxiv. https://doi.org/10.1101/2023.10.29.564658

      Comments on revisions:

      Among all the concerns mentioned above, I think only one of the specific issues was sufficiently addressed: the authors implemented a combination of three consecutive-dimensional flow crushers. The other concerns were not sufficiently addressed to change my confidence level in the study.

      - While the abstract still focuses on the utility of using 3T, the authors do not give credit to the early 3T layer-fMRI papers that led the way to larger coverage and connectivity applications.

      - While the authors' choice of a custom SMS 2D readout is justified for them, I do not think that this very method will see widespread use in 3T whole brain connectivity experiments across the global 3T community. This lowers the impact of the paper.

      - The images in Fig. 5 are still suspiciously similar, to the level that the noise pattern outside the brain is identical across large parts of the maps with and without PR.

      - Maybe it's my ignorance, but I still do not see why flow crushing focuses the local BOLD responses to small vessels.

      - While my sense of a misleading representation of the literature had been accompanied by explicit references, the authors claim that they cannot find them, or claim that they are about something else (which they are not, in my view).

      - Data and software are still not shared (not even example data, or nii data).

    3. Reviewer #2 (Public review):

      This study developed a setup for laminar fMRI at 3T that aimed to get the best from all worlds in terms of brain coverage, temporal resolution, sensitivity to detect functional responses and spatial specificity. They used a gradient-echo EPI readout to facilitate sensitivity, brain coverage and temporal resolution. The former was additionally boosted by NORDIC denoising and the latter two were further supported by acceleration both in-plane and across slices. The authors evaluated whether the implementation of velocity-nulling (VN) gradients could mitigate macrovascular bias, known to hamper laminar specificity of gradient-echo BOLD.

      Strengths:

      The setup includes 0.9 mm isotropic acquisitions with large coverage at a reasonable TR. These parameters are hard to optimize simultaneously, and I applaud the ambitious attempt to get "the best from all worlds" (large coverage, high spatio/temporal resolution, spatial specificity, sensitivity), which is sought after in the field. Also, in terms of the availability of the method, it is favorable that it benefits from lower field strength (additional time for VN-gradient implementation, afforded by longer gray matter T2*). Furthermore, I like that the authors took steps to improve the original manuscript by e.g., collecting more data, adjusting the VN implementation to include flow-suppression along three rather than a single dimension, and adjusting the ROI-definition procedure to avoid circularity issues.

      That being said, I still find the evidence weak in terms of this sequence achieving high spatial specificity and sensitivity. The results feel oversold and further validation is needed to make a case for the authors' conclusion that "[...] the potential impact of this development is expected to be extensive across various domains of neuroscience research". This is elaborated in the comments below:

      The authors acknowledge that the VN setup in its current form probably does not suppress the impact of most ascending veins (these are also not targeted by phase regression, as most are probably too small to produce sufficiently large phase responses). This seems to limit the theoretical support for the authors' claim of reduced inter-layer blurring (e.g. the claim that deep and superficial signals are less coupled with VN gradients than without based on Fig 6-7). This limitation notwithstanding, the method may still be helpful for limiting laminar dependencies by suppressing pial vein responses (which may carry signal from distant regions and layers that blur into superficial layers if left unsuppressed). Unfortunately, the empirical support for VN gradients suppressing superficial bias seems quite weak and is hard to evaluate. For example, the profiles in Figure 4 do not consistently show clearly less superficial bias when VN gradients are on - this might partly be due to the fact that clear bias was not always present in the profiles even without VN. I suspect this is largely explained by the selection of very small and quite unrepresentative ROIs. The corresponding activation maps appear strongly weighted towards CSF which is not always captured in the profile. I recommend sampling a much larger patch of cortex to more accurately capture the actual underlying bias. In this way, all non-VN profiles should have clear bias which should be clearly suppressed for VN if the method is effective. The authors do evaluate the effect of VN/phase regression based on a large activated region in visual cortex (Fig 5) - why not show laminar profiles from here, which is an obvious way to show the effect on superficial bias? I think such evaluations would be a more direct way of evaluating the method's impact on specificity, and are necessary for subsequent FC evaluations to be convincing.

      The phase regression results are described inconsistently. In the results section, the authors, in my opinion, "correctly" acknowledge that phase regression seemed to have a very minor impact. However, in the discussion section it is described as if phase regression was effective in suppressing macrovascular responses (L 553-558), which the results do not support (especially based on profiles in Fig 4). There is barely any difference with/without phase regression, which may be due to the fact that ordinary least squares regression was chosen over a Deming model which accounts for noise on the phase regressor. Although the authors correctly mentioned in their "answers to reviewers" that the required noise-ratio between magnitude and phase data can be hard to estimate, attempts at this have been described in previous phase regression studies which showed much larger effects (see e.g. Stanley et al. 2020, Knudsen et al. 2023).
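
      The contrast between ordinary least squares and a Deming model raised above can be illustrated with a minimal sketch on synthetic data. It is not the authors' pipeline: the variable names, the noise levels, and the assumption that the noise-variance ratio `delta` is known are all hypothetical choices made for illustration. The sketch only shows that noise on the regressor attenuates the OLS slope, whereas a Deming fit with the correct ratio does not share this bias.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least squares slope; assumes the regressor x is noise-free."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(np.dot(x_c, y_c) / np.dot(x_c, x_c))

def deming_slope(x, y, delta=1.0):
    """Deming regression slope.

    delta = var(noise in y) / var(noise in x); assumed known here,
    which in practice is the quantity that is hard to estimate.
    """
    x_c = x - x.mean()
    y_c = y - y.mean()
    s_xx = np.mean(x_c ** 2)
    s_yy = np.mean(y_c ** 2)
    s_xy = np.mean(x_c * y_c)
    diff = s_yy - delta * s_xx
    return float((diff + np.sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy))

# Synthetic example (all values hypothetical): magnitude depends linearly on phase,
# and both the phase regressor and the magnitude are observed with noise.
rng = np.random.default_rng(0)
n = 5000
true_slope = 2.0
phase_true = rng.normal(0.0, 1.0, n)
phase_obs = phase_true + rng.normal(0.0, 0.5, n)                   # noisy regressor
magnitude_obs = true_slope * phase_true + rng.normal(0.0, 0.5, n)  # noisy response

slope_ols = ols_slope(phase_obs, magnitude_obs)
slope_deming = deming_slope(phase_obs, magnitude_obs, delta=1.0)   # equal noise variances
print(f"true={true_slope:.2f}  OLS={slope_ols:.2f}  Deming={slope_deming:.2f}")
```

      In this toy setting an underestimated slope means that less of the phase-correlated signal is removed from the magnitude data, which is one way a regression-based correction could appear to have little effect.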

      I like that the authors put in additional efforts to provide analyses to validate their NORDIC implementation. However, this needs to be done on the VN setup directly, not the "regular BOLD setup" with b=0, since the ability of NORDIC to distinguish signal and noise components depends on CNR which is expected to deviate for these setups. Also, it seems z-scores and confidence intervals were computed based on GLM residuals which may lead to inflated z-values and overly narrow CI's due to reduced degrees of freedom following denoising. The denoised z-maps from Fig 3 indeed look somewhat strange, i.e. seemingly increased false positives (more salt/pepper and a bunch of white matter activation) with very weak hand knob activation. Also, something must be wrong with the CIs on the laminar profiles - they seem extremely narrow despite noise levels obviously being high for highly accelerated 3T submillimeter results extracted from a very small ROI. The authors may consider computing these statistics from variance across trials instead.

      Given that the idea of the setup is to take advantage in terms of sensitivity by using GE-BOLD contrast relative to e.g. SE-EPI or CBV-weighted setups, they need to carefully demonstrate the sensitivity of their setup, which could be limited by high acceleration factors, the VN gradients, low field strength, etc. I like that they now put more emphasis on non-masked activation maps, but further comparison could be made through tSNR maps, raw single-volume images, raw timeseries, CNR based on across-trial variance, etc.
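
      As a rough illustration of the sensitivity metrics mentioned here, the sketch below computes a single-voxel tSNR and response statistics derived from trial-to-trial variability. The function names and the CNR definition (mean response divided by across-trial standard deviation) are assumptions made for this example, not definitions taken from the manuscript.

```python
import math
import statistics


def tsnr(timeseries):
    """Temporal SNR of one voxel: mean over time / standard deviation over time."""
    return statistics.fmean(timeseries) / statistics.stdev(timeseries)


def across_trial_stats(trial_amplitudes):
    """Mean response, standard error, and CNR from across-trial variability."""
    n = len(trial_amplitudes)
    mean = statistics.fmean(trial_amplitudes)
    sd = statistics.stdev(trial_amplitudes)  # sample SD across trials
    return {
        "mean": mean,
        "sem": sd / math.sqrt(n),  # standard error of the mean response
        "cnr": mean / sd,          # assumed CNR definition for this sketch
    }


# Hypothetical values: a short voxel time series and per-trial % signal changes
print(tsnr([98.0, 100.0, 102.0]))
print(across_trial_stats([1.0, 2.0, 3.0]))
```

      Because the variability is taken directly from independent trials, statistics of this kind do not depend on the degrees of freedom of GLM residuals after denoising.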

      The major rationale for the setup is to achieve functional connectivity (FC) with brain-wide coverage at laminar resolutions, but it is framed as if this is something that has not been possible in the past with existing setups (statements such as: "Despite advancements in acquisition speed, current CBV/CBF-based fMRI techniques remain inadequate for layer-dependent resting-state fMRI", L138-140). To me, the functional connectivity results presented here with the VN setup are clearly less convincing than what has been shown with e.g. CBV-weighted acquisitions (e.g. Huber et al. 2021, Chai et al. 2024). The VN setup might also have advantages such as larger coverage as mentioned by the authors, but they fail to balance the comparison by highlighting where previous studies had clear edges. Thus, the impact of the results needs to be toned down and a more balanced comparison with existing laminar FC studies is warranted. For example, the authors could acknowledge that the CBV-weighted studies demonstrate much higher spatial specificity.

      Overall I would recommend a stronger emphasis on validating the claims about the sequence on task-based data for which there is a large body of literature to benchmark against (e.g. laminar fMRI studies in V1 and M1), before going to FC where the base for comparison and reference is much more limited in humans at laminar scales.

    4. Reviewer #3 (Public review):

      Summary:

      The authors are looking for a spatially specific functional brain response to visualise non-invasively with 3T (clinical field strength) MRI. They propose a velocity-nulled weighting to remove signal from draining veins in a submillimeter multiband acquisition.

      Strengths:

      - This manuscript addresses a real need in the cognitive neuroscience community interested in imaging responses in cortical layers in-vivo in humans.
      - An additional benefit is the proposed implementation at 3T, a widely available field strength.

      Weaknesses:

      - The comparison in Figure 4 for different b-values shows % signal changes. However, as the baseline signal changes with added diffusion weighting, this is rather uninformative. A plot of t-values against cortical depth would be more insightful.
      - Surprisingly, the %-signal change for a b-value of 0 is below 1% for 3/4 participants, even at the cortical surface. This raises some doubts about the task or ROI definition. A finger-tapping task should reliably engage the primary motor cortex, even at 3T, and even in individual participants.
      - The double peak pattern in the BOLD-weighted images in Figure 4 is unexpected given the existing literature on BOLD responses as a function of cortical depth.
      - Although I'd like to applaud the authors for their ambition with the connectivity analysis, the low significance threshold used in these maps (z=1.64) leads to concerns about the SNR of the underlying data.

      I remain unconvinced of the conclusion that the developed VN fMRI exhibited layer specificity - the double peak which is taken as a marker of specificity is not absent in the BOLD responses either, and overall BOLD and VN response profiles as a function of cortical depth are quite similar.

    5. Author response:

      The following is the authors’ response to the original reviews.

      General responses:

      The authors sincerely thank all the reviewers for their valuable and constructive comments. We also apologize for the long delay in providing this rebuttal due to logistical and funding challenges. In this revision, we modified the bipolar gradients from one single direction to all three directions. Additionally, in response to the concerns regarding data reliability, we conducted a thorough examination of each step in our data processing pipeline. In the original processing workflow, the projection-onto-convex-set (POCS) method was used for partial Fourier reconstruction. Upon examination, we found that applying the POCS method after parallel image reconstruction significantly altered the signal and resulted in considerable loss of functional features. Furthermore, the original scan protocol employed a TE of 46 ms, which is notably longer than the typical TE of 33 ms. A prolonged TE can increase the ratio of extravascular to intravascular contributions. Importantly, the impact of TE on the efficacy of phase regression remains unclear, introducing potential confounding effects. To address these issues, we revised the protocol by shortening the TE from 46 ms to 39 ms. This adjustment was achieved by modifying the SMS factor to 3 and the in-plane acceleration rate to 3, thereby minimizing the confounding effects associated with an extended TE.

      Following these changes, we re-collected task-based fMRI data (N=4) and resting-state fMRI data (N=14) under the updated protocol. Using the revised dataset, we validated layer-specific functional connectivity (FC) through seed-based analyses. These analyses revealed distinct connectivity patterns in the superficial and deep layers of the primary motor cortex (M1), with statistically significant inter-layer differences. Furthermore, additional analyses with a seed in the primary sensory cortex (S1) corroborated the robustness and reliability of the revised methodology. We also changed the ‘directed’ functional connectivity in the title to ‘layer-specific’ functional connectivity, as drawing conclusions about directionality requires auxiliary evidence beyond the scope of this study.

      We provide detailed responses to the reviewers’ comments below.

      Reviewer #1 (Public Review):

      Summary:

      (1)   This study aims to provide imaging methods for users of the field of human layer-fMRI. This is an emerging field with 240 papers published so far. Different than implied in the manuscript, 3T is well represented among those papers. E.g. see the papers below that are not cited in the manuscript. Thus, the claim on the impact of developing 3T methodology for wider dissemination is not justified. Specifically, because some of the previous papers perform whole brain layer-fMRI (also at 3T) in more efficient, and more established procedures.

      3T layer-fMRI papers that are not cited:

      Taso, M., Munsch, F., Zhao, L., Alsop, D.C., 2021. Regional and depth-dependence of cortical blood-flow assessed with high-resolution Arterial Spin Labeling (ASL). Journal of Cerebral Blood Flow and Metabolism. https://doi.org/10.1177/0271678X20982382

      Wu, P.Y., Chu, Y.H., Lin, J.F.L., Kuo, W.J., Lin, F.H., 2018. Feature-dependent intrinsic functional connectivity across cortical depths in the human auditory cortex. Scientific Reports 8, 1-14. https://doi.org/10.1038/s41598-018-31292-x

      Lifshits, S., Tomer, O., Shamir, I., Barazany, D., Tsarfaty, G., Rosset, S., Assaf, Y., 2018. Resolution considerations in imaging of the cortical layers. NeuroImage 164, 112-120. https://doi.org/10.1016/j.neuroimage.2017.02.086

      Puckett, A.M., Aquino, K.M., Robinson, P.A., Breakspear, M., Schira, M.M., 2016. The spatiotemporal hemodynamic response function for depth-dependent functional imaging of human cortex. NeuroImage 139, 240-248. https://doi.org/10.1016/j.neuroimage.2016.06.019

      Olman, C.A., Inati, S., Heeger, D.J., 2007. The effect of large veins on spatial localization with GE BOLD at 3 T: Displacement, not blurring. NeuroImage 34, 1126-1135. https://doi.org/10.1016/j.neuroimage.2006.08.045

      Ress, D., Glover, G.H., Liu, J., Wandell, B., 2007. Laminar profiles of functional activity in the human brain. NeuroImage 34, 74-84. https://doi.org/10.1016/j.neuroimage.2006.08.020

      Huber, L., Kronbichler, L., Stirnberg, R., Ehses, P., Stocker, T., Fernández-Cabello, S., Poser, B.A., Kronbichler, M., 2023. Evaluating the capabilities and challenges of layer-fMRI VASO at 3T. Aperture Neuro 3. https://doi.org/10.52294/001c.85117

      Scheeringa, R., Bonnefond, M., van Mourik, T., Jensen, O., Norris, D.G., Koopmans, P.J., 2022. Relating neural oscillations to laminar fMRI connectivity in visual cortex. Cerebral Cortex. https://doi.org/10.1093/cercor/bhac154

      We thank the reviewer for listing eight papers related to 3T layer-fMRI. The primary goal of our work is to develop a methodology for brain-wide, layer-dependent resting-state functional connectivity at 3T. Upon review of the cited papers, we found that:

      (1) One study (Lifshits et al.) was not an fMRI study.

      (2) One study (Olman et al.) was conducted at 7T, not 3T.

      (3) Two studies (Taso et al. and Wu et al.) employed relatively large voxel sizes (1.6 × 2.3 × 5 mm³ and 1.5 mm isotropic, respectively), which limits layer specificity.

      (4) Only one of the listed studies (Huber et al., Aperture Neuro 2023) provides coverage of more than half of the brain.

      While each of these studies offers valuable insights, the VASO study by Huber et al. is the most relevant to our work, given its brain-wide coverage. However, the VASO method employs a relatively long TR (14.137 s), which may not be optimal for resting-state functional connectivity analyses.

      To address these limitations, our proposed method achieves submillimeter resolution, layer specificity, brain-wide coverage, and a significantly shorter TR (<5 s) altogether. We believe this advancement provides a meaningful contribution to the field, enabling broader applicability of layer-fMRI at 3T.

      (2) The authors implemented a sequence with lots of nice features. Including their own SMS EPI, diffusion bipolar pulses, eye-saturation bands, and they built their own reconstruction around it. This is not trivial. Only a few labs around the world have this level of engineering expertise. I applaud this technical achievement. However, I doubt that any of this is the right tool for layer-fMRI, nor does it represent an advancement for the field. In the thermal noise dominated regime of sub-millimeter fMRI (especially at 3T), it is established to use 3D readouts over 2D (SMS) readouts. While it is not trivial to implement SMS, the vendor implementations (as well as the CMRR and MGH implementations) are most widely applied across the majority of current fMRI studies already. The authors' work on this does not address any previous shortcomings in the field.

      We would like to thank the reviewer for their comments and the recognition of the technical efforts in implementing our sequence. We would like to address the points raised:

      (1) We completely agree that in-house implementation of existing techniques does not constitute an advancement for the field. We did not claim otherwise in the manuscript. Our focus was on the development of a method for brain-wide, layer-dependent resting-state functional connectivity at 3T, as mentioned in the response above.

      (2) The reviewer stated that "it is established to use 3D readouts over 2D (SMS) readouts". This is a strong claim, and we believe it requires robust evidence to support it. While it is true that 3D readouts can achieve higher tSNR in certain regions, such as the central brain, as shown in the study by Vizioli et al. (ISMRM 2020 abstract; https://cds.ismrm.org/protected/20MProceedings/PDFfiles/3825.html?utm_source=chatgpt.com ), higher tSNR does not necessarily equate to improved detection power in fMRI studies. For instance, Le Ster et al. (PLOS ONE, 2019; https://doi.org/10.1371/journal.pone.0225286 ) demonstrated that while 3D EPI had higher tSNR in the central brain, SMS EPI produced higher t-scores in activation maps.

      (3) When choosing between SMS EPI and 3D EPI, multiple factors should be taken into account, not just tSNR. For example, SMS EPI and 3D EPI differ in their sensitivity to motion and the complexity of motion correction. The choice between them depends on the specific research goals and practical constraints.

      (4) We are open to different readout strategies, provided they can be demonstrated to be suitable for the research goals. In this study, we opted for 2D SMS primarily due to logistical considerations. This choice does not preclude the potential use of 3D readouts in the future if they are deemed more appropriate for the project objectives.

      The mechanism to use bi-polar gradients to increase the localization specificity is doubtful to me. In my understanding, killing the intra-vascular BOLD should make it less specific. Also, the empirical data do not suggest a higher localization specificity to me.

      We elaborate on the mechanism and reasoning in the later responses.

      Embedding this work in the literature of previous methods is incomplete. Recent trends of vessel signal manipulation with ABC or VAPER are not mentioned. Comparisons with VASO are outdated and incorrect.

      The reproducibility of the methods and the result is doubtful (see below).

      In this revision, we updated the scan protocol and re-collected the imaging data. Detailed explanations and revised results are provided in the later responses.

      I don't think that this manuscript is in the top 50% of the 240 layer-fmri papers out there.

      We respect the reviewer’s personal opinion. However, we can only address scientific comments or critiques.

      Strengths:

      See above. The authors developed their own SMS sequence with many features. This is important to the field. And does not leave sequence development work to a few isolated monopoly labs. This work democratises SMS.

      The questions addressed here are of high relevance to the field: getting tools with good sensitivity, user-friendly applicability, and locally specific brain activity mapping is an important topic in the field of layer-fMRI.

      Weaknesses:

      (1) I feel the authors need to justify why flow-crushing helps localization specificity. There is an entire family of recent papers that aim to achieve higher localization specificity by doing the exact opposite. Namely, MT or ABC fMRI aims to increase the localization specificity by highlighting the intravascular BOLD by means of suppressing non-flowing tissue. To name a few:

      Priovoulos, N., de Oliveira, I.A.F., Poser, B.A., Norris, D.G., van der Zwaag, W., 2023. Combining arterial blood contrast with BOLD increases fMRI intracortical contrast. Human Brain Mapping hbm.26227. https://doi.org/10.1002/hbm.26227.

      Pfaffenrot, V., Koopmans, P.J., 2022. Magnetization Transfer weighted laminar fMRI with multi-echo FLASH. NeuroImage 119725. https://doi.org/10.1016/j.neuroimage.2022.119725

      Schulz, J., Fazal, Z., Metere, R., Marques, J.P., Norris, D.G., 2020. Arterial blood contrast ( ABC ) enabled by magnetization transfer ( MT ): a novel MRI technique for enhancing the measurement of brain activation changes. bioRxiv. https://doi.org/10.1101/2020.05.20.106666

      Based on this literature, it seems that the proposed method will make the vein problem worse, not better. The authors could make it clearer how they reason that making GE-BOLD signals more extra-vascular weighted should help to reduce large vein effects.

      The proposed VN fMRI method employs VN gradients to selectively suppress signals from fast-flowing blood in large vessels. Although this approach may initially appear to diverge from the principles of CBV-based techniques (Chai et al., 2020; Huber et al., 2017a; Pfaffenrot and Koopmans, 2022; Priovoulos et al., 2023), which enhance sensitivity to vascular changes in arterioles, capillaries, and venules while attenuating signals from static tissue and large veins, it aligns with the fundamental objective of all layer-specific fMRI methods. Specifically, these approaches aim to maximize spatial specificity by preserving signals proximal to neural activation sites and minimizing contributions from distal sources, irrespective of whether the signals are intra- or extra-vascular in origin. In the context of intravascular signals, CBV-based methods preferentially enhance sensitivity to functional changes in small vessels (proximal components) while demonstrating reduced sensitivity to functional changes in large vessels (distal components). For extravascular signals, functional changes are a mixture of proximal and distal influences. While tissue oxygenation near neural activation sites represents a proximal contribution, extravascular signal contamination from large pial veins reflects distal effects that are spatially remote from the site of neuronal activity. CBV-based techniques mitigate this challenge by unselectively suppressing signals from static tissues, thereby highlighting contributions from small vessels. In contrast, the VN fMRI method employs a targeted suppression strategy, selectively attenuating signals from large vessels (distal components) while preserving those from small vessels (proximal components). Furthermore, the use of a 3T scanner and the inclusion of phase regression in the VN approach mitigate contamination from large pial veins (distal components) while preserving signals reflecting local tissue oxygenation (proximal components). By integrating these mechanisms, VN fMRI improves spatial specificity, minimizing both intravascular and extravascular contributions that are distal to neuronal activation sites. We have incorporated these responses into the Discussion section.

      The empirical evidence for the claim that flow crushing helps with the localization specificity should be made clearer. The response magnitude with and without flow crushing looks pretty much identical to me (see Fig. 6d).

      In the new results in Figure 4, the application of VN gradients attenuated the bias towards the pial surface. Consistent with the results in Figure 4, Figure 5 also demonstrated the suppression of macrovascular signal by VN gradients.

      It's unclear to me what to look for in Fig. 5. I cannot discern any layer patterns in these maps. It's too noisy. The two maps of TE=43ms look like identical copies from each other. Maybe an editorial error?

      In this revision, the original Figure 5 has been removed. However, we would like to clarify that the two maps with TE = 43 ms in the original Figure 5 were not identical. This can be observed in the difference map provided in the right panel of the figure.

      The authors discuss bipolar crushing with respect to SE-BOLD where it has been previously applied. For SE-BOLD at UHF, a substantial portion of the vein signal comes from the intravascular compartment. So I agree that for SE-BOLD, it makes sense to crush the intravascular signal. For GE-BOLD however, this reasoning does not hold. For GE-BOLD (even at 3T), most of the vein signal comes from extravascular dephasing around large unspecific veins, and the bipolar crushing is not expected to help with this.

      The reviewer’s statement that "most of the vein signal comes from extravascular dephasing around large unspecific veins" may hold true for 7T. However, at 3T, the susceptibility-induced Larmor frequency shift is reduced by 57%, and the extravascular contribution decreases by more than 35%, as shown by Uludağ et al. 2009 ( DOI: 10.1016/j.neuroimage.2009.05.051 ).

      Additionally, according to the biophysical models (Ogawa et al., 1993; doi: 10.1016/S0006-3495(93)81441-3 ), the extravascular contamination from the pial surface is inversely proportional to the square of the distance from the vessel. For a vessel diameter of 0.3 mm and an isotropic voxel size of 0.9 mm, the induced frequency shift is reduced by at least 36-fold at the next voxel. Notably, a vessel diameter of 0.3 mm is larger than most pial vessels. Theoretically, the extravascular effect contributes minimally to inter-layer dependency, particularly at 3T compared to 7T due to weaker susceptibility-related effects at lower field strengths. Empirically, as shown in Figure 7c, the results at M1 demonstrated that layer specificity can be achieved statistically with the application of VN gradients. We have incorporated this explanation into the Introduction and Discussion sections of the manuscript.
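
      The two figures quoted in this response (a 57% smaller frequency shift at 3T than at 7T, and a 36-fold fall-off one voxel away) can be reproduced with simple arithmetic. The sketch below assumes that the susceptibility-induced shift scales linearly with field strength, that the fall-off follows (radius / distance)², and that the distance is taken as one voxel width from the vessel centre; the function names are hypothetical.

```python
def field_strength_reduction(b0_low=3.0, b0_high=7.0):
    """Fractional reduction of a shift that scales linearly with B0."""
    return 1.0 - b0_low / b0_high


def extravascular_falloff(vessel_diameter_mm, distance_mm):
    """Fold reduction of the frequency shift at a given distance.

    Assumes the shift outside the vessel scales as (radius / distance)**2.
    """
    radius_mm = vessel_diameter_mm / 2.0
    return (distance_mm / radius_mm) ** 2


# 3T relative to 7T, as a percentage
print(round(100 * field_strength_reduction()))
# 0.3 mm vessel, evaluated 0.9 mm (one voxel) away
print(round(extravascular_falloff(0.3, 0.9)))
```

      Under these assumptions the 36-fold figure corresponds to (0.9 / 0.15)², i.e. the voxel width divided by the vessel radius, squared.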

      (2) The bipolar crushing is limited to one single direction of flow. This introduces a lot of artificial variance across the cortical folding pattern. This is not mentioned in the manuscript. There is an entire family of papers that perform layer-fmri with black-blood imaging that solves this with a 3D contrast preparation (VAPER) that is applied across a longer time period, thus killing the blood signal while it flows across all directions of the vascular tree. Here, the signal crushing is happening with a 2D readout as a "snap-shot" crushing. This does not allow the blood to flow in multiple directions.

      VAPER also accounts for BOLD contaminations of larger draining veins by means of a tag-control sampling. The proposed approach here does not account for this contamination.

      Chai, Y., Li, L., Huber, L., Poser, B.A., Bandettini, P.A., 2020. Integrated VASO and perfusion contrast: A new tool for laminar functional MRI. NeuroImage 207, 116358. https://doi.org/10.1016/j.neuroimage.2019.116358

      Chai, Y., Liu, T.T., Marrett, S., Li, L., Khojandi, A., Handwerker, D.A., Alink, A., Muckli, L., Bandettini, P.A., 2021. Topographical and laminar distribution of audiovisual processing within human planum temporale. Progress in Neurobiology 102121. https://doi.org/10.1016/j.pneurobio.2021.102121

      If I would recommend anyone to perform layer-fMRI with blood crushing, it seems that VAPER is the superior approach. The authors could make it clearer why users might want to use the unidirectional crushing instead.

      We understand the reviewer’s concern regarding the directional limitation of bipolar crushing. As noted in the responses above, we have updated the bipolar gradient to include three orthogonal directions instead of a single direction. Furthermore, flow-related signal suppression does not necessarily require a longer time period. Bipolar diffusion gradients have been effectively used to nullify signals from fast-flowing blood, as demonstrated by Boxerman et al. (1995; DOI: 10.1002/mrm.1910340103). Their study showed that vessels with flow velocities producing phase changes greater than π radians due to bipolar gradients experience significant signal attenuation. The critical velocity for such attenuation can be calculated using the formula 1/(2γGΔδ), where γ is the gyromagnetic ratio, G is the gradient strength, δ is the gradient pulse width and Δ is the time between the two bipolar gradient pulses. In the framework of Boxerman et al. at 1.5T, the critical velocity for a b value of 10 s/mm² is ~8 mm/s, resulting in a ~30% reduction in functional signal. In our 3T study, b values of 6, 7, and 8 s/mm² correspond to critical velocities of 16.8, 15.2, and 13.9 mm/s, respectively. The flow velocities in capillaries and most venules remain well below these thresholds. Notably, in our VN fMRI sequences, bipolar gradients were applied in all three orthogonal directions, whereas in Boxerman et al.'s study, the gradients were applied only in the z-direction. Given the voxel dimensions of 3 × 3 × 7 mm³ in the 1.5T study, vessels within a large voxel are likely oriented in multiple directions, meaning that only a subset of fast-flowing signals would be attenuated. Therefore, our approach is expected to induce greater signal reduction, even at the same b values as those used in Boxerman et al.'s study. We have incorporated this text into the Discussion section of the manuscript.
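
      The critical-velocity expression quoted in this response can be evaluated numerically. The sketch below assumes that the gyromagnetic ratio is expressed in Hz/T (42.577 MHz/T for protons), so that the expression corresponds to a flow-induced phase of π, and that the gradient lobes are ideal rectangular pulses. The gradient strength and timings in the example are hypothetical and are not the authors' protocol values.

```python
import math

GAMMA_BAR_HZ_PER_T = 42.577e6  # proton gyromagnetic ratio divided by 2*pi


def critical_velocity(grad_T_per_m, pulse_width_s, pulse_separation_s):
    """Velocity (m/s) at which the bipolar pair imparts a phase of pi."""
    return 1.0 / (
        2.0 * GAMMA_BAR_HZ_PER_T * grad_T_per_m * pulse_separation_s * pulse_width_s
    )


def flow_phase(velocity_m_per_s, grad_T_per_m, pulse_width_s, pulse_separation_s):
    """Phase (rad) accumulated by spins moving at constant velocity."""
    gamma = 2.0 * math.pi * GAMMA_BAR_HZ_PER_T
    return gamma * grad_T_per_m * pulse_width_s * pulse_separation_s * velocity_m_per_s


def b_value(grad_T_per_m, pulse_width_s, pulse_separation_s):
    """b-value (s/m^2) for a pair of rectangular gradient pulses."""
    gamma = 2.0 * math.pi * GAMMA_BAR_HZ_PER_T
    return (gamma * grad_T_per_m * pulse_width_s) ** 2 * (
        pulse_separation_s - pulse_width_s / 3.0
    )


# Hypothetical settings: 40 mT/m, 5 ms pulse width, 10 ms pulse separation
G, width, separation = 0.04, 0.005, 0.010
v_c = critical_velocity(G, width, separation)
print("critical velocity (mm/s):", v_c * 1e3)
print("b-value (s/mm^2):", b_value(G, width, separation) * 1e-6)
```

      Since the b-value and the critical velocity depend differently on the gradient timings, the mapping between the two is specific to each protocol.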

      (3) The comparison with VASO is misleading.

      The authors claim that previous VASO approaches were limited by TRs of 8.2s. The authors might be advised to check the latest literature of the last years.

      Koiso et al. performed whole brain layer-fMRI VASO at 0.8mm at 3.9 seconds (with reliable activation), 2.7 seconds (with unconvincing activation pattern, though), and 2.3 (without activation).

      Also, whole brain layer-fMRI BOLD at 0.5mm and 0.7mm has been previously performed by the Juelich group at TRs of 3.5s (their TR definition is 'fishy' though).

      Koiso, K., Müller, A.K., Akamatsu, K., Dresbach, S., Gulban, O.F., Goebel, R., Miyawaki, Y., Poser, B.A., Huber, L., 2023. Acquisition and processing methods of whole-brain layer-fMRI VASO and BOLD: The Kenshu dataset. Aperture Neuro 34. https://doi.org/10.1101/2022.08.19.504502

      Yun, S.D., Pais‐Roldán, P., Palomero‐Gallagher, N., Shah, N.J., 2022. Mapping of whole‐cerebrum resting‐state networks using ultra‐high resolution acquisition protocols. Human Brain Mapping. https://doi.org/10.1002/hbm.25855

      Pais-Roldan, P., Yun, S.D., Palomero-Gallagher, N., Shah, N.J., 2023. Cortical depth-dependent human fMRI of resting-state networks using EPIK. Front. Neurosci. 17, 1151544. https://doi.org/10.3389/fnins.2023.1151544

      We thank the reviewer for providing these references. While the protocol with a TR of 3.9 seconds in Koiso’s work demonstrated reasonable activation patterns, it was not tested for layer specificity. Given that higher acceleration factors (AF) can cause spatial blurring, a protocol should only be eligible for comparison if layer specificity is demonstrated.

      Secondly, the TRs reported in Koiso’s study pertain only to either the VASO or BOLD acquisition, not the combined CBV-based contrast. To generate CBV-based images, both VASO and BOLD data are required, effectively doubling the TR. For instance, if the protocol with a TR of 3.9 seconds is used, the effective TR becomes approximately 8 seconds. The stable protocol used by Koiso et al. to acquire whole-brain data (94.08 mm along the z-axis) required 5.2 seconds for VASO and 5.1 seconds for BOLD, resulting in an effective TR of 10.3 seconds. The spatial resolution achieved was 0.84 mm isotropic.

      Unfortunately, we could not find the Juelich paper mentioned by the reviewer.

      To have a more comprehensive comparison, we collated relevant literature on brain-wide layer-specific fMRI. We defined brain-wide acquisition as imaging protocols that cover more than half of the human brain, specifically exceeding 55 mm along the superior-inferior axis. We identified five studies and summarized their scan parameters, including effective TR, coverage, and spatial resolution, in Table 1.

      The authors are correct that VASO is not advised as a turn-key method for lower brain areas, incl. Hippocampus and subcortex. However, the authors use this word of caution that is intended for inexperienced "users" as a statement that this cannot be performed. This statement is taken out of context. This statement is not from the academic literature. It's advice for the 40+ user base that wants to perform layer-fMRI as a plug-and-play routine tool in neuroscience usage. In fact, sub-millimeter VASO is routinely being performed by MRI-physicists across all brain areas (including deep brain structures, hippocampus etc). E.g. see Koiso et al. and an overview lecture from a layer-fMRI workshop that I had recently attended: https://youtu.be/kzh-nWXd54s?si=hoIJjLLIxFUJ4g20&t=2401

      In this revision, we decided to focus on cortico-cortical functional connectivity and have removed the LGN-related content. Consequently, the text mentioned by the reviewer was also removed. Nevertheless, we apologize if our original description gave the impression that functional mapping of deep brain regions using VASO is not feasible. The word of caution we used is based on the layer-fMRI blog ( https://layerfmri.com/2021/02/22/vaso_ve/ ) and reflects the challenges associated with this technique, as outlined by experts like Dr. Huber and Dr. Stirnberg.

      According to the information provided, including the video, functional mapping of the hippocampus and amygdala using VASO is indeed possible but remains technically challenging. The short arterial arrival times in these deep brain regions can complicate the acquisition, requiring RF inversion pulses to cover a wider area at the base of the brain. For example, as of 2023, four or more research groups were attempting to implement layer-fMRI VASO in the hippocampus. One such study at 3T required multiple inversion times to account for inflow effects, highlighting the technical complexity of these applications. This is the context in which we used the word of caution. We are not sure whether recent advancements like MAGEC VASO have improved its applicability. As of 2024, we have not identified any published VASO studies specifically targeting deep brain structures such as the hippocampus or amygdala. Therefore, it is difficult to conclude that “sub-millimeter VASO is routinely being performed by MRI physicists on deep brain structures such as the hippocampus.”

      Thus, the authors could embed this phrasing into the context of their own method that they are proposing in the manuscript. E.g. the authors could state whether they think that their sequence has the potential to be disseminated across sites, considering that it requires slow offline reconstruction in Matlab?

      We are enthusiastic about sharing our imaging sequence, provided its usefulness is conclusively established. However, it's important to note that without an online reconstruction capability, such as the ICE, the practical utility of the sequence may be limited. Unfortunately, we currently don’t have the manpower to implement the online reconstruction. Nevertheless, we are more than willing to share the offline reconstruction codes upon request.

      Do the authors think that the results shown in Fig. 6c are suggesting turn-key acquisition of a routine mapping tool? In my humble opinion, it looks like random noise, with most of the activation outside the ROI (in white matter).

      As we mentioned in the ‘general response’ at the beginning of the rebuttal, the POCS method for partial Fourier reconstruction caused the loss of functional features, potentially accounting for the activation in white matter. In this revision, we have modified the pulse sequence, scan protocol and processing pipelines.

      According to the results in Figure 4, stable activation in M1 was observed at the single-subject level across most scan protocols. Yet, the layer-dependent activation profiles in M1 were spatially unstable, irrespective of the application of VN gradients. This spatial instability is not entirely unexpected, as T2*-based contrast is inherently sensitive to various factors that perturb the magnetic field, such as eye movements, respiration, and macrovascular signal fluctuations. Furthermore, ICA-based artifact removal was intentionally omitted in Figure 4 to ensure fair comparisons between protocols, leaving residual artifacts unaddressed. Inconsistency in performing the button-pressing task across sessions may also have contributed to the observed variability. These results suggest that submillimeter-resolution fMRI may not yet be suitable for reliable individual-level layer-dependent functional mapping, unless group-level statistics are incorporated to enhance robustness. We have incorporated this text into the Limitation section of the manuscript.

      (4) The repeatability of the results is questionable.

      The authors perform experiments about the robustness of the method (line 620). The corresponding results are not suggesting any robustness to me. In fact, the layer profiles in Fig. 4c vs. Fig 4d are completely opposite. The location of peaks turns into locations of dips and vice versa.

      The methods are not described in enough detail to reproduce these results.

      The authors mention that their image reconstruction is done "using in-house MATLAB code" (line 634). They do not post a link to GitHub, nor do they say if they share this code.

      We thank the reviewer for the comments regarding reproducibility and data sharing. In response, we have revised the Methods section and elaborated on the technical details to improve clarity and reproducibility.

      Regarding code sharing, we acknowledge that the current in-house MATLAB reconstruction code requires further refinement to improve its readability and usability. Due to limited manpower, we have not yet been able to complete this task. However, we are committed to making the code publicly available and will upload it to GitHub as soon as the necessary resources are available.

      For data sharing, we face logistical challenges due to the large size of the dataset, which spans tens of terabytes. Platforms like OpenNeuro, for example, typically support datasets up to 10TB, making it difficult to share the data in its entirety. Despite this limitation, we are more than willing to share offline reconstruction codes and raw data upon request to facilitate reproducibility.

      Regarding data robustness, we kindly refer the reviewer to our response to the previous comment, where we addressed these concerns in greater detail.

      It is not trivial to get good phase data for fMRI. The authors do not mention how they perform the respective coil-combination.

      No data are shared for reproduction of the analysis.

      Obtaining phase data is relatively straightforward when the images are retrieved directly from raw data. For coil combination, we employed the adaptive coil combination approach described by Walsh et al. (DOI: 10.1002/(sici)1522-2594(200005)43:5<682::aid-mrm10>3.0.co;2-g). The MATLAB code for this implementation was developed by Dr. Diego Hernando and is publicly available at https://github.com/welton0411/matlab .

      (5) The application of NORDIC is not validated.

      Previous applications of NORDIC at 3T layer-fMRI have resulted in mixed success. When not adjusted for the right SNR regime it can result in artifactual reductions of beta scores, depending on the SNR across layers. The authors could validate their application of NORDIC and confirm that the average layer-profiles are unaffected by the application of NORDIC. Also, the NORDIC version should be explicitly mentioned in the manuscript.

      Akbari, A., Gati, J.S., Zeman, P., Liem, B., Menon, R.S., 2023. Layer Dependence of Monocular and Binocular Responses in Human Ocular Dominance Columns at 7T using VASO and BOLD (preprint). Neuroscience. https://doi.org/10.1101/2023.04.06.535924

      Knudsen, L., Guo, F., Huang, J., Blicher, J.U., Lund, T.E., Zhou, Y., Zhang, P., Yang, Y., 2023. The laminar pattern of proprioceptive activation in human primary motor cortex. bioRxiv. https://doi.org/10.1101/2023.10.29.564658

      We appreciate the reviewer’s suggestion. To validate the application of NORDIC denoising in our study, we compared the BOLD activation maps before and after denoising in the visual and motor cortices, as well as the depth-dependent activation profiles in M1. These results are presented in Figure 3. The activation patterns in the denoised maps were consistent with those in the non-denoised maps but exhibited higher statistical significance. Notably, BOLD activation within M1 was only observed after NORDIC denoising, underscoring the necessity of this approach. Figure 3c shows the depth-dependent activation profiles in M1, highlighted by the green contours in Figure 3b. Both denoised and non-denoised profiles followed similar trends; however, as expected, the non-denoised profile exhibited larger confidence intervals compared to the NORDIC-denoised profile. These results confirm that NORDIC denoising enhances sensitivity without introducing distortions in the functional signal. The corresponding text has been incorporated into the Results section.

      Regarding the implementation details of NORDIC denoising, the reconstructed images were denoised using a g-factor map (function name: NIFTI_NORDIC). The g-factor map was estimated from the image time series, and the input images were complex-valued. The width of the smoothing filter for the phase was set to 10, while all other hyperparameters were retained at their default values. This information has been integrated into the Methods section for clarity and reproducibility.

      Reviewer #2 (Public Review):

      This study developed a setup for laminar fMRI at 3T that aimed to get the best from all worlds in terms of brain coverage, temporal resolution, sensitivity to detect functional responses, and spatial specificity. They used a gradient-echo EPI readout to facilitate sensitivity, brain coverage and temporal resolution. The former was additionally boosted by NORDIC denoising and the latter two were further supported by parallel-imaging acceleration both in-plane and across slices. The authors evaluated whether the implementation of velocity-nulling (VN) gradients could mitigate macrovascular bias, known to hamper the laminar specificity of gradient-echo BOLD.

      The setup allows for 0.9 mm isotropic acquisitions with large coverage at a reasonable TR (at least for block designs) and the fMRI results presented here were acquired within practical scan-times of 12-18 minutes. Also, in terms of the availability of the method, it is favorable that it benefits from lower field strength (additional time for VN-gradient implementation, afforded by longer gray matter T2*).

      The well-known double peak feature in M1 during finger tapping was used as a test-bed to evaluate the spatial specificity. They were indeed able to demonstrate two distinct peaks in group-level laminar profiles extracted from M1 during finger tapping, which was largely free from superficial bias. This is rather intriguing as, even at 7T, clear peaks are usually only seen with spatially specific non-BOLD sequences. This is in line with their simple simulations, which nicely illustrated that, in theory, intravascular macrovascular signals should be suppressible with only minimal suppression of microvasculature when small b-values of the VN gradients are employed. However, the authors do not state how ROIs were defined making the validity of this finding unclear; were they defined from independent criteria or were they selected based on the region mostly expressing the double peak, which would clearly be circular? In any case, results are based on a very small sub-region of M1 in a single slice - it would be useful to see the generalizability of superficial-bias-free BOLD responses across a larger portion of M1.

      We appreciate and understand the reviewer’s concerns. Given the small size of the hand knob region within M1 and its intersubject variability in location, defining this region automatically remains challenging. However, we applied specific criteria to minimize bias during the delineation of M1: 1) the hand knob region was required to be anatomically located in the precentral sulcus or gyrus; 2) it needed to exhibit consistent BOLD activation across the majority of testing conditions; and 3) the region was expected to show BOLD activation in the deep cortical layers under the condition of b = 0 and TE = 30 ms. Once the boundaries across cortical depth were defined, the gray matter boundaries of the hand knob region were delineated based on the T1-weighted anatomical image and the cortical ribbon mask, without reference to the BOLD activation map, to minimize potential bias in manual delineation. Based on the new criteria, the resulting depth-dependent profiles, as shown in Figure 4, are no longer superficial-bias-free.

      As repeatedly mentioned by the authors, a laminar fMRI setup must demonstrate adequate functional sensitivity to detect (in this case) BOLD responses. The sensitivity evaluation is unfortunately quite weak. It is mainly based on the argument that significant activation was found in a challenging sub-cortical region (LGN). However, it was a single participant, the activation map was not very convincing, and the demonstration of significant activation after considerable voxel-averaging is inadequate evidence to claim sufficient BOLD sensitivity. How well sensitivity is retained in the presence of VN gradients, high acceleration factors, etc., is therefore unclear. The ability of the setup to obtain meaningful functional connectivity results is reassuring, yet, more elaborate comparison with e.g., the conventional BOLD setup (no VN gradients) is warranted, for example by comparison of tSNR, quantification and comparison of CNR, illustration of unmasked-full-slice activation maps to compare noise-levels, comparison of the across-trial variance in each subject, etc. Furthermore, as NORDIC appears to be a cornerstone to enable submillimeter resolution in this setup at 3T, it is critical to evaluate its impact on the data through comparison with non-denoised data, which is currently lacking.

      We appreciate the reviewer’s comments and acknowledge that the LGN results from a single participant were not sufficiently convincing. In this revision, we have removed the LGN-related results and focused on cortico-cortical FC. To evaluate data quality, we opted to present BOLD activation maps rather than tSNR, as high tSNR does not necessarily translate to high functional significance. In Figure 3, we illustrate the effect of NORDIC denoising, including activation maps and depth-dependent profiles. Figure 4 presents activation maps acquired under different TE and b values, demonstrating that VN gradients effectively reduce the bias toward the pial surface without altering the overall activation patterns. The results in Figure 4 and Figure 5 provide evidence that VN gradients retain sensitivity while reducing superficial bias. The ability of the setup to obtain meaningful FC results was validated through seed-based analyses, identifying distinct connectivity patterns in the superficial and deep layers of the primary motor cortex (M1), with significant inter-layer differences (see Figure 7). Further analyses with a seed in the primary sensory cortex (S1) demonstrated the reliability of the method (see Figure 8). For further details on the results, including the impact of VN gradients and NORDIC denoising, please refer to Figures 3 to 8 in the Results section.

      Additionally, we acknowledge the limitations of our current protocol for submillimeter-resolution fMRI at the individual level. We found that robust layer-dependent functional mapping often requires group-level statistics to enhance reliability. This issue has been discussed in detail in the Limitations section.

      The proposed setup might potentially be valuable to the field, which is continuously searching for techniques to achieve laminar specificity in gradient echo EPI acquisitions. Nonetheless, the above considerations need to be tackled to make a convincing case.

      Reviewer #3 (Public Review):

      Summary:

      The authors are looking for a spatially specific functional brain response to visualise non-invasively with 3T (clinical field strength) MRI. They propose a velocity-nulled weighting to remove the signal from draining veins in a submillimeter multiband acquisition.

      Strengths:

      - This manuscript addresses a real need in the cognitive neuroscience community interested in imaging responses in cortical layers in-vivo in humans.

      - An additional benefit is the proposed implementation at 3T, a widely available field strength.

      Weaknesses:

      - Although the VASO acquisition is discussed in the introduction section, the VN-sequence seems closer to diffusion-weighted functional MRI. The authors should make it more clear to the reader what the differences are, and how results are expected to differ. Generally, it is not so clear why the introduction is so focused on the VASO acquisition (which, curiously, lacks a reference to Lu et al 2013). There are many more alternatives to BOLD-weighted imaging for fMRI. CBF-weighted ASL and GRASE have been around for a while, ABC and double-SE have been proposed more recently.

      The major distinction between diffusion-weighted fMRI (DW-fMRI) and our methodology lies in the b-value employed. DW-fMRI typically measures cellular swelling using b-values greater than 1000 s/mm<sup>2</sup> (e.g., 1800 s/mm<sup>2</sup>). In contrast, our VN-fMRI approach measures hemodynamic responses by employing smaller b-values specifically designed to suppress signals from fast-flowing draining veins rather than detecting microstructural changes.

      Regarding other functional contrasts, we agree that more layer-dependent fMRI approaches should be mentioned. In this revision, we have expanded the Introduction section to include discussions of the double spin-echo approach, CBV-based methods such as MT-weighted fMRI, VAPER, and ABC, and the CBF-based method ASL. Additionally, Lu et al. (2013) is now cited in the revised manuscript. The corresponding text has been incorporated into the Introduction section to provide a more comprehensive overview of alternative functional imaging techniques.

      - The comparison in Figure 2 for different b-values shows % signal changes. However, as the baseline signal changes dramatically with added diffusion weighting, this is rather uninformative. A plot of t-values against cortical depth would be much more insightful.

      - Surprisingly, the %-signal change for a b-value of 0 is not significantly different from 0 in the gray matter. This raises some doubts about the task or ROI definition. A finger-tapping task should reliably engage the primary motor cortex, even at 3T, and even in a single participant.

      - The BOLD weighted images in Figure 3 show a very clear double-peak pattern. This contradicts the results in Figure 2 and is unexpected given the existing literature on BOLD responses as a function of cortical depth.

      - Given that data from Figures 2, 3, and 4 are derived from a single participant each, order and attention effects might have dramatically affected the observed patterns. Especially for Figure 4, neither BOLD nor VN profiles are really different from 0, and without statistical values or inter-subject averaging, these cannot be used to draw conclusions from.

      We appreciate the reviewer’s suggestions. In this revision, we have made significant updates to the participant recruitment, scan protocol, data processing, and M1 delineation. Please refer to the "General Responses" at the beginning of the rebuttal and the first response to Reviewer #2 for more details.

      Previously, the variation in depth-dependent profiles was calculated across upscaled voxels within a specific layer. However, due to the small size of the hand knob region, the number of within-layer voxels was limited, resulting in inaccurate estimations of signal variation. In the revised manuscript, the signal was averaged within each layer before performing the GLM analysis, and signal variation was calculated using the temporal residuals. The technical details of these changes are described in the "Materials and Methods" section. Furthermore, while the initial submission used percentage signal change for the profiles of M1, the dramatic baseline fluctuations observed previously are no longer an issue after the modifications. For this reason, we retained the use of percentage signal change to present the depth-dependent profiles. After these adjustments, the profiles exhibited a bias toward the pial surface, particularly in the absence of VN gradients.
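      The revised procedure described above (average within each layer first, then fit the GLM and take the variation from the temporal residuals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the single boxcar regressor, the toy data, and the function names `layer_mean_timeseries` and `fit_boxcar_glm` are all hypothetical, and a real analysis would typically use an HRF-convolved design matrix with nuisance regressors.

```python
import math

def layer_mean_timeseries(voxel_series):
    """Average equal-length voxel time series belonging to one cortical layer."""
    n_vox = len(voxel_series)
    return [sum(samples) / n_vox for samples in zip(*voxel_series)]

def fit_boxcar_glm(signal, task):
    """OLS fit of signal = baseline + beta * task (assumed single-regressor model).

    Returns baseline, beta, the standard error of beta estimated from the
    temporal residuals, and the percent signal change relative to baseline.
    """
    n = len(signal)
    mean_x = sum(task) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in task)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(task, signal))
    beta = sxy / sxx
    baseline = mean_y - beta * mean_x
    residuals = [y - (baseline + beta * x) for x, y in zip(task, signal)]
    resid_var = sum(r * r for r in residuals) / (n - 2)  # two fitted parameters
    se_beta = math.sqrt(resid_var / sxx)
    return baseline, beta, se_beta, 100.0 * beta / baseline

# Toy example (made-up numbers): two voxels in one layer, 8 time points.
task = [0, 0, 1, 1, 0, 0, 1, 1]
layer_voxels = [
    [99, 101, 102, 104, 99, 101, 102, 104],
    [101, 101, 102, 102, 101, 101, 102, 102],
]
layer_signal = layer_mean_timeseries(layer_voxels)
baseline, beta, se_beta, psc = fit_boxcar_glm(layer_signal, task)
```

      Because the error estimate comes from the residuals of the layer-averaged time series, it no longer depends on how many voxels happen to fall within a layer, which is the limitation of the across-voxel estimate noted above.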

      - In Figure 5, a phase regression is added to the data presented in Figure 4. However, for a phase regression to work, there has to be a (macrovascular) response to start with. As none of the responses in Figure 4 are significant for the single participant dataset, phase regression should probably not have been undertaken. In this case, the functional 'responses' appear to increase with phase regression, which is counterintuitive and deserves an explanation.

      We agree with the reviewer’s argument. In the revised results, the issues mentioned by the reviewer are largely diminished. The updated analyses demonstrate that phase regression effectively reduces superficial bias, as shown in Figures 4 and 5.

      - Consistency of responses is indeed expected to increase by removal of the more variable vascular component. However, the microvascular component is always expected to be smaller than the combination of microvascular + macrovascular responses. Note that the use of %signal changes may obscure this effect somewhat because of the modified baseline. Another expected feature of BOLD profiles containing both micro- and macrovasculature is the draining towards the cortical surface. In the profiles shown in Figure 7, this is completely absent. In the group data, no significant responses to the task are shown anywhere in the cortical ribbon.

      We agree with the reviewer’s comments. In the revised manuscript, the results have been substantially updated to address the concerns raised. The original Figure 7 is no longer relevant and has been removed.

      - Although I'd like to applaud the authors for their ambition with the connectivity analysis, I feel that acquisitions that are so SNR starved as to fail to show a significant response to a motor task should not be used for brain wide directed connectivity analysis.

      We appreciate the reviewer’s comments and share the concern about SNR limitations. In the updated results presented in Figure 5, the activation patterns in the visual cortex were consistent across TEs and b values. At the motor cortex, stable activation in M1 was observed at the single-subject level across most scan protocols. However, the layer-dependent activation profiles in M1 exhibited spatial instability, irrespective of the application of VN gradients. This spatial instability is not entirely unexpected, as T2*-based contrast is inherently sensitive to factors that perturb the magnetic field, such as eye movements, respiration, and macrovascular signal fluctuations. Additionally, ICA-based artifact removal was intentionally omitted in Figure 4 to ensure fair comparisons across protocols, leaving some residual artifacts unaddressed. Variability in task performance during button-pressing sessions may have further contributed to the observed inconsistencies.

      Although these findings suggest that submillimeter-resolution fMRI may not yet be reliable for individual-level layer-dependent functional mapping, the group-level FC analyses can still yield robust results. In Figure 7, group-level statistics revealed distinct functional connectivity (FC) patterns associated with superficial and deep layers in M1. These FC maps exhibited significant differences between layers, demonstrating that VN fMRI enhances inter-layer independence. Additional FC analyses with a seed placed in S1 further validated these findings (see Figure 8).

      The claim of specificity is supported by the observation of the double-peak pattern in the motor cortex, previously shown in multiple non-BOLD studies. However, this same pattern is shown in some of the BOLD weighted data, which seems to suggest that the double-peak pattern is not solely due to the added velocity nulling gradients. In addition, the well-known draining towards the cortical surface is not replicated for the BOLD-weighted data in Figures 3, 4, or 7. This casts some doubt on whether the data actually have the SNR needed to draw conclusions about the observed patterns.

      We appreciate the reviewer’s comments. In the updated results, the efficacy of the VN gradients is evident near the pial surface, as shown in Figures 4 and 5. In Figure 4, comparing the second and third columns (b = 0 and b = 6 s/mm<sup>2</sup>, respectively, at TE = 38 ms), the percentage signal change in the superficial layers is generally lower with b = 6 s/mm<sup>2</sup> than with b = 0. This indicates that VN gradient-induced signal suppression is more pronounced in the superficial layers. Additionally, in Figure 5, the VN gradients effectively suppressed macrovascular signals, as highlighted by the blue circles. These observations support the role of VN gradients in enhancing specificity by reducing superficial bias and macrovascular contamination. Furthermore, a bias toward the cortical surface was observed in the updated results in Figure 4.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) L141: "depth dependent" is slightly misleading here. It could be misunderstood to suggest that the authors are assessing how spatial specificity varies as a function of depth. Rather, they are assessing spatial specificity based on depth-dependent responses (double peak feature). Perhaps "layer-dependent spatial specificity" could be substituted with laminar specificity?

      We thank the reviewer for the suggestion. The term “depth dependent” has been replaced by “layer dependent” in the revised manuscript.

      (2) L146-149: these do not validate spatial specificity.

      The original text is removed.

      (3) L180: Maybe helpful to describe what the b-value is to assist unfamiliar readers.

      We have clarified the b-value as “the strength of the bipolar diffusion gradients” where it is first mentioned in the manuscript.

      (4) Figure 1B: I think it would be appropriate with a sentence of how the authors define micro/macrovasculature. Figure 1B seems to suggest that large ascending veins are considered microvascular which I believe is a bit unconventional. Nevertheless, as long as it is clearly stated, it should be fine.

      In our context, macrovasculature refers to vessels that are distal to neural activation sites and contribute to extravascular contamination. These vessels are typically larger in size (e.g., > 0.1 mm in diameter) and exhibit faster flow rates (e.g., > 10 mm/s).

      (5) I think the authors could be more upfront with the point about non-suppressed extravascular effects from macrovasculature, which was briefly mentioned in the discussion. It could already be highlighted in the introduction or theory section.

      We thank the reviewer for the suggestion. We have expanded the discussion of extravascular effects from macrovasculature in both the Introduction (5th paragraph) and Discussion (3rd paragraph) sections.

      (6) The phase regression figure feels a bit misplaced to me. If the authors agree: rather than showing the TE-dependency of the effect of phase regression, it may be more relevant for the present study to compare the conventional setup with phase regression, with the VN setup without phase regression. I.e., to show how the proposed setup compares to existing 3T laminar fMRI studies.

      In this revision, both the TE-dependent and VN-dependent effects of phase regression were investigated. The results in Figure 4 and Figure 5 demonstrated that phase regression effectively suppresses macrovascular contributions primarily near the gray matter/CSF boundary, irrespective of TE or the presence of VN gradients.

      (7) L520: It might be beneficial to also cite the large body of other laminar studies showing the double peak feature to underscore that it is highly robust, which increases its relevance as a test-bed to assess spatial specificity.

      We agree. Additional studies have been cited (Chai et al., 2020; Huber et al., 2017a; Knudsen et al., 2023; Priovoulos et al., 2023).

      (8) L557: The argument that only one participant was assessed to reduce inter-subject variability is hard to buy. If significant variability exists across subjects, this would be highly relevant to the authors and something they would want to capture.

      We thank the reviewer for the suggestions. In this revision, we have increased the number of participants to 4 for protocol development and 14 for resting-state functional connectivity analysis, allowing us to better assess and account for inter-subject variability.

      (9) L637: add download link and version number.

      The download link has been added as requested. The version number is not applicable.

      (10) L638: How was the phase data coil-combined?

      The reconstructed multi-channel data, which were of complex values, were combined using the adaptive combination method (Walsh et al.; DOI: 10.1002/(sici)1522-2594(200005)43:5<682::aid-mrm10>3.0.co;2-g). The MATLAB code for this implementation was developed by Dr. Diego Hernando and is publicly available at https://github.com/welton0411/matlab . The phase data were then extracted using the MATLAB function ‘angle’.

      (11) L639: Why was the smoothing filter parameter changed (other parameters were default)?

      The smoothing filter parameter was set based on the suggestion provided in the help comments of the NIFTI_NORDIC function:

      function  NIFTI_NORDIC(fn_magn_in,fn_phase_in,fn_out,ARG)

      % fMRI

      %

      %  ARG.phase_filter_width=10;

      In other words, we simply followed the recommendation outlined in the NIFTI_NORDIC function’s documentation.

      (12) I assume the phase data was motion corrected after transforming to real and imaginary components and using parameters estimated from magnitude data? Maybe add a few sentences about this.

      Prior to phase regression, the time series of real and imaginary components were subjected to motion correction, followed by phase unwrapping. The phase regression was incorporated early in the data processing pipeline to minimize the discrepancy in data processing between magnitude and phase images (Stanley et al., 2021).
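      The step from motion-corrected real and imaginary components back to an unwrapped phase time series can be sketched as below. This is an illustration only: whether the authors unwrapped in time or in space is not stated, so temporal (1-D) unwrapping is an assumption here, and the function names are hypothetical. `math.atan2` plays the role of the MATLAB `angle` function for a single complex sample.

```python
import math

def phase_from_components(real, imag):
    """Phase in radians, within [-pi, pi], for each (real, imag) sample."""
    return [math.atan2(i, r) for r, i in zip(real, imag)]

def unwrap_in_time(wrapped):
    """Remove 2*pi jumps between consecutive samples of one voxel's phase.

    Assumes the true phase changes by less than pi between time points.
    """
    two_pi = 2.0 * math.pi
    unwrapped = [wrapped[0]]
    for phase in wrapped[1:]:
        step = phase - unwrapped[-1]
        step -= two_pi * round(step / two_pi)  # bring the step into [-pi, pi]
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped

# Toy example: a phase ramp of 1 rad per time point that wraps past pi.
true_phase = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
real_part = [math.cos(p) for p in true_phase]
imag_part = [math.sin(p) for p in true_phase]
wrapped_phase = phase_from_components(real_part, imag_part)
recovered_phase = unwrap_in_time(wrapped_phase)
```

      Applying motion correction to the real and imaginary parts, rather than to the phase itself, avoids interpolating across 2π discontinuities, which is one reason the components are resampled first and the phase is computed afterwards.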

      (13) Was phase regression applied with e.g., a deming model, which accounts for noise on both the x and y variable? In my experience, this makes a huge difference compared with regular OLS.

      We appreciate the reviewer’s insightful comment. We are aware that noise is present in both the magnitude and phase data, and that linear Deming regression would therefore be a good fit for phase regression (Stanley et al., 2021). To perform Deming regression, however, the ratio of magnitude error variance to phase error variance must be predefined. In our initial tests, we found that the regression results were sensitive to this ratio. To avoid potential confounding, we opted to use OLS regression for the current analysis. However, we agree that a Deming model could enhance the efficacy of phase regression if the ratio could be determined objectively and properly.
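      To make the dependence on the variance ratio concrete, the sketch below compares the OLS slope with the standard closed-form Deming slope when magnitude (y) is regressed on phase (x). It is a generic textbook illustration, not the authors' code: `delta` denotes the assumed ratio of magnitude error variance to phase error variance, and the function names and toy values are hypothetical.

```python
import math

def _second_moments(x, y):
    """Sample variances of x and y and their covariance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return sxx, syy, sxy

def ols_slope(x, y):
    """Ordinary least squares slope: all noise is attributed to y."""
    sxx, _, sxy = _second_moments(x, y)
    return sxy / sxx

def deming_slope(x, y, delta):
    """Deming slope; delta = var(error in y) / var(error in x).

    Requires a non-zero covariance between x and y.
    """
    sxx, syy, sxy = _second_moments(x, y)
    diff = syy - delta * sxx
    return (diff + math.sqrt(diff * diff + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

# Toy example (made-up values): phase on x, magnitude on y.
phase = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
magnitude = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
slope_ols = ols_slope(phase, magnitude)
slopes_deming = {d: deming_slope(phase, magnitude, d) for d in (0.1, 1.0, 10.0)}
```

      As `delta` grows, the Deming slope approaches the OLS slope, so OLS corresponds to the limiting case in which the phase regressor is treated as noise-free. The fitted slope therefore shifts with the chosen ratio, consistent with the sensitivity described in the response above.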

      (14) Figure 2: What is error bar reflecting? I don't think the across-voxel error, as also used in Figure 4, is super meaningful as it assumes the same response of all voxels within a layer (might be alright for such a small ROI). Would it be better to e.g. estimate single-trial response magnitude (percent signal change) and assess variability across? Also, it is not obvious to me why b=30 was chosen. The authors argue that larger values may kill signal, but based on this Figure in isolation, b=48 did not have smaller response magnitudes (larger if anything).

      We agree with the reviewer’s opinion on the across-voxel error. In the revised manuscript, the signal was averaged within each layer before performing the GLM analysis, and signal variation was calculated using the temporal residuals. The technical details of these changes are described in the "Materials and Methods" section.

      Additionally, the bipolar diffusion gradients were modified from a single direction to three orthogonal directions. As a result, the questions and results related to b=30 or b=48 are no longer applicable.

      (15) Figure 5: would be informative to quantify the effect of phase regression over a large ROI and evaluate reduction in macrovascular influence from superficial bias in laminar profiles.

      We appreciate the reviewer’s suggestion. In the revised manuscript, the reduction in macrovascular influence from superficial bias across a large ROI is displayed in Figure 5. Additionally, the impact on laminar profiles is demonstrated in Figure 4.

      (16) L406-408: What kind of robustness?

      We acknowledge that describing the protocol as “robust” was an overstatement. The updated results indicate that the current protocol for submillimeter fMRI may not yet be suitable for reliable individual-level layer-dependent functional mapping. However, group-level functional connectivity (FC) analyses demonstrated clear layer-specific distinctions with VN fMRI, which were not evident in conventional fMRI. These findings highlight the enhanced layer specificity achievable with VN fMRI.

      (17) Figure 8: I think C) needs pointers to superficial, middle, and deep layers? Why is it not in the same format as in Figure 9C? The discussion of the FC results could benefit from more references supporting that these observations are in line with the literature.

      In the revised results, the layer pooling shown in Figure 9c has been removed, making the question regarding format alignment no longer applicable. Additionally, references supporting the FC results have been added to the revised Discussion section (7th paragraph).

      (18) L456-457: But correlation coefficients may also be biased by different CNR across layers.

      That is correct. In the updated FC results in Figures 7 to 9, we used group-level statistics rather than correlation coefficients.

      Reviewer #3 (Recommendations For The Authors):

      The results in Figure 2-6 should be repeated over, or averaged over, a (small) group of participants. N=6 is usual in this field. I would seriously reconsider the multiband acceleration - the acquisition seemingly cannot support the SNR hit.

      A few more specific points are given below:

      (1) Abstract: The sentence about LGN in the abstract came for me out of the blue - why would LGN be important here, it's not even a motor network node? Perhaps the aims of the study should be made more clear - if it's about networks as suggested earlier then a network analysis result would be expected too. Expanding the directed FC findings would improve the logical flow of the abstract. Given the many concerns, removing the connectivity analysis altogether would also be an option.

      We thank the reviewer for the suggestions. The LGN-related results indeed diluted the focus of this study and have been completely removed in this revision.

      (2) Line 105: in addition to the VASO method, ..

      The corresponding text has been revised, and as a result, the reviewer’s suggestion is no longer applicable.

      (3) If out of the set MB 4 / 5 / 6 MB4 was best, why did the authors not continue with a comparison including MB3 and MB2? It seems to me unlikely that the MB4 acquisition is actually optimal.

      We appreciate the reviewer’s suggestions. In this revision, we decreased the MB factor to 3, as it allowed us to increase the in-plane acceleration rate to 3, thereby shortening the TE. The resulting sensitivity for both individual and group-level results is detailed in earlier responses, such as the response to Q16 for Reviewer #2.

      (4) The formatting of the references is occasionally flawed, including first names and/or initials. Please consider using a reliable reference manager.

      We used Zotero as our reference manager in this revision to ensure consistency and accuracy. The references have been formatted according to the APA style.

      (5) In the caption of Figure 5, corrected and uncorrected p values are identical. What multiple comparisons correction was made here? A multiple comparisons correction over voxels (as is standard) would usually lead to a cut-off ~z=3.2. That would remove most of the 'responses' shown in Figure 5.

      We appreciate the reviewer’s comment. The original results presented in Figure 5 have been removed in the revised manuscript, making this comment no longer applicable.

    1. eLife Assessment

      In this useful study, Millard et al. assessed the effects of nicotine on pain sensitivity and peak alpha frequency (PAF). The evidence shown is incomplete to support the key claim that nicotine modulates PAF or pain sensitivity, considering the effect sizes observed. This raises the question of whether the chosen experimental intervention was the most suitable approach for investigating their research question. Nonetheless, the work can be incorporated into the literature investigating the relationship between nicotine and pain, and could be of broad interest to pain researchers.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, Millard and colleagues investigated if the analgesic effect of nicotine on pain sensitivity, assessed with two pain models, is mediated by Peak Alpha Frequency (PAF) recorded with resting state EEG. The authors found indeed that nicotine (4 mg, gum) reduced pain ratings during phasic heat pain but not cuff pressor algometry compared to placebo conditions. Nicotine also increased PAF (globally). However, mediation analysis revealed that the reduction in pain ratings elicited by the phasic heat pain after taking nicotine was not mediated by the changes in PAF. Also, the authors only partially replicated the correlation between PAF and pain sensitivity at baseline (before nicotine treatment). At the group-level no correlation was found, but an exploratory analysis showed that the negative correlation (lower PAF, higher pain sensitivity) was present in males but not in females. The authors discuss the lack of correlation.

      In general, the study is rigorous, methodology is sound and the paper is well written. Results are compelling and sufficiently discussed.

      Strengths:

      Strengths of this study are the pre-registration, proper sample size calculation and data analysis. But also the presence of the analgesic effect of nicotine and the change in PAF.

      Weaknesses:

      It would even be more convincing if they had manipulated PAF directly.

    3. Reviewer #2 (Public review):

      Summary:

      The study by Millard et al. investigates the effect of nicotine on alpha peak frequency and pain in a very elaborate experimental design. According to the statistical analysis, the authors found a factor-corrected significant effect for prolonged heat pain but not for alpha peak frequency in response to the nicotine treatment.

      Strengths:

      I very much like the study design and that the authors followed their research line by aiming to provide a complete picture of the pain-related cortical impact of alpha peak frequency. This is very important work, even in the absence of any statistical significance. I also appreciate the preregistration of the study and the well-written and balanced introduction.

      Weaknesses:

      The weakness of the study revolves around two aspects:

      (1) Source separation (ICA or similar) would have been more appropriate than electrode ROIs to extract the alpha signal. By using a source separation approach, different sources of alpha (mu, occipital alpha, laterality) could be disentangled.

      (2) There is also a suggestion in the literature (cited in the manuscript) that nicotine treatment may not work as intended. Instead, the authors' decision to use nicotine to modulate peak alpha frequency and pain was based on other, inappropriate work on chronic pain and chronic smokers. In the present study, the authors use nicotine treatment and transient painful stimulation in nonsmokers. The unfortunate decision to use nicotine severely hampered the authors' goal of the study.

      Impact: The impact of the study could be to show what did not work to answer the authors' research questions. The study would have more impact with a more appropriate pain intervention model and an analysis strategy that untangles the different alpha sources.

    4. Reviewer #3 (Public review):

      In this manuscript, Millard et al. investigate the effects of nicotine on pain sensitivity and peak alpha frequency (PAF) in resting state EEG. To this end, they ran a randomized, double-blind, placebo-controlled experiment involving 62 healthy adults that received either 4 mg nicotine gum (n=29) or placebo (n=33). Prolonged heat and pressure were used as pain models. Resting state EEG and pain intensity (assessed with a visual analog scale) were measured before and after the intervention. Additionally, several covariates (sex at birth, depression and anxiety symptoms, stress, sleep quality, among others) were recorded. Data was analyzed using ANCOVA-equivalent two-wave latent change score models, as well as repeated measures analysis of variance. Results do not show experimentally relevant changes of PAF or pain intensity scores for either of the prolonged pain models due to nicotine intake.

      The main strengths of the manuscript are its solid conceptual framework and the thorough experimental design. The researchers make a good case in the introduction and discussion for the need to further investigate the association of PAF and pain sensitivity. Furthermore, they proceed to carefully describe every aspect of the experiment in great detail, which is excellent for reproducibility purposes. Finally, they analyze the data from different angles and provide an extensive report of their results.

      There are relevant weaknesses to highlight. Firstly, the authors preregistered the study and the analysis plan, but the preregistration does not contain an estimation of the expected effect sizes or the rationale for the selected sample size. Furthermore, the authors interpret their results in a way that is not supported by the evidence (which is especially evident in the abstract and the first paragraph of the discussion). Even though some of the differences are statistically significant (e.g., global PAF, pain intensity ratings during heat pain), these differences are far from being experimentally or clinically relevant. The effect sizes observed are not sufficiently large to consider that pain sensitivity was modulated by the nicotine intake, which puts into question all the answers to the research questions posed in the study. The authors attempt to nuance this throughout the discussion, but in a way that is not compatible with the main claims.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Millard and colleagues investigated if the analgesic effect of nicotine on pain sensitivity, assessed with two pain models, is mediated by Peak Alpha Frequency (PAF) recorded with resting state EEG. The authors found indeed that nicotine (4 mg, gum) reduced pain ratings during phasic heat pain but not cuff pressor algometry compared to placebo conditions. Nicotine also increased PAF (globally). However, mediation analysis revealed that the reduction in pain ratings elicited by the phasic heat pain after taking nicotine was not mediated by the changes in PAF. Also, the authors only partially replicated the correlation between PAF and pain sensitivity at baseline (before nicotine treatment). At the group-level no correlation was found, but an exploratory analysis showed that the negative correlation (lower PAF, higher pain sensitivity) was present in males but not in females. The authors discuss the lack of correlation.

      In general, the study is rigorous, methodology is sound and the paper is well-written. Results are compelling and sufficiently discussed.

      Strengths:

      Strengths of this study are the pre-registration, proper sample size calculation, and data analysis. But also the presence of the analgesic effect of nicotine and the change in PAF.

      Weaknesses:

      It would even be more convincing if they had manipulated PAF directly.

      We thank Reviewer #1 for their positive and constructive comments regarding our study. We appreciate the view that the study was rigorous and methodologically sound, that the paper was well-written, and that the strengths included our pre-registration, sample size calculation, and data analysis.

      In response to the reviewer's comment about more directly manipulating Peak Alpha Frequency (PAF), we agree that such an approach could provide a more direct investigation of the role of PAF in pain processing. We chose nicotine to modulate PAF as the literature suggested it was associated with a reliable increase in PAF speed. As mentioned in our Discussion, there are several alternative methods to manipulate PAF, such as non-invasive brain stimulation techniques (NIBS) like transcranial alternating current stimulation (tACS) or neurofeedback training. These approaches could help clarify whether a causal relationship exists between PAF and pain sensitivity. However, methods such as NIBS still require further investigation, as there is little evidence that these approaches change PAF (Millard et al., 2024).

      Reviewer #2 (Public Review):

      Summary:

      The study by Millard et al. investigates the effect of nicotine on alpha peak frequency and pain in a very elaborate experimental design. According to the statistical analysis, the authors found a factor-corrected significant effect for prolonged heat pain but not for alpha peak frequency in response to the nicotine treatment.

      Strengths:

      I very much like the study design and that the authors followed their research line by aiming to provide a complete picture of the pain-related cortical impact of alpha peak frequency. This is very important work, even in the absence of any statistical significance. I also appreciate the preregistration of the study and the well-written and balanced introduction. However, it is important to give access to the preregistration beforehand.

      Weaknesses:

      The weakness of the study revolves around three aspects:

      (1) I am not entirely convinced that the authors' analysis strategy provides a sufficient signal-to-noise ratio to estimate the peak alpha frequency in each participant reliably. A source separation (ICA or similar) would have been better suited than electrode ROIs to extract the alpha signal. By using a source separation approach, different sources of alpha (mu, occipital alpha, laterality) could be disentangled.

      (2) Also, there's a hint in the literature (reference 49 in the manuscript) that the nicotine treatment may not work as intended. Instead, the authors' decision to use nicotine to modulate the peak alpha frequency and pain relied on other, not suitable work on chronic pain and permanent smokers. In the present study, the authors use nicotine treatment and transient painful stimulation on nonsmokers.

      (3) In my view, the discussion could be more critical in some aspects, and the authors speculate in directions for which their findings cannot provide any evidence. Speculations are indeed very important to generate new ideas but should be restricted to the context of the study (experimental pain, acute interventions). The unfortunate decision to use nicotine severely hampered the authors' aim of the study.

      Impact:

      The impact of the study could be to show what has not worked to answer the research questions of the authors. The authors claim that their approach could be used to define a biomarker of pain. This is highly desirable but requires refined methods and, in order to make the tool really applicable, more accurate approaches at subject level.

      We thank reviewer #2 for their recognition of the study’s design, the importance of this research area, and the pre-registration of our study. In response to the weaknesses highlighted:

      (1) We appreciate the reviewer’s suggestion to improve the signal-to-noise ratio by applying source separation techniques, such as ICA, which have now been performed and incorporated into the manuscript. Our original decision to use sensor-level ROIs followed the precedent set in previous studies, our rationale being to improve reproducibility and avoid biases from picking individual electrodes or manually picking sources. We have added analyses using an automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing sensorimotor sites. Here again we found no significant differences in the mediation results that used a sensor space sensorimotor ROI, further supporting the robustness of the chosen approach. ICA could still potentially disentangle different sources of alpha, such as occipital alpha and mu rhythm, and provide new insights into the PAF-pain relationship. We have now added a discussion in the manuscript about the potential advantages of source separation techniques and suggest that the possible contributions of separate alpha sources be investigated and compared to sensor space PAF as a direction for future research.

      (2) We recognise the reviewer's concern regarding our choice of nicotine as a modulator of pain and alpha peak frequency (PAF). The meta-analysis by Ditre et al. (2016) indeed points to small effect sizes for nicotine's impact on experimental pain and highlights the potential for publication bias. However, our decision to use nicotine in this study was not primarily based on its direct analgesic effects, but rather on its well-documented ability to modulate PAF, in smoking and non-smoker populations, as outlined in our study aims.

      In this regard, the intentional use of nicotine was to assess whether changes in PAF could mediate alterations in pain. This approach aligns with the broader concept that a direct effect of an intervention is not necessary to observe indirect effects (Fairchild & McDaniel, 2017). We have, however, revised our introduction to further clarify this rationale, highlighting that nicotine was used as a tool for PAF modulation, not solely for its potential analgesic properties.

      (3) We agree with the reviewer’s observation that certain aspects of the Discussion could be more cautious, particularly regarding speculations about nicotine’s effects and PAF as a biomarker of pain. We have revised the Discussion to ensure that our interpretations are better grounded in the data from this study, clearly stating the limitations and avoiding overgeneralization. This revision focuses on a more critical evaluation of the potential relationships between PAF, nicotine, and pain sensitivity based solely on our experimental context.

      Finally, we also apologize for not providing access to the preregistration earlier. This was an oversight on our end, and we will ensure that future preregistrations are made available upfront.

      Reviewer #3 (Public Review):

      In this manuscript, Millard et al. investigate the effects of nicotine on pain sensitivity and peak alpha frequency (PAF) in resting state EEG. To this end, they ran a pre-registered, randomized, double-blind, placebo-controlled experiment involving 62 healthy adults who received either 4 mg nicotine gum (n=29) or placebo (n=33). Prolonged heat and pressure were used as pain models. Resting state EEG and pain intensity (assessed with a visual analog scale) were measured before and after the intervention. Additionally, several covariates (sex at birth, depression and anxiety symptoms, stress, sleep quality, among others) were recorded. Data was analyzed using ANCOVA-equivalent two-wave latent change score models, as well as repeated measures analysis of variance. Results do not show *experimentally relevant* changes of PAF or pain intensity scores for either of the prolonged pain models due to nicotine intake.

      The main strengths of the manuscript are its solid conceptual framework and the thorough experimental design. The researchers make a good case in the introduction and discussion for the need to further investigate the association of PAF and pain sensitivity. Furthermore, they proceed to carefully describe every aspect of the experiment in great detail, which is excellent for reproducibility purposes. Finally, they analyse the data from almost every possible angle and provide an extensive report of their results.

      The main weakness of the manuscript is the interpretation of these results. Even though some of the differences are statistically significant (e.g., global PAF, pain intensity ratings during heat pain), these differences are far from being experimentally or clinically relevant. The effect sizes observed are not sufficiently large to consider that pain sensitivity was modulated by the nicotine intake, which puts into question all the answers to the research questions posed in the study.

      We would like to express our gratitude to Reviewer #3 for their thoughtful and constructive review, including the positive feedback on the strengths of our study's conceptual framework, experimental design, and thorough methodological descriptions.

      We acknowledge the concern regarding the experimental and clinical relevance of some statistically significant results (e.g., global PAF and pain intensity during heat pain) and agree that small effect sizes may limit their practical implications. However, our primary goal was to assess whether nicotine-induced changes in PAF mediate pain changes, rather than to demonstrate large direct effects on pain sensitivity. Nicotine was chosen for its known ability to modulate PAF, and our focus was on the mechanistic role of PAF in pain perception. To clarify this, we have revised the discussion to better differentiate between statistical significance, experimental relevance, and clinical applicability. We emphasize that this study represents a preliminary step towards understanding PAF’s mechanistic role in pain, rather than a direct clinical application.

      We appreciate the suggestion to refine our interpretation. We have adjusted our language to ensure it aligns with the effect sizes observed and made recommendations for future research, such as testing different nicotine doses, to potentially uncover stronger or more clinically relevant effects.

      Although modest, we believe these findings offer valuable insights into the potential mechanisms by which nicotine affects alpha oscillations and pain. We have also discussed how these small effects could become more pronounced in different populations (e.g., chronic pain patients) and over time, offering guidance for future research on PAF modulation and pain sensitivity.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      I have a number of points that the authors may want to consider for this or future work.

      (1) By reviewing the literature provided by the authors in the introduction I think that using nicotine as a means to modulate pain and alpha peak frequency was a mistake. The only work that may give a hint on whether nicotine can modulate experimental pain is the meta-analysis by Ditre and colleagues (2016). They suggest that their small effect may contain a publication bias. I think the other "large body of evidence" is testing something else than analgesia.

      Thank you for your consideration of our choice of nicotine in the study. The meta-analysis by Ditre and colleagues (2016) suggests small effect sizes for nicotine's impact on experimental pain, compared to the moderate effects claimed in some papers, especially when accounting for the potential publication bias you mentioned. However, our selection of nicotine was primarily driven by its documented ability to modulate PAF rather than its direct analgesic effects, as clearly stated in our aims. Therefore, we do not view our decision to use nicotine as a mistake; instead, it was aligned with our goal of assessing whether changes in PAF mediate alterations in pain and thus served as a valuable tool. This perspective aligns with the broader concept that a direct effect is not a prerequisite for observing indirect effects of an intervention on an outcome (Fairchild & McDaniel, 2017). To further enhance clarity, we've revised the introduction to emphasize the role of nicotine in manipulating PAF in relation to our study's aims.

      Previously we wrote: “A large body of evidence suggests that nicotine is an ideal choice for manipulating PAF, as both nicotine and smoking increase PAF speed [37,40–47] as well as pain thresholds and tolerance [48–52].” This has been changed to read: “Because evidence suggests that nicotine can modulate PAF, where both nicotine and smoking increase PAF speed [37,40–47], we chose nicotine to assess our aim of whether changes in PAF mediate changes in pain in a ‘mediation by design’ approach [48]. In addition, given evidence that nicotine may increase experimental pain thresholds and tolerance [49–53], nicotine could also influence pain ratings during tonic pain.”

      (2) As mentioned above, the OSF page is not accessible.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      (3) I generally struggle with the authors' approach to investigating alpha. With the approach the authors used to detect peak alpha frequency it might be that the alpha signal may just show such a low amplitude that it is impossible to reliably detect it at electrode level. In my view, the approach is not accurate enough, which can be seen by the "jagged" shape of the individual alpha peak frequency. In my view, a source separation technique would have been more useful. I wonder which of the known cortical alphas contributes to the effects the authors have reported previously: occipital, mu rhythms projections or something else? A source separation approach disentangles the different alphas and will increase the SNR. My suggestion would be to work on ICA components or similar approaches. The advantage is that the components are almost completely free of any artefacts. ICAs could be run on the entire data or separately for each individual. In the latter case, it might be that some participants do not exhibit any alpha component.

      We appreciate your thoughtful consideration of our approach to investigating alpha. The calculation of PAF involves various methods and analysis steps across the literature (Corcoran et al., 2018; Gil Avila et al., 2023; McLain et al., 2022). Your query about which known cortical alphas contribute to reported effects is important. Initially focusing on a sensorimotor component from an ICA in Furman et al., 2018, subsequent work from our labs suggested a broader relationship between PAF and pain across the scalp (Furman et al., 2019; Furman et al., 2020; Millard et al., 2022), and a desire to conduct analyses at the sensor level in order to improve the reproducibility of the methods (Furman et al., 2020). However, based on your comment we have made several additions to the manuscript, including: explaining why we did not use manual ICA methods, suggesting this for future research, and adding an exploratory analysis using a recently developed automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing activity from occipital or motor sites.

      While we acknowledge that ICA components can offer a better signal-to-noise ratio (SNR) and possibly smoother spectral plots, we opted for our chosen method to avoid potential bias inherent in deciding on a component following source separation. The desire for a quick, automated, replicable, and unbiased pipeline, crucial for potential clinical applications of PAF as a biomarker, influenced this decision. At the time of analysis registration, automated methods for deciding which alpha components to extract following ICA were not apparent. We have now added this reasoning to Methods.

      “Contrary to some previous studies that used ICA to isolate sensory region alpha sources (Furman et al., 2018; De Martino et al., 2021; Valentini et al., 2022), we used pre-determined sensor level ROIs to improve reproducibility and reduce the potential for bias when individually selecting ICA components. Using sensor level ROIs may decrease the signal-to-noise ratio of the data; however, this approach has still been effective for observing the relationship between PAF and experimental pain (Furman et al., 2019; Furman et al., 2020).”

      We have also added use of ICA and development of methods as a suggestion for future research in the discussion:

      “Additionally, the use of global PAF may have introduced mediation measurement error into our mediation analysis. The spatial precision used in the current study was based on previous literature on PAF as a biomarker of pain sensitivity, which have used global and/or sensorimotor ROIs (Furman et al., 2018; Furman et al., 2020). Identification and use of the exploratory electrode clusters found in this study could build upon the current work (e.g., Furman et al., 2021). However, exploratory analysis of the clusters found in the present analysis demonstrated no influence on mediation analysis results (Supplementary Materials 3.8-3.10). Alternatively, independent component analysis (ICA) could be used to identify separate sources of alpha oscillations (Choi et al., 2005), as used in other experimental PAF-pain studies (Furman et al., 2018; Valentini et al., 2022), which could aid to disentangle the potential relevance of different alpha sources in the PAF-pain relationship. Although this comes with the need to develop more reproducible and automated methods for identifying such components.”

      The specific location or source of PAF that relates to pain remains unclear. Because of this, we did employ an exploratory cluster-based permutation analysis to assess the potential for variations in the presence of PAF changes across the scalp at sensor level, and emphasise that location of PAF change could be explored in future. However, we have now conducted the mediation analysis (difference score 2W-LCS model) using averages from the data-driven parietal cluster, frontal cluster, and both clusters together. For these we see a stronger effect of gum on PAF change, which was expected given the data driven approach of picking electrodes. There was still a total and direct effect of nicotine on pain during the PHP model, but still no indirect effect via change in PAF. For the CPA models, there were still no significant total, direct, or indirect effects of nicotine on CPA ratings. Therefore, using these data-driven clusters did not alter results compared to the model using the global PAF variable.

      The reader has been directed to this supplementary material so:

      “The potential mediating effect of this change in PAF on change in PHP and CPA was explored (not pre-registered) by averaging within each cluster (central-parietal: CP1, CP2, CPz, P1, P2, P3, P4, Pz, POz; right-frontal: F8, FT8, FT10) and across both clusters. This averaging across electrodes produced three new variables, each assessed in relation to mediating effects on PHP and CPA ratings. The resulting six exploratory mediation analysis (difference score 2W-LCS) models demonstrated minimal differences from the main analysis of global PAF (8-12 Hz), except for the expected stronger effect of nicotine on change in PAF (bs = 0.11-0.14, ps < .003; Supplementary Materials 3.8-3.10).”

      Moreover, our team has been working on an automated method for selecting ICA components, so in response to your comment we assessed whether using this method altered the results of the current analysis. The in-depth methodology behind this new automatic pipeline will be published with a validation from some co-authors in the current collaboration in due course. At present, in summary, this automatic pipeline conducts independent component analysis (ICA) 10 times for each resting state, and selects the component with the highest topographical correlation to a template created of a sensorimotor alpha component from Furman et al., (2018). 
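
      The component-selection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline, which is unpublished; it only shows how a component could be chosen by its topographical correlation with a template. The function name `select_component`, the array layout, and the use of the absolute Pearson correlation (because ICA component signs are arbitrary) are assumptions made for illustration.

```python
import numpy as np


def select_component(topographies, template):
    """Pick the component whose scalp map best matches a template map.

    topographies: array of shape (n_components, n_channels); hypothetical
        layout with one component scalp map per row.
    template: array of shape (n_channels,), the reference scalp map.
    Returns (index of best component, its Pearson correlation with the template).
    """
    maps = np.asarray(topographies, dtype=float)
    ref = np.asarray(template, dtype=float)

    # Centre each map and the template so the dot product is a covariance.
    maps_c = maps - maps.mean(axis=1, keepdims=True)
    ref_c = ref - ref.mean()

    # Pearson correlation per component; flat maps get a correlation of 0.
    denom = np.linalg.norm(maps_c, axis=1) * np.linalg.norm(ref_c)
    r = np.divide(maps_c @ ref_c, denom, out=np.zeros_like(denom), where=denom > 0)

    # Assumption: the sign of an ICA component is arbitrary, so rank by |r|.
    best = int(np.argmax(np.abs(r)))
    return best, float(r[best])
```

      In a pipeline that repeats ICA several times per recording, a function like this would be applied to the scalp maps from each run, keeping the best-matching component overall.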

      The results of the PHP or CPA mediation models were not substantially different using the PAF calculated from independent components than that using the global PAF. For the PHP model, the total effect (b = -0.648, p = .033) and direct effects (b = -0.666, p = .035) were still significant, and there was still no significant indirect effect (b = 0.018, p = .726). The general fit was reduced, as although the CFI was above 0.90, akin to the original model, the RMSEA and SRMR were not below 0.08, unlike the original models (Little, 2013). For the CPA model, there were still no significant total (b = -0.371, p = .357), direct (b = -0.364, p = .386), or indirect effects (b = -0.007, p = .906), and the model fit also decreased, with CFI below 0.90 and RMSEA and SRMR above 0.08. See supplementary material (3.11). Note that still no correlations were seen between this IC sensorimotor PAF and pain (PHP: r = 0.11, p = .4; CPA: r = -0.064, p = .63).

      Interestingly, in both models, there was now no longer a significant a-path (PHP: b = 0.08, p = 0.292; CPA: b = 0.039, p = 0.575), unlike previously observed (PHP: b = 0.085, p = 0.018; CPA: b = 0.089, p = 0.011). We interpret this as supporting the previously highlighted difference between finding an effect on PAF globally but not in a sensorimotor ROI (and now a sensorimotor IC), justifying the exploratory CBPA and the suggestion in the discussion to explore methodology.
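
      For readers less familiar with the total, direct, indirect and a-path terminology used above, the following sketch shows the product-of-coefficients decomposition in a simple regression setting. It is an illustration only: the authors fitted two-wave latent change score models, whereas this sketch uses ordinary least squares, and the function names `ols` and `simple_mediation` are hypothetical. In this simplified setting the total effect equals the direct effect plus the indirect effect (a × b); the reported estimates follow the same pattern (PHP: -0.666 + 0.018 = -0.648; CPA: -0.364 + (-0.007) = -0.371).

```python
import numpy as np


def ols(y, *predictors):
    """Least-squares coefficients of y on an intercept plus the predictors."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef  # [intercept, slope_1, slope_2, ...]


def simple_mediation(x, m, y):
    """Decompose the effect of x on y into direct and indirect (via m) parts.

    x: treatment indicator (e.g. 0 = placebo, 1 = nicotine)
    m: mediator (e.g. change in PAF)
    y: outcome (e.g. change in pain rating)
    """
    a = ols(m, x)[1]             # a-path: treatment -> mediator
    _, direct, b = ols(y, x, m)  # direct effect and b-path (mediator -> outcome)
    total = ols(y, x)[1]         # total effect of treatment on outcome
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": total}
```

      A non-significant a-path, as reported for the component-based PAF, implies that the indirect effect a × b has little scope to differ from zero regardless of the b-path.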

      We understand that this analysis does not fully uncover the reviewer’s question in which they wondered which of the known cortical alphas contributes to the effects reported in our previous work. However, we consider this exploration to be beyond the scope of the current paper, as it would be more appropriately addressed with larger datasets or combinations of datasets, potentially incorporating MEG to better disentangle oscillatory sources. The highlighted differences seen between global PAF, sensorimotor ROI PAF, sensorimotor IC PAF, as well as the CBPA of PAF changes provide ample directions for future research to build upon: 1) which alpha (sensor or source space) are related to pain, 2) how are these alpha signals represented robustly in a replicable way, and 3) which alpha (sensor or source space) are manipulable through interventions. These are all excellent questions for future studies to investigate.

      The below text has been added to the Discussion:

      In-house code was developed to compare a sensorimotor component to the results presented in this manuscript (Supplementary Material 3.11), showing similar results to the sensorimotor ROI mediation analysis presented here. However, examination of which alpha - be it sensor or source space - are related to pain, how they can be robustly represented, and how they can be manipulated are ripe avenues for future study.

      (4) I have my doubts that you can get a reliable close to bell-shaped amplitude distribution for every participant. The argument that the peak detection procedure is hampered by the high-amplitude lower frequency can be easily solved by subtracting the "slope" before determining the peak. My issue is that the entire analysis is resting on the assumption that each participant has a reliable alpha effect at electrode level. This is not the case. Non-alpha participants can severely distort the statistics. ICA-based analyses would be more sensitive but not every participant will show alpha. You may want to argue with robust group effects but in my view, every single participant counts, particularly for this type of data analysis, where in the case of a low SNR the "peak" can easily shift to the extremes. In case there is an alpha effect for a specific subject, we should see a smooth bump in the frequency spectrum between 8 and 12 Hz. Anything beyond that is hard to believe. The long stimulation period allows a broad FFT analysis window with a good frequency resolution in order to detect the alpha frequency bump.

      The reviewer is correct that non-alpha participants can distort the statistics. We did visually assess each individual’s EEG spectra at baseline to establish the presence of global peaks, as we believe this is good practice to aid understanding of the data. Please see Author response image 1 for individual spectra seen at baseline. Although not all participants had a ‘smooth bump in the frequency spectrum between 8 and 12 Hz’, we prefer not to impose this assumption on our data. Chiang et al. (2011) suggest that ~3% of individuals do not have a discernible alpha peak, and in our data we observed only one participant without a very obvious spectral peak (px-39). However, this participant does have enough activity within the alpha range to identify PAF by the CoG method (i.e. not just flat spectra and activity on top of 1/f characteristics). Without a pre-registered and standardised decision process in place to remove such a participant, we opted not to remove any participants to avoid curation of our data.
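
      For readers unfamiliar with the centre of gravity (CoG) estimate mentioned above, the following is a minimal sketch of the calculation, in which PAF is taken as the power-weighted mean frequency within the alpha window. The function name, the 8-12 Hz window, and the synthetic spectrum are illustrative assumptions; this is not the in-house analysis code.

```python
import math


def centre_of_gravity_paf(freqs, power, f_lo=8.0, f_hi=12.0):
    """Power-weighted mean frequency within [f_lo, f_hi] Hz (window is an assumption)."""
    pairs = [(f, p) for f, p in zip(freqs, power) if f_lo <= f <= f_hi]
    total = sum(p for _, p in pairs)
    if total <= 0:
        raise ValueError("no power in the requested window")
    return sum(f * p for f, p in pairs) / total


# Hypothetical spectrum: 0.5 Hz resolution, 1/f background plus a bump near 10 Hz.
freqs = [i * 0.5 for i in range(2, 61)]  # 1.0 ... 30.0 Hz
power = [1.0 / f + 2.0 * math.exp(-((f - 10.0) ** 2) / 2.0) for f in freqs]
paf = centre_of_gravity_paf(freqs, power)
```

      Because the estimate is a weighted mean rather than a peak search, it remains defined for a spectrum without a sharp peak. In the limiting case of a perfectly flat spectrum it simply returns the midpoint of the window, which is one reason visual inspection of individual spectra remains useful.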

      Author response image 1.

      (5) I find reports on frequent channel rejections reflect badly on the data quality. Bad channels can be avoided with proper EEG preparation. EEG should be continuously monitored during recording in order to obtain best data quality. Have any of the ROI channels been rejected?

      We appreciate your attention to the channel rejection. We believe that the average number of channels removed (0.94, 0.98, 0.74, and 0.87 [range: 0-4] for each of the four resting states, out of 64 channels) does not suggest overly frequent rejection, as it was less than one electrode on average and is below the accepted proportion of bad channels to remove/interpolate (i.e. 10%) in EEG pipelines (Debnath et al., 2020; Kayhan et al., 2022). To maintain data quality, consistently poor channels were identified and replaced over time. We hope you will accept our transparency on this issue and note that by stating how channel removal decisions were made (i.e. 8 or more deviations) and reporting the number of channels removed, we adhere to the COBIDAS guidelines (Pernet et al., 2018; 2020).
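
      The comparison with the 10% ceiling can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph above; the variable names are illustrative.

```python
N_CHANNELS = 64
THRESHOLD = 0.10  # ceiling for removed/interpolated channels cited in the response

mean_removed = [0.94, 0.98, 0.74, 0.87]  # mean channels removed per resting state
max_removed = 4                          # upper end of the reported range (0-4)

# Express each count as a fraction of the 64-channel montage.
mean_fraction = [m / N_CHANNELS for m in mean_removed]
max_fraction = max_removed / N_CHANNELS  # 4 / 64 = 0.0625

all_below = all(f < THRESHOLD for f in mean_fraction) and max_fraction < THRESHOLD
```

      Even the worst case in the reported range (4 of 64 channels, 6.25%) stays under the 10% ceiling.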

      During analysis, cases of sensorimotor ROI channels being rejected were noted and are now specified in our manuscript. “Out of 248 resting states recorded, 14 resting states had 4 ROI channels instead of 5. Importantly, no resting state had fewer than 4 channels for the sensorimotor ROI.”

      Note, we also realised that we had not specified that we did interpolate channels for the cluster based permutation analysis. This has been corrected with the following sentence:

      “Removed channels were not interpolated for the pre-registered global and sensorimotor ROI averaged analyses, but were interpolated for an exploratory cluster based permutation analysis using the nearest neighbour average method in `Fieldtrip`.”
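
      To illustrate the general idea of nearest neighbour average interpolation, the sketch below replaces a removed channel, sample by sample, with the unweighted mean of its remaining neighbours. It is a plain Python illustration under stated assumptions, not the FieldTrip routine: the function name, the channel labels, and the neighbour definition are hypothetical.

```python
def interpolate_by_neighbour_average(data, bad_channels, neighbours):
    """Reconstruct each bad channel as the unweighted mean of its good neighbours.

    data: dict mapping channel name -> list of samples
    bad_channels: set of channel names to reconstruct
    neighbours: dict mapping channel name -> list of neighbouring channel names
    """
    good_names = set(data) - set(bad_channels)
    repaired = {ch: list(data[ch]) for ch in good_names}
    for ch in bad_channels:
        # Only neighbours that were themselves retained contribute to the mean.
        good = [n for n in neighbours[ch] if n in good_names]
        if not good:
            raise ValueError(f"no good neighbours available for {ch}")
        repaired[ch] = [sum(vals) / len(good) for vals in zip(*(data[n] for n in good))]
    return repaired


# Hypothetical example: C3 was removed and is rebuilt from three neighbours.
demo = {"C1": [1.0, 2.0], "CP3": [3.0, 4.0], "FC3": [5.0, 6.0]}
out = interpolate_by_neighbour_average(demo, {"C3"}, {"C3": ["C1", "CP3", "FC3"]})
```

      Interpolated channels are linear combinations of their neighbours and add no independent information, which is consistent with interpolating only where a full channel layout is needed, as in the cluster based permutation analysis.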

      (6) I have some issues buying the authors' claims that there is an effect of nicotine on prolonged pain. By looking at the mean results for the nicotine and placebo condition, this can not be right. What was the point in including the variables in the equation? In my view, in this within-subject design the effect of nicotine should be universal, no matter what gender, age, or depression. The unconditional effect of nicotine is close to zero. I can not get my head around how any of the variables can turn the effects into significance. There must be higher or lower variable scores that might be related to a higher or lower effect on nicotine. The question is not to consider these variables as a nuisance but to show how they modulate the pain-related effect of nicotine treatment. Still, the overall nicotine effect of the entire group is basically zero.

      Another point is that for within-subject analyses even tiny effects can become statistically significant if they are systematically in one direction. This might be the case here. There might be a significant effect of nicotine on pain but the actual effect size (5.73 vs. 5.78) is actually not interpretable. I think it would be interesting for the reader how (in terms of pain rating difference) each of the variables can change the effect of nicotine.

      Thank you for your comments. We recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not have weight in an analysis that takes into account a random-effects intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample and CPA mediation analysis (i.e. N=62) and the data used for PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the reader's interpretation of results.

      In light of this, we have also altered the PAF Table 3 to reflect both the pre-post values used for the CPA mediation and baseline correlations with CPA and PHP pain (i.e. N=62), and the pre-post values used for the PHP mediation (i.e. n=60).

      It is inherently difficult to visualise the findings of a mediation analysis with confounding variables that also used latent change scores (LCS) and random-effect intercepts for participants. LCS was specifically used because of issues of regression to the mean that occur if you calculate a straightforward ‘difference-score’; calculating the difference in order to demonstrate the results of the statistical model in a figure, for example, therefore does not provide a full description of the data assessed (Valente & MacKinnon, 2017). Nevertheless, if we look at the data descriptively with this in mind, then calculating the change in PHP ratings does indicate that, for the nicotine group, the mean change in PHP ratings was -0.047 (SD = 1.05, range: -4.13, 1.45). Meanwhile, for the placebo group the mean change in PHP ratings was 0.33 (SD = 0.75, range: -1.37, 1.66). This suggests a slight decrease in pain ratings on average for the nicotine group compared to a slight increase on average for the placebo group. With control for pre-determined confounders, we found that the latent change score was 0.63 lower for the nicotine group compared to the control group (i.e. the direct effect of nicotine on change in pain).

      If the reviewer is only discussing the effect of nicotine on pain, we do not believe that this effect ‘should be universal’. There is clear evidence that effects of nicotine on other measures can vary greatly across individuals (Ettinger et al., 2009; Falco & Bevins, 2015; Pomerleau, 1995). Our intention would not be to propose a universal effect but to understand how these variables may influence nicotine's impact on pain for individuals. Here we focus on the effects of nicotine on PAF and pain sensitivity, but attempted to control for the potential influence of these other confounding factors. Therefore, our statistical approach goes beyond mean values, incorporating variables like sex at birth, age, and depression to control for and explore potential modulating factors. Control for confounding factors is an important aspect of mediation analysis (Lederer et al., 2019; VanderWeele, 2019).

      Regarding the seemingly small effect size, we understand your concern. Indeed ‘tiny effects can become statistically significant if they are systematically in one direction’, which may be what we see in this analysis. We do not agree that the effect is ‘not interpretable’, rather that it should be interpreted in light of its small effect size (effect size being the beta coefficient in our analysis, rather than the mean group difference). We agree on the importance of considering practical significance alongside statistical significance and hope to conduct additional experiments and analyses in future to elucidate the contribution of each variable to the subtle and therefore not entirely conclusive overall effect you mention.

      Your feedback on this is valuable, and we have ensured a more detailed discussion in the revised manuscript on how these factors should be interpreted alongside some additional post-hoc analyses of confounding factors that were significant in our mediation, with the note that investigation of these interactions is exploratory. We had already discussed the potential contribution of sex on the effect of nicotine on PAF, with exploratory post-hoc analysis on this included in supplementary materials. In addition, we have now added an exploratory post-hoc analysis on the potential contribution of stress on the effect of nicotine on pain. This then shows the stratified effects by the covariates that our model suggests are influencing change in PAF and pain.

      Results edits:

      “There was also a significant effect of perceived stress at baseline on change in PHP ratings when controlling for group allocation and other confounding variables (b = -0.096, p = .048, bootstrapped 95% CI: [-0.19, -0.000047]), where higher perceived stress resulted in larger decreases in PHP ratings (see Supplementary Material 3.3 for post-hoc analysis of stress).”

      Supplementary material addition:

      “3.3 Exploratory analysis of the influence of perceived stress on the effects of nicotine on change in PHP ratings”

      “Due to the significant estimated effects of perceived stress on change in PHP ratings in the 2WLCS mediation model, we also explored post-hoc effects of stress on change in PHP ratings. We found that there is strong evidence for a negative correlation between stress and change in PHP rating within the nicotine group (n = 28, r = −0.39, BF10 = 13.65; Figure 3) that is not present in the placebo group, with equivocal evidence (n = 32, r = −0.14, BF10 = 0.46). This suggests that those with higher baseline stress who had nicotine gum experienced greater decreases in PHP ratings. Note that there was less, but still sufficient evidence for this relationship within the nicotine group when the participant who was a potential outlier for change in PHP rating was removed (n = 27, r = −0.32, BF10 = 1.45).”

      Author response image 2.

      Spearman correlations of baseline perceived stress with the change in phasic heat pain (PHP) ratings suggest strong evidence for a negative relationship for the nicotine gum group in orange (n=28; BF10=13.65) but not for the placebo group in grey (n=32; BF10=0.46). Regression lines and 95% confidence intervals are shown.
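
      As a reminder of what a Spearman correlation computes, the sketch below ranks both variables (averaging tied ranks) and then takes the Pearson correlation of the ranks. It covers only the correlation coefficient, not the Bayes factors reported above, and the function names and example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: higher stress paired with larger decreases in rating.
stress = [10, 14, 18, 25]
change = [0.4, 0.1, -0.3, -1.2]
rho = spearman(stress, change)  # perfectly monotonic decreasing, so rho is -1
```

      Because only ranks enter the calculation, a single extreme change score has less leverage than it would in a Pearson correlation.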

      Discussion edits:

      “For example, in addition to the effect of nicotine on prolonged heat pain ratings, our results suggest an effect of stress on changes in heat pain ratings, with those self-reporting higher stress at baseline having greater reductions in pain. Our post-hoc analysis suggested that this relationship between higher stress and larger decrease in PHP ratings was only present for the nicotine group (Supplementary Material 3.3). As stress is linked to nicotine use [69,70] and pain [71–73], these interactions should be explored in future.”

      (7) Is the differential effect of nicotine vs. placebo based on the pre vs. post treatment effect of the placebo condition or on the pre vs. post effect of the nicotine treatment? Can the mediation model be adapted and run for each condition separately? The placebo condition seems to have a stronger effect and may have driven the result.

      Thank you for your comments. In our mediation analysis, the differential effect of nicotine vs. placebo is assessed as a comparison between the pre-post difference within each condition. A latent change score (i.e. pre-post) is calculated for each condition (nicotine and placebo), and then the effect of being in the nicotine group (dummy coded as 1) is compared to being in the placebo group (dummy coded as 0). The comparison between conditions is needed for this model (Valente & MacKinnon, 2017), as we are assessing the change in PAF and pain in the nicotine group compared to the change in the placebo group.
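
      The role of the dummy-coded group variable can be illustrated with a deliberately simplified sketch. It uses observed difference scores and no covariates, so it is not the two-wave latent change score mediation model; it only shows that regressing change on a 0/1 group indicator returns the difference in mean change between groups. All data and names are hypothetical.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical ratings: 1 = nicotine, 0 = placebo.
group = [1, 1, 1, 0, 0, 0]
pre = [5.0, 6.0, 4.0, 5.0, 6.0, 4.0]
post = [4.5, 6.0, 3.5, 5.5, 6.5, 4.0]
change = [b - a for a, b in zip(pre, post)]  # post minus pre

slope = ols_slope(group, change)
mean_diff = (mean(c for c, g in zip(change, group) if g == 1)
             - mean(c for c, g in zip(change, group) if g == 0))
# slope and mean_diff agree (both -2/3 here, up to floating-point rounding).
```

      A negative coefficient therefore means that the change in the nicotine group was lower than the change in the placebo group, which is the sense in which the placebo condition serves as the reference.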

      However, to address your comment, it is possible to simplify and assess the relationship between the change in peak alpha frequency (PAF) and change in pain within each gum group (nicotine and placebo) independently, without including the intervention as a factor. To do this, the mediation model can be simplified to regression analysis with latent change scores that focus purely on these relationships. The results of this can help to understand whether change in PAF influences change in pain within each group separately. As with the main analysis, we see no significant influence of change in PAF on change in pain while controlling for the same confounding variables within the nicotine group (Beta = -0.146 +/- 1.105, p = 0.895, 95% CI: -2.243, 2.429) or the placebo group (Beta = 0.730 +/- 2.061, p = 0.723, 95% CI: -4.177, 3.625).

      When suggesting that “the placebo condition seems to have a stronger effect and may have driven the result”, we believe you are referring to the increase in mean PHP ratings within the placebo group from pre (5.51 +/- 2.53) to post-placebo gum (5.84 +/- 2.67). Indeed there was a significant increase in pain ratings pre to post chewing placebo gum (t(31) = -2.53, p = 0.0165, 95% CI: -0.603, -0.0653), that was not seen after chewing nicotine gum (t(27) = 0.237, p = 0.81, 95% CI: -0.358, 0.452). In lieu of a control where no gum was chewed (i.e. simply a second pain assessment ~30 minutes after the first), we assume the gum without nicotine is a good reference that controls for the effect of time plus expectation of chewing nicotine gum. With this in mind, as we describe in our results, the change in PHP ratings is reduced in the nicotine group compared to the placebo group. Note that this phrasing keeps the effect of placebo on pain as our reference from which to view the effect of nicotine on pain. However, you are correct that we need to ensure we emphasise that the change in pain in the nicotine group is reduced in comparison to the change seen after placebo.
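
      The within-group comparisons above are paired tests on pre and post ratings. The sketch below computes a paired t statistic under the assumption that differences are taken as pre minus post, which appears consistent with the negative t value reported for an increase in ratings. The function name and data are hypothetical, and the p-value is omitted to keep the example within the standard library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, degrees of freedom) for paired samples, using pre minus post."""
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical ratings that mostly increase from pre to post.
pre = [5.0, 6.0, 4.0, 7.0]
post = [5.5, 6.5, 4.0, 8.0]
t, df = paired_t(pre, post)  # t is about -2.45 with 3 degrees of freedom
```

      The statistic depends on the mean difference relative to the spread of the differences, so a small mean shift can still yield a large t value when nearly all participants move in the same direction.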

      We have not included these extra statistics in our revised manuscript, but hope that they aid your understanding and interpretation of the included analyses, and we have highlighted these nuances in the discussion.

      “However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice.”

      (8) I would not dare to state that nicotine can function as an acute analgesic. Acute analgesics need to work for everyone. The average effect here is close to zero.

      In light of your feedback, we have refined our language to avoid a sweeping assertion of universal analgesic effects and emphasize individual variability. Nicotine's role as a coping strategy for pain is acknowledged in the literature (Robinson et al., 2022), with the meta-analysis by Ditre et al. (2016) discussing its potential as an acute analgesic in humans, along with some evidence from animal research (Zhang et al., 2020). Our revised discussion underscores the need for further exploration into factors influencing nicotine's potential impact on pain. We have also specified the short-term nature of nicotine use in this context to distinguish acute effects from potential opposing effects after long-term use (Zhang et al., 2020).

      “Short-term nicotine use is thought to have acute analgesic properties in experimental settings, with a review reporting that nicotine increased pain thresholds and pain tolerance [49]. In addition, research in a rat model suggests analgesic effects on mechanical thresholds after short-term nicotine use (Zhang et al., 2020). However, previous research has not assessed the acute effects of nicotine on prolonged experimental pain models. The present study found that 4 mg of nicotine reduced heat pain ratings during prolonged heat pain compared to placebo for our human participants, but that prolonged pressure pain decreased irrespective of which gum was chewed. Our findings are thus partly consistent with the idea that nicotine may have acute analgesic properties [49], although further research is required to explore factors that may influence nicotine’s potential impact on a variety of prolonged pain models. We further advance the literature by reporting this effect in a model of prolonged heat pain, which better approximates the experience of clinical pain than short lasting models used to assess thresholds and tolerance [50]. However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice. Future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”

      (9) Figures 2E and 2F are not particularly intuitive. Usually, the colour green in "jet" colour coding is being used for "zero" values. I would suggest to cut off the blue and use only the range between red green and red.

      We have chosen to retain the current colour scale for several reasons. In our analysis, green represents the middle of the frequency range (approx 10 Hz in this case), and if we were to use green as zero, it would effectively remove both blue and green from the plot, resulting in only red shades. Additionally, we have provided a clear colour scale for reference next to the plot, which allows readers to interpret the data accurately. Our intention is to maintain clarity and precision in representing the data, rather than conforming strictly to conventional practices in color coding.

      We believe that the current representation effectively conveys the results of our study while allowing readers to interpret the data within the context provided. Thank you again for your suggestion, and we hope you understand our reasoning in this matter.

      (10) Did the authors do their analysis on the parietal ROI or on the pre-registered ROI?

      The analysis was conducted on the pre-registered sensorimotor ROI and on the global values. We have now also conducted the analysis with the regions suggested by the cluster based permutation analysis, as requested by reviewer 2, comment 3.

      (11) Point 3.2 in the discussion. I would be very cautious to discuss smoking and chronic pain in the context of the manuscript. The authors can not provide any additional knowledge with their design targeting non-smokers, acute nicotine and experimental pain. The information might be interesting in the introduction in order to provide the reader with some context but is probably misleading in the discussion.

      We appreciate your perspective and agree with your caution regarding the discussion of smoking and chronic pain. While our study specifically targets non-smokers and focuses on acute nicotine effects in experimental pain, we understand the importance of contextual clarity. We have removed these points from the discussion to not mislead the reader.

      Previously we wrote, and have removed: “For those with chronic pain, smoking and nicotine use is reported as a coping strategy for pain [52]; abstinence can increase pain sensitivity [48,50], and pain is thus seen as a barrier to smoking cessation due to fear of worsening pain [51,52]. Therefore, continued understanding of the acute effects of nicotine on models of prolonged pain could improve understanding of the role of nicotine and smoking use in chronic pain [49,51,52].”

      (12) I very much appreciate section 3.3 of the discussion. I would not give up on PAF as a target to modulate pain. A modulation might not be possible in such a short period of experimental intervention. PAF might need longer and different interventions to gradually shift in order to attenuate the intensity of pain. As discussed by the authors themselves, I would also consider other targets for alpha analysis (as mentioned above not other electrodes or ROIs but separated sources.)

      Thank you for your comments on section 3.3. We appreciate your recognition of the potential significance of PAF as a target for pain modulation. Your insights align with our considerations that the experimental intervention duration or type might be a limiting factor in observing substantial shifts in PAF to attenuate pain intensity. We had mentioned the use of the exploratory electrode clusters in future work, but have now also mentioned that the use of ICA to identify separate ICA sources may provide an alternative approach. See responses to your previous ICA comment regarding separate sources.

      REFERENCES for responses to reviewer 2

      Chiang, A. K. I., Rennie, C. J., Robinson, P. A., Van Albada, S. J., & Kerr, C. C. (2011). Age trends and sex differences of alpha rhythms including split alpha peaks. Clinical Neurophysiology, 122(8), 1505-1517.

      Debnath, R., Buzzell, G. A., Morales, S., Bowers, M. E., Leach, S. C., & Fox, N. A. (2020). The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology, 57(6), e13580.

      Ettinger, U., Williams, S. C., Patel, D., Michel, T. M., Nwaigwe, A., Caceres, A., ... & Kumari, V. (2009). Effects of acute nicotine on brain function in healthy smokers and non-smokers: estimation of inter-individual response heterogeneity. Neuroimage, 45(2), 549-561.

      Falco, A. M., & Bevins, R. A. (2015). Individual differences in the behavioral effects of nicotine: a review of the preclinical animal literature. Pharmacology Biochemistry and Behavior, 138, 80-90.

      Kayhan, E., Matthes, D., Haresign, I. M., Bánki, A., Michel, C., Langeloh, M., ... & Hoehl, S. (2022). DEEP: A dual EEG pipeline for developmental hyperscanning studies. Developmental cognitive neuroscience, 54, 101104.

      Lederer, D. J., Bell, S. C., Branson, R. D., Chalmers, J. D., Marshall, R., Maslove, D. M., ... & Vincent, J. L. (2019). Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. Annals of the American Thoracic Society, 16(1), 22-28.

      Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.

      Pernet, C., Garrido, M., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2018). Best practices in data analysis and sharing in neuroimaging using MEEG.

      Pernet, C., Garrido, M. I., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2020). Issues and recommendations from the OHBM COBIDAS MEEG committee for reproducible EEG and MEG research. Nature neuroscience, 23(12), 1473-1483.

      Pomerleau, O. F. (1995). Individual differences in sensitivity to nicotine: implications for genetic research on nicotine dependence. Behavior genetics, 25(2), 161-177.

      Robinson, C. L., Kim, R. S., Li, M., Ruan, Q. Z., Surapaneni, S., Jones, M., ... & Southerland, W. (2022). The Impact of Smoking on the Development and Severity of Chronic Pain. Current Pain and Headache Reports, 26(8), 575-581.

      Xia, J., Mazaheri, A., Segaert, K., Salmon, D. P., Harvey, D., Shapiro, K., ... & Olichney, J. M. (2020). Event-related potential and EEG oscillatory predictors of verbal memory in mild cognitive impairment. Brain communications, 2(2), fcaa213.

      VanderWeele, T. J. (2019). Principles of confounder selection. European journal of epidemiology, 34, 211-219.

      Valente, M. J., & MacKinnon, D. P. (2017). Comparing models of change to estimate the mediated effect in the pretest–posttest control group design. Structural Equation Modeling: A Multidisciplinary Journal, 24(3), 428-450.

      Vimolratana, O., Aneksan, B., Siripornpanich, V., Hiengkaew, V., Prathum, T., Jeungprasopsuk, W., ... & Klomjai, W. (2024). Effects of anodal tDCS on resting state EEG power and motor function in acute stroke: a randomized controlled trial. Journal of NeuroEngineering and Rehabilitation, 21(1), 1-15.

      Zhang, Y., Yang, J., Sevilla, A., Weller, R., Wu, J., Su, C., ... & Candiotti, K. A. (2020). The mechanism of chronic nicotine exposure and nicotine withdrawal on pain perception in an animal model. Neuroscience letters, 715, 134627.

      Reviewer #3 (Recommendations For The Authors):

      Introduction

      (1) Rationale and link to chronic pain. I am not sure I agree with the statement "The ability to identify those at greater risk of developing chronic pain is limited". I believe there is an abundance of literature associating risk factors with the different instances of chronic pain (e.g., Mills et al., 2019). The fact that the authors cite studies involving potential neuroimaging biomarkers leads me to believe that they perhaps did not intend to make such a broad statement, or that they wanted to focus on individual prediction instead of population risk.

      We thank the reviewer for the thought put into this comment. We did indeed wish to refer to individual prediction, but also realise that the focus on predicting pain might not be the most appropriate opening for this manuscript. Therefore, we have adjusted the below sentence to refer to the need to identify modifiable factors rather than the need to predict pain.

      “Identifying modifiable factors that influence pain sensitivity could be a key step in reducing the presence and burden of chronic pain (van der Miesen et al., 2019; Davis et al., 2020; Tracey et al., 2021).”

      (2) The statement "Individual peak alpha frequency (PAF) is an electro-physiological brain measure that shows promise as a biomarker of pain sensitivity, and thus may prove useful for predicting chronic pain development" is a non sequitur. PAF may very well be a biomarker of pain sensitivity, but the best measures of pain sensitivity we have (self-reported pain intensity ratings) in general are not in themselves predictive of the development of chronic pain. Conversely, features that are not related to pain sensitivity could be useful for predicting chronic pain (e.g., Tanguay-Sabourin et al., 2023).

      We agree that it is essential to acknowledge that self-reported pain intensity ratings alone are not definitive predictors of chronic pain development. To align with this, we have revised the sentence, removing the second clause to avoid overstatement. The adjusted sentence now reads, "Individual peak alpha frequency (PAF) is an electrophysiological brain measure that shows promise as a biomarker of pain sensitivity."

      (3) Finally, some of the statements in the discussion comparing a tonic heat pain model with chronic neuropathic pain might be an overstatement. Whereas it is true that some of the descriptors are similar, the time courses and mechanisms are vastly different.

      We appreciate this comment, and agree that it is difficult to compare the heat pain model used to clinical neuropathic pain. This was an oversight and with further understanding we have removed this comment from the introduction and the discussion:

      “In parallel, we saw no indication of a relationship between PAF and pain ratings during CPA. The introduction of the CPA model, specifically calibrated to a moderate pain threshold, provides further support for the notion that the relationship between PAF and pain is specific to certain pain types [17,28]. Prolonged heat pain was pre-dominantly described as moderate/severe shooting, sharp, and hot pain, whereas prolonged pressure pain was predominantly described as mild/moderate throbbing, cramping, and aching in the present study. It is possible that the PAF–pain relationship is specific to particular pain models and protocols [12,17].”

      Methodology

      (4) or the benefit of good science. However, I am compelled to highlight that I could not access the preregistered files, even though I waited for almost two weeks after requesting permission to do so. This was a problem on two levels: the main one is that I could not check the hypothesized effect sizes of the sample size estimation, which are not only central to my review, and in general negate all the benefits that should go with preregistration (i.e., avoiding p-hacking, publication bias, data dredging, HARKing, etc.). The second one is that I had to provide an email address to request access. This allows the authors to potentially identify the reviewers. Whereas I have no issues with this and I support transparent peer review practices (https://elifesciences.org/inside-elife/e3e90410/increasingtransparency-in-elife-s-review-process), I also note that this might condition other reviewers.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      Interpretation of results

      (5) To be perfectly clear, I trust the results of this study more than some of the cited studies regarding nicotine and pain because it was preregistered, the sample size is considerably larger, and it seems carefully controlled. I just do not agree with the interpretation of the results, stated in the first paragraph of the Discussion. Quoting J. Cohen, "The primary product of a research inquiry is one or more measures of effect size, not P values" (Cohen, 1990). As I am sure the authors are aware, even tiny differences between conditions, treatments or groups will eventually be statistically significant given arbitrarily large sample sizes. What really matters then is the magnitude of these differences. In general, the authors hypothesize on why there were no differences on the pressure pain model, and why decreases in heat pain were not mediated by PAF, but do not seem to consider the possibility that the intervention just did not cause the intended effect on the nociceptive system, which would be a much more straightforward explanation for all observations.

      While acknowledging and agreeing with the concern that 'even tiny differences between conditions, treatments, or groups will eventually be statistically significant given arbitrarily large sample sizes,' it's crucial to clarify that our sample size of N=62 does not fall into the category of arbitrarily large. We carefully considered the observed outcomes in the pressure pain model and the lack of PAF mediation in heat pain, as dictated by our statistical approach and the obtained results.

      The suggestion of a straightforward explanation aligning with the intervention not causing the intended effect on the nociceptive system is a valid consideration. We did contemplate the possibility of a false positive, emphasising this in the limitations of our findings and the need for replication to draw stronger conclusions to follow up this initial study.

      (6) In this regard, I do not believe that an average *increase* of 0.05 / 10 (Nicotine post - pre) can be considered a "reduction of pain ratings", regardless of the contrast with placebo (average increase of 0.24 / 10). This tiny effect size is more relevant in the context of the considerable inter-individual variation, in which subjects scored the same heat pain model anywhere from 1 to 10, and the same pressure pain model anywhere from 1 to 8.5. In this regard, the minimum clinically or experimentally important differences (MID) in pain ratings varies from study to study and across painful conditions but is rarely below 1 / 10 in a VAS or NRS scale, see f. ex. (Olsen et al., 2017). It is not my intention to question whether nicotine can function as an acute analgesic in general (as stated in the Discussion), but instead, if it worked as such under these very specific experimental conditions. I also acknowledge that the authors note this issue in two lines in the Discussion, but I believe that this is not weighed properly.

      We appreciate your perspective on the interpretation of the effect size, and we understand the importance of considering it in the context of individual variation.

      As also discussed in response to comment 6 from Reviewer 2, we recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not carry weight in an analysis that includes a random intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample used for the CPA mediation analysis (i.e. N=62) and the data used for the PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the readers' interpretation of the results.
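
      The reasoning about baseline heterogeneity can be sketched with simulated data. This is only an illustration under stated assumptions: the ratings are generated from hypothetical values (baseline spread, measurement noise, and a true change of 0.2), and within-participant change scores are used as a simplified stand-in for what a participant-level random intercept absorbs. It is not the model fitted in the study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical simulation settings, not values from the study.
N_PARTICIPANTS = 20
BASELINE_MEAN = 5.0
BASELINE_SD = 2.0   # large spread between participants
NOISE_SD = 0.3      # small measurement noise within a participant
TRUE_CHANGE = 0.2   # same pre-to-post shift for everyone

pre_ratings = []
post_ratings = []
for _ in range(N_PARTICIPANTS):
    # Each participant has their own baseline level (their "intercept").
    intercept = random.gauss(BASELINE_MEAN, BASELINE_SD)
    pre_ratings.append(intercept + random.gauss(0.0, NOISE_SD))
    post_ratings.append(intercept + TRUE_CHANGE + random.gauss(0.0, NOISE_SD))

# Spread of raw ratings mixes between- and within-participant variation.
raw_sd = statistics.stdev(pre_ratings + post_ratings)

# Within-participant changes cancel each participant's baseline level.
changes = [post - pre for pre, post in zip(pre_ratings, post_ratings)]
change_sd = statistics.stdev(changes)

print(f"SD of raw ratings:            {raw_sd:.2f}")
print(f"SD of within-person changes:  {change_sd:.2f}")
print(f"Mean within-person change:    {statistics.mean(changes):.2f}")
```

      Under these assumptions the spread of the raw ratings is dominated by differences between participants, so group means alone say little about a small within-participant shift; accounting for each participant's own baseline removes that source of variation.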

      Moreover, we have made sure to refer to the comparison with the placebo group when discussing the reduction or decrease in pain seen in the nicotine group, for example:

      “2) nicotine reduced prolonged heat pain intensity but not prolonged pressure pain intensity compared to placebo gum;”

      “The nicotine group had a decrease in heat pain ratings compared to the placebo group and increased PAF speed across the scalp from pre to post-gum, driven by changes at central-parietal and right-frontal regions.”

      We have kept our original comment of whether this effect on pain is meaningful in practice to refer to the minimum clinically or experimentally important differences in pain ratings as highlighted by Olsen et al., 2017.

      “While acknowledging the modest effect size, it’s essential to consider the broader context of our study’s focus. Assessing the clinical relevance of pain reduction is pertinent in applications involving the use of any intervention for pain management [69]. However, from a mechanistic standpoint, particularly in understanding the implications of and relation to PAF, the specific magnitude of the pain effect becomes less pivotal. Nevertheless, future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”

      (7) In line with the topic of effect sizes, average effect sizes for PAF in the study cited in the manuscript range from around 1 Hz (Boord et al., 2008; Wydenkeller et al., 2009; Lim et al., 2016), to 2 Hz (Foulds et al., 1994), compared with changes of 0.06 Hz (Nicotine post - pre) or -0.01 Hz (Placebo post - pre). MIDs are not so clearly established for peak frequencies in EEG bands, but they should be certainly larger than some fractions of a Hertz (which is considerably below the reliability of the measurement).

      We appreciate your attention to these nuances. We acknowledge the differences in effect sizes between our study and those referenced in the manuscript. Given the current state of the literature, it's noteworthy that ‘MIDs’ for peak frequencies in EEG bands, particularly PAF changes, are not clearly established, other than a recent publication suggesting that even small changes in PAF are reliable and meaningful (Furman et al., 2021). In light of this, we have addressed the uncertainty around the existence and determination of MIDs in our revision, highlighting the need for further research in this area.

      In addition, our study employed a greater frequency resolution (0.2 Hz) compared to some of the referenced studies, with approximately 0.5 Hz resolution (Boord et al., 2008; Wydenkeller et al., 2009; Foulds et al., 1994). This improved resolution allows for a more precise measurement of changes in PAF. Considering this, it is plausible that studies with lower resolution might have conflated increases in PAF, and our higher resolution contributes to a more accurate representation of the observed changes.

      We have also incorporated this insight into the manuscript, emphasising the methodological advancements in our study and their potential impact on the interpretation of PAF changes. Thank you for your thoughtful feedback.

      “The ability to detect changes in PAF can be considerably impacted by the frequency resolution used during Fourier Transformations, an element that is overlooked in recent methodological studies on PAF calculation [16,95]. Changes in PAF within individuals might be obscured or conflated by lower frequency resolutions, which should be considered further in future research.”
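
      The relationship between analysis window length and frequency resolution referred to above can be written out explicitly. The sketch below relies only on the standard property of the discrete Fourier transform that the bin spacing equals the sampling rate divided by the number of samples (equivalently, one over the window duration). The sampling rate of 500 Hz and the 8 to 12 Hz alpha band are assumptions made for illustration and are not taken from the manuscript.

```python
def frequency_resolution(sampling_rate_hz, n_samples):
    """Spacing between adjacent DFT frequency bins, in Hz."""
    return sampling_rate_hz / n_samples


def window_seconds_for(resolution_hz):
    """Window duration (without zero-padding) that yields a given bin spacing."""
    return 1.0 / resolution_hz


def bins_in_band(low_hz, high_hz, resolution_hz):
    """Number of frequency bins from low_hz to high_hz inclusive."""
    return int(round((high_hz - low_hz) / resolution_hz)) + 1


# Assumed sampling rate and alpha band, for illustration only.
SAMPLING_RATE_HZ = 500
ALPHA_BAND_HZ = (8.0, 12.0)

for resolution in (0.2, 0.5):
    seconds = window_seconds_for(resolution)
    n_samples = int(round(seconds * SAMPLING_RATE_HZ))
    n_bins = bins_in_band(ALPHA_BAND_HZ[0], ALPHA_BAND_HZ[1], resolution)
    print(
        f"resolution {resolution} Hz: window {seconds:.0f} s, "
        f"{n_samples} samples, {n_bins} bins in the alpha band"
    )
```

      Under these assumptions, a 0.2 Hz resolution corresponds to 5 s windows and a 0.5 Hz resolution to 2 s windows, so the finer grid places more than twice as many bins inside the alpha band.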

      (8) The authors also ran alternative statistical models to analyze the data and did not find consistent results in terms of PHP ratings (PAF modulation was still statistically significantly different). The authors attribute this to the necessity of controlling for covariates. Now, considering the effect sizes, aren't these statistically significant differences just artifacts stemming from the inclusion of too many covariates (Simmons et al., 2011)? How much influence should be attributable to depression and anxiety symptoms, stress, sleep quality and past pain, considering that these are healthy volunteers? Should these contrasting differences call the authors to question the robustness of the findings (i.e., whether the same data subjected to different analysis provides the same results), particularly when the results do not align with the preregistered hypothesis (PAF modulation should occur on sensorimotor ROIs)?

      Thank you for your comments on our alternative statistical models. By including these covariates, we aim to provide a more nuanced understanding of the complexities within our data by considering their potential impact on the effects of interest. The decision to include covariates was preregistered (apologies again that this was not available) and made with consideration of balancing model complexity and avoiding potential confounding. Moreover, we hope that the insights gained from these analyses will offer valuable information about the behaviour of our data and aid future research in terms of power calculations, expected variance, and study design.

      (9) Beyond that, I believe in some cases that the authors overreach in an attempt to provide explanations for their results. While I agree that sex might be a relevant covariate, I cannot say whether the authors are confirming a pre-registered hypothesis regarding the gender-specific correlation of PAF and pain, or if this is just a post hoc subgroup analysis. Given the large number of analyses performed (considering the main document and the supplementary files), caution should be exercised on the selective interpretation of those that align with the researchers' hypotheses.

      We chose to explore the influence of sex on the correlation between PAF and pain because this has also been investigated in previous publications on the relationship (Furman et al., 2020). We state that the assessment by sex is exploratory in our results on p.17: “in an exploratory analysis of separate correlations in males and females (Figure 5, plot C)”. For clarity regarding whether this was a pre-registered exploration or not, we have adjusted this to be: “in an exploratory analysis (not pre-registered) of separate correlations in males and females (Figure 5, plot C), akin to those conducted in previous research on this topic (Furman et al., 2020)”.

      We have made sure to state this in the discussion also. Therefore, when we previously said on p.22:

      “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7–11,15] was only observed here for male participants during the PHP model for global PAF.” We have now changed this to: “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7–11,15] was only observed here for male participants during the PHP model for global PAF in an exploratory analysis.”

      Please also note that we altered the colour and shape of the points on the correlation plot (Figure 5 in the initial submission): the light brown used for male participants was changed to a dark brown because the light colour was difficult to read, and the shape of the male points was changed so that the two groups can be distinguished in grey-scale.

      Overall, your thoughtful feedback is instrumental in refining the interpretation of our findings, and we look forward to presenting a more comprehensive and nuanced discussion. Thank you for your comments.

      REFERENCES for responses to reviewer 3

      Arendt-Nielsen, L., & Yarnitsky, D. (2009). Experimental and clinical applications of quantitative sensory testing applied to skin, muscles and viscera. The Journal of Pain, 10(6), 556-572.

      Chowdhury, N. S., Skippen, P., Si, E., Chiang, A. K., Millard, S. K., Furman, A. J., ... & Seminowicz, D. A. (2023). The reliability of two prospective cortical biomarkers for pain: EEG peak alpha frequency and TMS corticomotor excitability. Journal of Neuroscience Methods, 385, 109766.

      Fishbain, D. A., Lewis, J. E., & Gao, J. (2013). Is There Significant Correlation between Self-Reported Low Back Pain Visual Analogue Scores and Low Back Pain Scores Determined by Pressure Pain Induction Matching?. Pain Practice, 13(5), 358-363.

      Furman, A. J., Prokhorenko, M., Keaser, M. L., Zhang, J., Chen, S., Mazaheri, A., & Seminowicz, D. A. (2021). Prolonged pain reliably slows peak alpha frequency by reducing fast alpha power. bioRxiv, 2021-07.

      Heitmann, H., Ávila, C. G., Nickel, M. M., Dinh, S. T., May, E. S., Tiemann, L., ... & Ploner, M. (2022). Longitudinal resting-state electroencephalography in patients with chronic pain undergoing interdisciplinary multimodal pain therapy. Pain, 163(9), e997.

      McLain, N. J., Yani, M. S., & Kutch, J. J. (2022). Analytic consistency and neural correlates of peak alpha frequency in the study of pain. Journal of neuroscience methods, 368, 109460.

      Ngernyam, N., Jensen, M. P., Arayawichanon, P., Auvichayapat, N., Tiamkao, S., Janjarasjitt, S., ... & Auvichayapat, P. (2015). The effects of transcranial direct current stimulation in patients with neuropathic pain from spinal cord injury. Clinical Neurophysiology, 126(2), 382-390.

      Parker, T., Huang, Y., Raghu, A. L., FitzGerald, J., Aziz, T. Z., & Green, A. L. (2021). Supraspinal effects of dorsal root ganglion stimulation in chronic pain patients. Neuromodulation: Technology at the Neural Interface, 24(4), 646-654.

      Petersen-Felix, S., & Arendt-Nielsen, L. (2002). From pain research to pain treatment: the role of human experimental pain models. Best Practice & Research Clinical Anaesthesiology, 16(4), 667-680.

      Sarnthein, J., Stern, J., Aufenberg, C., Rousson, V., & Jeanmonod, D. (2006). Increased EEG power and slowed dominant frequency in patients with neurogenic pain. Brain, 129(1), 55-64.

      Sato, G., Osumi, M., & Morioka, S. (2017). Effects of wheelchair propulsion on neuropathic pain and resting electroencephalography after spinal cord injury. Journal of Rehabilitation Medicine, 49(2), 136-143.

      Sufianov, A. A., Shapkin, A. G., Sufianova, G. Z., Elishev, V. G., Barashin, D. A., Berdichevskii, V. B., & Churkin, S. V. (2014). Functional and metabolic changes in the brain in neuropathic pain syndrome against the background of chronic epidural electrostimulation of the spinal cord. Bulletin of experimental biology and medicine, 157(4), 462-465.

    1. eLife Assessment

      This valuable study describes MerQuaCo, a computational and automatic quality control tool for spatial transcriptomics datasets. The authors have collected a remarkable number of tissues to construct the main algorithm. The exceptional strength of the evidence is demonstrated through a combination of empirical observations, automated computational approaches, and validation against existing software packages. MerQuaCo will interest researchers who routinely perform spatial transcriptomic imaging (especially MERSCOPE), as it provides an imperfection detector and quality control measures for reliable and reproducible downstream analysis.

    2. Reviewer #1 (Public review):

      Summary:

      The authors present MerQuaCo, a computational tool that fills a critical gap in the field of spatial transcriptomics: the absence of standardized quality control (QC) tools for image-based datasets. Spatial transcriptomics is an emerging field where datasets are often imperfect, and current practices lack systematic methods to quantify and address these imperfections. MerQuaCo offers an objective and reproducible framework to evaluate issues like data loss, transcript detection variability, and efficiency differences across imaging planes.

      Strengths:

      (1) The study draws on an impressive dataset comprising 641 mouse brain sections collected on the Vizgen MERSCOPE platform over two years. This scale ensures that the documented imperfections are not isolated or anecdotal but represent systemic challenges in spatial transcriptomics. The variability observed across this large dataset underscores the importance of using sufficiently large sample sizes when benchmarking different image-based spatial technologies. Smaller datasets risk producing misleading results by over-representing unusually successful or unsuccessful experiments. This comprehensive dataset not only highlights systemic challenges in spatial transcriptomics but also provides a robust foundation for evaluating MerQuaCo's metrics. The study sets a valuable precedent for future quality assessment and benchmarking efforts as the field continues to evolve.

      (2) MerQuaCo introduces thoughtful metrics and filters that address a wide range of quality control needs. These include pixel classification, transcript density, and detection efficiency across both x-y axes (periodicity) and z-planes (p6/p0 ratio). The tool also effectively quantifies data loss due to dropped images, providing tangible metrics for researchers to evaluate and standardize their data. Additionally, the authors' decision to include examples of imperfections detectable by visual inspection but not flagged by MerQuaCo reflects a transparent and balanced assessment of the tool's current capabilities.

      Weaknesses:

      (1) The study focuses on cell-type label changes as the main downstream impact of imperfections. Broadening the scope to explore expression response changes of downstream analyses would offer a more complete picture of the biological consequences of these imperfections and enhance the utility of the tool.

      (2) While the manuscript identifies and quantifies imperfections effectively, it does not propose post-imaging data processing solutions to correct these issues, aside from the exclusion of problematic sections or transcript species. While this is understandable given the study is aimed at the highest quality atlas effort, many researchers don't need that level of quality to compare groups. It would be important to include discussion points as to how those cut-offs should be decided for a specific study.

      (3) Although the authors demonstrate the applicability of MerQuaCo on a large MERFISH dataset and a limited number of sections from other platforms, it would be helpful to describe the limits of its generalizability.

    3. Reviewer #2 (Public review):

      Summary:

      The authors present MerQuaCo, a computational tool for quality control in image-based spatial transcriptomics, especially MERSCOPE. They assessed MerQuaCo on 641 slides produced at their institute in terms of the ratio of imperfection, transcript density, and variations of quality by different planes (x-axis).

      Strengths:

      This looks to be a valuable work that can be a good guideline of quality control in future spatial transcriptomics. A well-controlled spatial transcriptomics dataset is also important for the downstream analysis.

      Weaknesses:

      The results section needs to be more structured.

    4. Reviewer #3 (Public review):

      Summary:

      MerQuaCo is an open-source computational tool developed for quality control in image-based spatial transcriptomics data, with a primary focus on data generated by the Vizgen MERSCOPE platform. The authors analyzed a substantial dataset of 641 fresh-frozen adult mouse brain sections to identify and quantify common imperfections, aiming to replace manual quality assessment with an automated, objective approach, providing standardized data integrity measures for spatial transcriptomics experiments.

      Strengths:

      The manuscript's strengths lie in its timely utility, rigorous empirical validation, and practical contributions to methodology and biological discovery in spatial transcriptomics.

      Weaknesses:

      While MerQuaCo demonstrates utility in large datasets and cross-platform potential, its generalizability and validation require expansion, particularly for non-MERSCOPE platforms and real-world biological impact.

    1. eLife Assessment

      This study presents important findings on the role of CXXC-finger protein 1 in regulatory T cell gene regulation and function. The evidence supporting the authors' claims is convincing, with mostly state-of-the-art technology. The work will be of relevance to immunologists interested in regulatory T cell biology and autoimmunity.

    2. Reviewer #1 (Public review):

      Summary:

      This work investigated the role of CXXC-finger protein 1 (CXXC1) in regulatory T cells. CXXC1-bound genomic regions largely overlap with Foxp3-bound regions and regions with H3K4me3 histone modifications in Treg cells. CXXC1 and Foxp3 interact with each other, as shown by co-immunoprecipitation. Mice with Treg-specific CXXC1 knockout (KO) succumb to lymphoproliferative diseases between 3 to 4 weeks of age, similar to Foxp3 KO mice. Although the immune suppression function of CXXC1 KO Treg is comparable to WT Treg in an in vitro assay, these KO Tregs failed to suppress autoimmune diseases such as EAE and colitis in Treg transfer models in vivo. This is partly due to the diminished survival of the KO Tregs after transfer. CXXC1 KO Tregs do not have an altered DNA methylation pattern; instead, they display weakened H3K4me3 modifications within the broad H3K4me3 domains, which contain a set of Treg signature genes. These results suggest that CXXC1 and Foxp3 collaborate to regulate Treg homeostasis and function by promoting Treg signature gene expression through maintaining H3K4me3 modification.

      Strengths:

      Epigenetic regulation of Treg cells has been a constantly evolving area of research. The current study revealed CXXC1 as a previously unidentified epigenetic regulator of Tregs. The strong phenotype of the knockout mouse supports the critical role CXXC1 plays in Treg cells. Mechanistically, the link between CXXC1 and the maintenance of broad H3K4me3 domains is also a novel finding.

      Weaknesses:

      The authors addressed the reviewer's critiques fully in the revised manuscript.

    3. Reviewer #2 (Public review):

      FOXP3 has been known to form diverse complexes with different transcription factors and enzymes responsible for epigenetic modifications, but how extracellular signals timely regulate FOXP3 complex dynamics remains to be fully understood. Histone H3K4 tri-methylation (H3K4me3) and CXXC finger protein 1 (CXXC1), which is required to regulate H3K4me3, also remain to be fully investigated in Treg cells. Here, Meng et al. performed a comprehensive analysis of H3K4me3 CUT&Tag assay on Treg cells and a comparison of the dataset with the FOXP3 ChIP-seq dataset revealed that FOXP3 could facilitate the regulation of target genes by promoting H3K4me3 deposition. Moreover, CXXC1-FOXP3 interaction is required for this regulation. They found that specific knockdown of Cxxc1 in Treg leads to spontaneous severe multi-organ inflammation in mice and that Cxxc1-deficient Treg exhibits enhanced activation and impaired suppression activity. In addition, they have also found that CXXC1 shares several binding sites with FOXP3 especially on Treg signature gene loci, which are necessary for maintaining homeostasis and identity of Treg cells.

      Comments on revisions:

      The authors have fully addressed the reviewers' comments and questions.

    4. Reviewer #3 (Public review):

      In the report entitled "CXXC-finger protein 1 associates with FOXP3 to stabilize homeostasis and suppressive functions of regulatory T cells", the authors demonstrated that Cxxc1-deletion in Treg cells leads to the development of severe inflammatory disease with impaired suppressive function. Mechanistically, CXXC1 interacts with Foxp3 and regulates the expression of key Treg signature genes by modulating H3K4me3 deposition. Their findings are interesting and significant.

      Comments on revisions:

      In the revised manuscript, the authors have responded well to all the concerns reviewers raised. The manuscript has further improved.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This work investigated the role of CXXC-finger protein 1 (CXXC1) in regulatory T cells. CXXC1-bound genomic regions largely overlap with Foxp3-bound regions and regions with H3K4me3 histone modifications in Treg cells. CXXC1 and Foxp3 interact with each other, as shown by co-immunoprecipitation. Mice with Treg-specific CXXC1 knockout (KO) succumb to lymphoproliferative diseases between 3 to 4 weeks of age, similar to Foxp3 KO mice. Although the immune suppression function of CXXC1 KO Treg is comparable to WT Treg in an in vitro assay, these KO Tregs failed to suppress autoimmune diseases such as EAE and colitis in Treg transfer models in vivo. This is partly due to the diminished survival of the KO Tregs after transfer. CXXC1 KO Tregs do not have an altered DNA methylation pattern; instead, they display weakened H3K4me3 modifications within the broad H3K4me3 domains, which contain a set of Treg signature genes. These results suggest that CXXC1 and Foxp3 collaborate to regulate Treg homeostasis and function by promoting Treg signature gene expression through maintaining H3K4me3 modification.

      Strengths:

      Epigenetic regulation of Treg cells has been a constantly evolving area of research. The current study revealed CXXC1 as a previously unidentified epigenetic regulator of Tregs. The strong phenotype of the knockout mouse supports the critical role CXXC1 plays in Treg cells. Mechanistically, the link between CXXC1 and the maintenance of broad H3K4me3 domains is also a novel finding.

      Weaknesses:

      (1) It is not clear why the authors chose to compare H3K4me3 and H3K27me3 enriched genomic regions. There are other histone modifications associated with transcription activation or repression. Please provide justification.

      Thank you for highlighting this important point. We chose to focus on H3K4me3 and H3K27me3 enriched genomic regions because these histone modifications are well-characterized markers of transcriptional activation and repression, respectively. H3K4me3 is predominantly associated with active promoters, while H3K27me3 marks repressed chromatin states, particularly in the context of gene regulation at promoters. This duality provides a robust framework for investigating the balance between transcriptional activation and repression in Treg cells. While histone acetylation, such as H3K27ac, is linked to enhancer activity and transcriptional elongation, our focus was on promoter-level regulation, where H3K4me3 and H3K27me3 are most relevant. Although other histone modifications could provide additional insights, we chose to focus on these two to maintain clarity and feasibility in our analysis. We have revised the text accordingly; please refer to Page 18, lines 353-356.

      (2) It is not clear what separates Clusters 1 and 3 in Figure 1C. It seems they share the same features.

      We apologize for not describing these clusters clearly. Clusters 1 and 3 both belong to the H3K4me3-only group, with H3K4me3 enrichment and gene expression levels being higher in Cluster 1. We initially set out to divide the promoters into four categories: H3K4me3 only, H3K27me3 only, H3K4me3-H3K27me3 co-occupied, and None. In practice, however, we could not distinguish an H3K4me3-H3K27me3 co-occupied group. Instead, we obtained two H3K4me3-only categories, with Cluster 1 showing higher H3K4me3 enrichment and gene expression levels.

      (3) The claim, "These observations support the hypothesis that FOXP3 primarily functions as an activator by promoting H3K4me3 deposition in Treg cells." (line 344), seems to be a bit of an overstatement. Foxp3 certainly can promote transcription in ways other than promoting H3K4me3 deposition, and it also can repress gene transcription without affecting H3K27me3 deposition. Therefore, it is not justified to claim that promoting H3K4me3 deposition is Foxp3's primary function.

      Thank you for your insightful feedback. We agree that the statement in line 344 may have overstated the role of FOXP3 in promoting H3K4me3 deposition as its primary function. As you pointed out, FOXP3 is indeed a multifaceted transcription factor that regulates gene expression through various mechanisms. It can promote transcription independent of H3K4me3 deposition, as well as repress transcription without directly influencing H3K27me3 levels.

      To more accurately reflect the broader regulatory functions of FOXP3, we have revised the manuscript. The updated text (Page 19, lines 385-388) now reads:

      "These findings collectively support the conclusion that FOXP3 contributes to transcriptional activation in Treg cells by promoting H3K4me3 deposition at target loci, while also regulating gene expression directly or indirectly through other epigenetic modifications."

      (4) For the in vitro suppression assay in Figure S4C, and the Treg transfer EAE and colitis experiments in Figure 4, the Tregs should be isolated from Cxxc1 fl/fl x Foxp3 cre/wt female heterozygous mice instead of Cxxc1 fl/fl x Foxp3 cre/cre (or cre/Y) mice. Tregs from the homozygous KO mice are already activated by the lymphoproliferative environment and could have vastly different gene expression patterns and homeostatic features compared to resting Tregs. Therefore, it's not a fair comparison between these activated KO Tregs and resting WT Tregs.

      Thank you for raising this insightful point regarding the potential activation status of Treg cells in homozygous knockout mice. To address this concern, we performed additional experiments using Treg cells isolated from Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/fl</sup> (hereafter referred to as “het-KO”) female mice and their littermate controls, Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/+</sup> (referred to as “het-WT”) mice.

      The results of these new experiments are now included in the manuscript (Page 25, lines 507–509, Figure 6E and Figure S6A-E):

      (1) In the in vitro suppression assay, Treg cells from het-KO mice exhibited reduced suppressive function compared to het-WT Treg cells. This finding underscores the intrinsic defect in Treg cell suppressive capacity attributable to the loss of one Cxxc1 allele.

      (2) In the experimental autoimmune encephalomyelitis (EAE) model, Treg cells isolated from het-KO mice also demonstrated impaired suppressive function.

      (5) The manuscript didn't provide a potential mechanism for how CXXC1 strengthens broad H3K4me3-modified genomic regions. The authors should perform Foxp3 ChIP-seq or CUT&Tag with WT and Cxxc1 cKO Tregs to determine whether CXXC1 deletion changes Foxp3's binding pattern in Treg cells.

      Thank you for raising this important point. To address your suggestion, we performed CUT&Tag experiments and found that Cxxc1 deletion does not alter FOXP3 binding patterns in Treg cells. Most FOXP3-bound regions in WT Treg cells were similarly enriched in KO Treg cells, indicating that Cxxc1 deficiency does not impair FOXP3’s DNA-binding ability. These results have been added to the revised manuscript (Page 28, lines 567-575, Figure S8A-B) and are further discussed in the Discussion (Pages 28-29, lines 581-587).

      Reviewer #2 (Public review):

      FOXP3 has been known to form diverse complexes with different transcription factors and enzymes responsible for epigenetic modifications, but how extracellular signals timely regulate FOXP3 complex dynamics remains to be fully understood. Histone H3K4 tri-methylation (H3K4me3) and CXXC finger protein 1 (CXXC1), which is required to regulate H3K4me3, also remain to be fully investigated in Treg cells. Here, Meng et al. performed a comprehensive analysis of H3K4me3 CUT&Tag assay on Treg cells and a comparison of the dataset with the FOXP3 ChIP-seq dataset revealed that FOXP3 could facilitate the regulation of target genes by promoting H3K4me3 deposition.

      Moreover, CXXC1-FOXP3 interaction is required for this regulation. They found that specific knockdown of Cxxc1 in Treg leads to spontaneous severe multi-organ inflammation in mice and that Cxxc1-deficient Treg exhibits enhanced activation and impaired suppression activity. In addition, they have also found that CXXC1 shares several binding sites with FOXP3 especially on Treg signature gene loci, which are necessary for maintaining homeostasis and identity of Treg cells.

      The findings of the current study are pretty intriguing, and it would be great if the authors could fully address the following comments to support these interesting findings.

      Major points:

      (1) There is insufficient evidence in the first part of the Results to support the conclusion that "FOXP3 functions as an activator by promoting H3K4Me3 deposition in Treg cells". The authors should compare the results for H3K4Me3 in FOXP3-negative conventional T cells to demonstrate that at these promoter loci, FOXP3 promotes H3K4Me3 deposition.

      Thank you for this insightful comment. We have already performed additional experiments comparing H3K4Me3 levels between FOXP3-positive Treg cells and FOXP3-negative conventional T cells (Tconv). Please refer to Page 18, lines 361-368, and Figure 1C and Figure S1C for the results. Our results show that H3K4Me3 abundance is higher at many Treg-specific gene loci in Treg cells compared to Tconv cells. This supports our conclusion that FOXP3 promotes H3K4Me3 deposition at these loci.

      (2) In Figure 3 F&G, the activation status and IFNγ production should be analyzed in Treg cells and Tconv cells separately rather than in total CD4+ T cells. Moreover, are there changes in autoantibodies and IgG and IgE levels in the serum of cKO mice?

      Thank you for your valuable suggestions. In response to your comment, we reanalyzed the data in Figures 3F and 3G to assess the activation status and IFN-γ production in Tconv cells. The updated analysis revealed that Cxxc1 deletion in Treg cells leads to increased activation and IFN-γ production in Tconv cells. Additionally, we corrected the analysis of IL-17A and IL-4 expression, which were upregulated in Tconv cells. These updated results are now included in the revised manuscript (Page 21, lines 429-431, Figure 3I and Figure S3E-F).

      Additionally, we examined autoantibodies and immunoglobulin levels in the serum of Cxxc1 cKO mice. Our data show a significant increase in serum IgG levels, accompanied by elevated IgG autoantibodies, indicating heightened autoimmune responses. In contrast, serum IgE levels remained largely unchanged. The results are detailed in the revised manuscript (Page 21, lines 421-423, Figure 3E and Figure S3B).

      (3) Why did Cxxc1-deficient Treg cells not show impaired suppression compared with WT Treg cells in the in vitro suppression assay, despite the reduced expression of Treg cell suppression-associated markers at the transcriptional level demonstrated in both scRNA-seq and bulk RNA-seq?

      Thank you for your thoughtful comment. The absence of impaired suppression in Cxxc1-deficient Treg cells from homozygous knockout (KO) mice during the in vitro suppression assay, despite the reduced expression of Treg-associated markers at the transcriptional level (as demonstrated by scRNA-seq), can likely be explained by the activated state of these Treg cells. In homozygous KO mice, Treg cells are already activated due to the lymphoproliferative environment, resulting in gene expression patterns that differ from those of resting Treg cells. This pre-activation may obscure the effect of Cxxc1 deletion on their suppressive function in vitro.

      To address this limitation, we used heterozygous Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/fl</sup> (het-KO) female mice, along with their littermate controls, Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/+</sup> (het-WT) mice. In these heterozygous mice, we observed an impairment in Treg cell suppressive function in vitro, which was accompanied by the downregulation of several key Treg-associated genes, as confirmed by RNA-Seq analysis.

      These updated findings, based on the use of het-KO mice, are now incorporated into the revised manuscript (Page 25, lines 507–509, Figure 6E).

      (4) Is there a disease in which Cxxc1 is expressed at low levels or absent in Treg cells? Is the same immunodeficiency phenotype present in patients as in mice?

      This is indeed a very meaningful and intriguing question, and we are equally interested in understanding whether low or absent Cxxc1 expression in Treg cells is associated with any human diseases. However, despite an extensive review of the literature and available data, we found no reports linking Cxxc1 deficiency in Treg cells to immunodeficiency phenotypes in patients comparable to those observed in mice.

      Reviewer #3 (Public review):

      In the report entitled "CXXC-finger protein 1 associates with FOXP3 to stabilize homeostasis and suppressive functions of regulatory T cells", the authors demonstrated that Cxxc1-deletion in Treg cells leads to the development of severe inflammatory disease with impaired suppressive function. Mechanistically, CXXC1 interacts with Foxp3 and regulates the expression of key Treg signature genes by modulating H3K4me3 deposition. Their findings are interesting and significant. However, there are several concerns regarding their analysis and conclusions.

      Major concerns:

      (1) Despite cKO mice showing an increase in Treg cells in the lymph nodes and Cxxc1-deficient Treg cells having normal suppressive function, the majority of cKO mice died within a month. What causes cKO mice to die from severe inflammation?

      Considering the results of Figures 4 and 5, a decrease in the Treg cell population due to their reduced proliferative capacity may be one of the causes. It would be informative to analyze the population of tissue Treg cells.

      Thank you for your insightful observation regarding the mortality of cKO mice despite increased Treg cells in lymph nodes and the normal suppressive function of Cxxc1-deficient Treg cells.

      As suggested, we hypothesized that the reduction of tissue-resident Treg cells could be a key factor. Additional experiments revealed a significant decrease in Treg cell populations in the small intestine lamina propria (LPL), liver, and lung of cKO mice. These findings highlight the critical role of tissue-resident Treg cells in preventing systemic inflammation.

      This reduction aligns with Figures 4 and 5, which demonstrate impaired proliferation and survival of Cxxc1-deficient Treg cells. Together, these defects lead to insufficient Treg populations in peripheral tissues, escalating localized inflammation into systemic immune dysregulation and early mortality.

      These additional results have been incorporated into the revised manuscript (Page 21, lines 424-427, Figure 3G and Figure S3C).

      (2) In Figure 5B, scRNA-seq analysis indicated that the Mki67+ Treg subset is comparable between WT and Cxxc1-deficient Treg cells. On the other hand, FACS analysis demonstrated that Cxxc1-deficient Treg shows less Ki-67 expression compared to WT in Figure 5I. The authors should explain this discrepancy.

      Thank you for pointing out the apparent discrepancy between the scRNA-seq and FACS analyses regarding Ki-67 expression in Cxxc1-deficient Treg cells.

      In Figure 5B, the scRNA-seq analysis identified the Mki67+ Treg subset as comparable between WT and Cxxc1-deficient Treg cells. This finding reflects the overall proportion of cells expressing Mki67 transcripts within the Treg population. In contrast, the FACS analysis in Figure 5I specifically measures Ki-67 protein levels, revealing reduced expression in Cxxc1-deficient Treg cells compared to WT.

      To resolve this discrepancy, we performed additional analyses of the scRNA-seq data to directly compare the expression levels of Mki67 mRNA between WT and Cxxc1-deficient Treg cells. The results revealed a consistent reduction in Mki67 transcript levels in Cxxc1-deficient Treg cells, aligning with the reduced Ki-67 protein levels observed by FACS.

      These new analyses have been included in the revised manuscript (Author response image 1) to clarify this point and demonstrate consistency between the scRNA-seq and FACS data.

      Author response image 1.

      Violin plots displaying the expression levels of Mki67 in T<sub>reg</sub> cells from Foxp3<sup>cre</sup> and Foxp3<sup>cre</sup>Cxxc1<sup>fl/fl</sup> mice.

      In addition, the authors concluded on line 441 that CXXC1 plays a crucial role in maintaining Treg cell stability. However, there appears to be no data on Treg stability. Which data represent the Treg stability?

      Thank you for your valuable comment. We agree that our wording in line 441 may have been too conclusive. Our data focus on the impact of Cxxc1 deficiency on Treg cell homeostasis and transcriptional regulation, rather than directly measuring Treg cell stability. Specifically, the downregulation of Treg-specific suppressive genes and upregulation of pro-inflammatory markers suggest a shift in Treg cell function, which points to disrupted homeostasis rather than stability.

      We have revised the manuscript to clarify that CXXC1 plays a crucial role in maintaining Treg cell function and homeostasis, rather than stability (Page 24, lines 489-491).

      (3) The authors found that Cxxc1-deficient Treg cells exhibit weaker H3K4me3 signals compared to WT in Figure 7. This result suggests that Cxxc1 regulates H3K4me3 modification via H3K4 methyltransferases in Treg cells. The authors should clarify which H3K4 methyltransferases contribute to the modulation of H3K4me3 deposition by Cxxc1 in Treg cells.

      We appreciate the reviewer’s insightful comment regarding the role of H3K4 methyltransferases in regulating H3K4me3 deposition by CXXC1 in Treg cells.

      CXXC1 has been reported to function as a non-catalytic component of the Set1/COMPASS complex, which includes the H3K4 methyltransferases SETD1A and SETD1B, key enzymes responsible for H3K4 trimethylation (1-4). Based on these findings, we propose that CXXC1 modulates H3K4me3 levels in Treg cells by interacting with the Set1/COMPASS complex and stabilizing its activity.

      These points are further discussed in the Discussion (Pages 30-31, lines 624-632).

      Furthermore, it would be important to investigate whether Cxxc1-deletion alters Foxp3 binding to target genes.

      Thank you for raising this important point. To address your suggestion, we performed CUT&Tag experiments and found that Cxxc1 deletion does not alter FOXP3 binding patterns in Treg cells. Most FOXP3-bound regions in WT Treg cells were similarly enriched in KO Treg cells, indicating that Cxxc1 deficiency does not impair FOXP3’s DNA-binding ability. These results have been added to the revised manuscript (Page 28, lines 567-575, Figure S8A-B) and are further discussed in the Discussion (Pages 28-29, lines 581-587).

      (4) In Figure 7, the authors concluded that CXXC1 promotes Treg cell homeostasis and function by preserving the H3K4me3 modification since Cxxc1-deficient Treg cells show lower H3K4me3 densities at the key Treg signature genes. Are these Cxxc1-deficient Treg cells derived from mosaic mice? If Cxxc1-deficient Treg cells are derived from cKO mice, the gene expression and H3K4me3 modification status are inconsistent because scRNA-seq analysis indicated that expression of these Treg signature genes was increased in Cxxc1-deficient Treg cells compared to WT (Figure 5F and G).

      Thank you for your insightful comment. To clarify, the Cxxc1-deficient Treg cells analyzed for H3K4me3 modifications in Figure 7 were derived from Cxxc1 conditional knockout (cKO) mice, not mosaic mice.

      Regarding the apparent inconsistency between reduced H3K4me3 levels and the increased expression of Treg signature genes observed in scRNA-seq analysis (Figure 5F and G), we believe this discrepancy can be attributed to distinct mechanisms regulating gene expression. H3K4me3 is an epigenetic mark that facilitates chromatin accessibility and transcriptional regulation, reflecting upstream chromatin dynamics. However, gene expression levels are influenced by a combination of factors, including transcriptional activators, downstream compensatory mechanisms, and the inflammatory environment in cKO mice.

      The upregulation of Treg signature genes in scRNA-seq data likely reflects an activated or pro-inflammatory state of Cxxc1-deficient Treg cells in response to systemic inflammation, as previously described in the manuscript. This contrasts with the intrinsic reduction in H3K4me3 levels at these loci, indicating a loss of epigenetic regulation by CXXC1.

      To further support this interpretation, RNA-seq analysis of Treg cells from Foxp3<sup>Cre/+</sup> Cxxc1<sup>fl/fl</sup> (“het-KO”) and their littermate Foxp3<sup>Cre/+</sup> Cxxc1<sup>fl/+</sup> (“het-WT”) female mice (Figure S6C) revealed a significant reduction in key Treg signature genes such as Icos, Ctla4, Tnfrsf18, and Nt5e in het-KO Treg cells. These results align with the diminished H3K4me3 modifications observed in cKO Treg cells, further underscoring the role of CXXC1 as an epigenetic regulator.

      In summary, while the gene expression changes observed in scRNA-seq may reflect adaptive responses to inflammation, the reduced H3K4me3 modifications directly highlight the critical role of CXXC1 in maintaining the epigenetic landscape essential for Treg cell homeostasis and function.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In Figure 7E, the y-axis scale for H3K4me3 peaks at the Ctla4 locus should be consistent between WT and cKO samples.

      We thank the reviewer for pointing out the inconsistency in the y-axis scale for the H3K4me3 peaks at the Ctla4 locus in Figure 7E. We have carefully revised the figure to ensure that the y-axis scale is now consistent between the WT and cKO samples.

      We appreciate the reviewer’s attention to this detail, as it enhances the rigor of the data presentation. Please find the updated Figure 7E in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      In lines 455 and 466, the name of Treg signature markers validated by flow cytometry should be written as protein name and capitalized.

      Thank you for pointing this out. We have carefully reviewed lines 455 and 466 and have revised the text to ensure that the Treg signature markers validated by flow cytometry are referred to using their protein names, with proper capitalization.

      Reviewer #3 (Recommendations for the authors):

      (1) On line 431, "Cxxc1-deficient cells" should be "Cxxc1-deficient Treg cells".

      We thank the reviewer for highlighting this oversight. On line 431, we have revised "Cxxc1-deficient cells" to "Cxxc1-deficient Treg cells" to provide a more accurate and specific description. We appreciate the reviewer's attention to detail, as this correction improves the precision of our manuscript.

      (2) In Figure 4H, negative values should be removed from the y-axis.

      Thank you for your observation. We have revised Figure 4H to remove the negative values from the y-axis, as requested. This adjustment ensures a more accurate and meaningful representation of the data.

      (3) It is better to provide the lists of overlapping genes in Figure 7C.

      Thank you for your suggestion. We agree that providing the lists of overlapping genes in Figure 7C would enhance the clarity and reproducibility of the results. We have now included the gene lists as supplementary information (Supplementary Table 3) accompanying Figure 7C.

      (1) Lee, J. H. & Skalnik, D. G. CpG-binding protein (CXXC finger protein 1) is a component of the mammalian set1 histone H3-Lys4 methyltransferase complex, the analogue of the yeast Set1/COMPASS complex. Journal of Biological Chemistry 280, 41725-41731, doi:10.1074/jbc.M508312200 (2005).

      (2) Thomson, J. P., Skene, P. J., Selfridge, J., Clouaire, T., Guy, J., Webb, S., Kerr, A. R. W., Deaton, A., Andrews, R., James, K. D., Turner, D. J., Illingworth, R. & Bird, A. CpG islands influence chromatin structure via the CpG-binding protein Cfp1. Nature 464, 1082-U1162, doi:10.1038/nature08924 (2010).

      (3) Shilatifard, A. in Annual Review of Biochemistry, Vol. 81 (ed R. D. Kornberg) 65-95 (2012).

      (4) Brown, D. A., Di Cerbo, V., Feldmann, A., Ahn, J., Ito, S., Blackledge, N. P., Nakayama, M., McClellan, M., Dimitrova, E., Turberfield, A. H., Long, H. K., King, H. W., Kriaucionis, S., Schermelleh, L., Kutateladze, T. G., Koseki, H. & Klose, R. J. The SET1 Complex Selects Actively Transcribed Target Genes via Multivalent Interaction with CpG Island Chromatin. Cell Reports 20, 2313-2327, doi:10.1016/j.celrep.2017.08.030 (2017).

    1. eLife Assessment

      The authors use single-molecule imaging and in vivo loop-capture genomic approaches to investigate estrogen-mediated enhancer-target gene activation in human cancer cells. These potentially important results suggest that ER-alpha can, after a temporal delay, activate a non-target gene, TFF3, which is in proximity to the main target gene TFF1, even though the estrogen-responsive enhancer does not loop with the TFF3 promoter. To explain these results, the authors invoke a transcriptional condensate model. The claims of a temporal delay and of effects of target gene transcription on non-target gene expression are supported by solid evidence, but there is no direct evidence for the role of a condensate in mediating this effect. The reviewers appreciate that the authors have done a lot of work to strengthen the study. This work will be of interest to those studying transcriptional gene regulation and hormone-aggravated cancers.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript by Bohra et al. describes the indirect effects of ligand-dependent gene activation on neighboring non-target genes. The authors utilized single-molecule RNA-FISH (targeting both mature and intronic regions), 4C-seq, and enhancer deletions to demonstrate that the non-enhancer-targeted gene TFF3, located in the same TAD as the target gene TFF1, alters its expression when TFF1 expression declines at the end of the estrogen signaling peak. Since the enhancer does not loop with TFF3, the authors conclude that mechanisms other than estrogen receptor or enhancer-driven induction are responsible for TFF3 expression. Moreover, ERα intensity correlations show that both high and low levels of ERα are unfavorable for TFF1 expression. The ERα level correlations are further supported by overexpression of GFP-ERα. The authors conclude that the transcriptional machinery used by TFF1 for its acute activation can negatively impact TFF3 at the peak of signaling, but once the condensate dissolves, TFF3 benefits from it for its low expression.

      Strengths:

      The findings are indeed intriguing. The authors have maintained appropriate experimental controls, and their conclusions are well-supported by the data.

      Weaknesses:

      There are some major and minor concerns related to the approach, data presentation, and discussion, but the authors have greatly improved the manuscript during revision.

      Comments on latest version:

      The authors have done a lot of work for the revision. The manuscript has been greatly improved.

    3. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Bohra et al. measure the effects of inducing estrogen-responsive gene expression on nearby genes, using a TAD containing the genes TFF1 and TFF3 as a model. The authors propose that there is a sort of competition for transcriptional machinery between TFF1 (estrogen responsive) and TFF3 (not responsive), such that when TFF1 is activated and machinery is recruited, TFF3 is activated after a time delay. The authors attribute this time delay to transcriptional machinery that was sequestered at TFF1 becoming available to the proximal TFF3 locus. The authors demonstrate through deletion that this activation is not dependent on contact with the TFF1 enhancer; instead, they conclude that it is dependent on a phase-separated condensate which can sequester transcriptional machinery. Although the manuscript reports an interesting observation that there is a dose dependence and time delay in the expression of TFF1 relative to TFF3, there is much room for improvement in the analysis and reporting of the data. Most importantly, there is no direct test of condensate formation at the locus in the context of this study: i.e., dissolution upon the enhancer deletion, decay in a temporal manner, and dependence of TFF1 expression on condensate formation. Using 1,6-hexanediol is not adequate to draw conclusions on the effect of condensates on a specific gene's activity, given current knowledge of its non-specificity and multitude of indirect effects. Thus, in my opinion, the major claim that the time-delayed expression of TFF3 depends on condensates is not supported by the current data.

      Strengths:

      The dependence of TFF1 expression on a single enhancer and the temporal delay in TFF3 expression are very interesting findings.

      The non-linear dependence of TFF1 and TFF3 expression on ER concentration is very interesting, with potentially broader implications.

      The combined use of smFISH, enhancer deletion, and 4C to build a coherent model is a good approach.

      Weaknesses:

      There is no direct observation of a condensate at the TFF1 and TFF3 locus, of how this condensate changes over time after E2 treatment or upon enhancer deletion, or of whether transcriptional machinery is indeed concentrated within it, nor is there direct support for the other claims on condensate function and formation made in the manuscript. The use of 1,6-hexanediol is not appropriate to test this idea given how broadly it acts.

      Comments on latest version:

      I don't think the response to Reviewer 2's comment on LLPS condensates at TFF1 is adequate, and given that this point is essential to the claims of the manuscript, it must be addressed. Namely, the data from Saravanan et al., 2020 actually suggest that condensate formation at the locus is not very predictive and barely enriched over random spots. The claims in the manuscript that the condensate is responsible for sequestering transcriptional machinery are quite strong and are the crux of the current model. To continue to make this claim (which I don't think is necessary since there are other possible models), the authors must test whether the condensate at this locus (1) shows time-dependent behavior, (2) is absent or weakened at the locus in cells that show high TFF3 expression, and (3) is indeed enriched for transcriptional machinery when TFF1 peaks. The use of 1,6-hexanediol is not appropriate, as pointed out by Reviewer 2, and is no longer considered an appropriate experiment by many, as the whole notion of LLPS forming nuclear condensates is now under question. Such condensates can form through a variety of mechanisms, as reviewed for example by Mittag and Pappu (A conceptual framework for understanding phase separation and addressing open questions and challenges, Molecular Cell, 2022). Furthermore, given the distance between TFF1 and TFF3, it is hard to imagine how a condensate that concentrates machinery in a non-stoichiometric manner would not boost expression of both genes rather than being specific to one. There must be another mechanism in my opinion.

      I would recommend the authors remove this aspect of their manuscript/model and simply report their interesting findings that are actually supported by data: The temporal delay of TFF3 expression, the dependence on ER concentration, and the enhancer dependence.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Summary:

      The manuscript by Bohra et al. describes the indirect effects of ligand-dependent gene activation on neighboring non-target genes. The authors utilized single-molecule RNA-FISH (targeting both mature and intronic regions), 4C-seq, and enhancer deletions to demonstrate that the non-enhancer-targeted gene TFF3, located in the same TAD as the target gene TFF1, alters its expression when TFF1 expression declines at the end of the estrogen signaling peak. Since the enhancer does not loop with TFF3, the authors conclude that mechanisms other than estrogen receptor or enhancer-driven induction are responsible for TFF3 expression. Moreover, ERα intensity correlations show that both high and low levels of ERα are unfavorable for TFF1 expression. The ERα level correlations are further supported by overexpression of GFP-ERα. The authors conclude that the transcriptional machinery used by TFF1 for its acute activation can negatively impact TFF3 at the peak of signaling, but once the condensate dissolves, TFF3 benefits from it for its low expression.

      Strengths:

      The findings are indeed intriguing. The authors have maintained appropriate experimental controls, and their conclusions are well-supported by the data.

      Weaknesses:

      There are some major and minor concerns related to the approach, data presentation, and discussion, but I think they can be fixed with more effort.

      We thank the reviewer for their positive comments on the paper. We have addressed all their specific recommendations below.  

      The enhancer deletion reveals the absolute reliance of TFF1 on its enhancer for its expression. The authors should elaborate more on this, as it is an important finding.

      We thank the reviewer for the comment. We have now added a more detailed discussion of the enhancer requirement for TFF1 expression in the revised manuscript (lines 368-385).

      In Fig. 1, TFF3 expression is shown to be induced upon E2 signaling through qRT-PCR, while smFISH does not display a similar pattern. The authors attribute this discrepancy to the overall low expression of TFF3. In my opinion, this argument could be further supported by relevant literature, if available. Additionally, does GRO-seq data reveal any changes in TFF3 expression following estrogen stimulation? The GRO-seq track shown in Fig.1 should be adjusted to TFF3 expression to appreciate its expression changes.

      We have now included a browser shot of the TFF3 region showing the GRO-seq signal over the E2 time course (Fig. S1C). We observed increased transcription towards the 3’ end of the TFF3 gene body at 3h, which corroborates the smFISH data. The relative changes in TFF3 expression measured by qRT-PCR and by smFISH for intronic transcripts are somewhat different. We speculate that measurement biases that depend on PCR amplification could be larger for genes expressed at low levels, and that smFISH using intronic probes may be a more sensitive assay for detecting such changes.

      Since the mutually exclusive relationship between TFF1 and TFF3 is based on snap shots in fixed cells, can authors comment on whether the same cell that expresses TFF1 at 1h, expresses TFF3 at 3h? Perhaps, the calculations taking total number of cells that express these genes at 1 and 3h would be useful.

      As pointed out by the reviewer, since these are fixed cells, we cannot comment on the fate of the same cell at two time points. To address this limitation, future work could employ cells with endogenous tags for TFF1 and TFF3 and utilize live-cell imaging techniques. In a fixed-cell assay, as the reviewer suggests, it can be investigated whether the fraction of cells showing high TFF3 expression at 3h is similar to the fraction showing high TFF1 expression at 1h. To quantify these fractions, we plotted the fraction of cells showing high TFF1 and TFF3 expression at 1h and 3h. We identify truly high-expressing cells by taking the mean plus one standard deviation (for single-cell-level data) at E2-1h as the threshold for TFF1 (80 and above transcript counts) and the mean plus one standard deviation (for single-cell-level data) at E2-3h as the threshold for TFF3 (36 and above transcript counts). The fraction with high TFF1 expression at 1h (12.06 ± 2.1) is indeed comparable to that with high TFF3 expression at 3h (12.50 ± 2.0) (Fig. 2C and Author response image 1). We should note that if the transcript counts were normally distributed, a predetermined fraction would be expected to be above these thresholds, and comparable fractions could arise just from the underlying statistics. But in our experiments, this is unlikely to be the case given the many outliers that affect both the mean and the standard deviation, and the lack of normality and high dispersion in the single-cell distributions. Of course, despite the fractions being comparable, we cannot be certain that it is the same set of cells that goes from high expression of TFF1 to high expression of TFF3, but that is definitely a possibility. We thank the reviewer for pointing out this comparison.

      Author response image 1.

      The graph represents the percent of cells that show high expression of TFF1 and TFF3 at 1h and 3h post E2 signaling. The threshold was determined by pooling absolute RNA counts from 650 analyzed cells (as in Fig. 2C). The mean and standard deviation over the single-cell data were calculated. The mean plus one standard deviation was used to set the threshold for identifying high-expressing cells. For TFF1, which is maximally expressed at 1h, the threshold used was 80. For TFF3, which is maximally expressed at 3h, the threshold used was 36. The fractions of cells with counts above 80 and 36 for TFF1 and TFF3, respectively, were calculated from three different repeats. The mean of means and the standard deviations from the three experiments are plotted here.
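      The thresholding described in this legend can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function names and the toy transcript counts are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the response does not say which estimator was used. The last helper gives the fraction expected above mean plus one standard deviation for normally distributed counts (about 15.9%), which is the baseline the response alludes to.

```python
import math
import statistics


def high_expression_threshold(pooled_counts):
    """Mean plus one standard deviation of pooled single-cell transcript counts.

    Assumption: sample (n-1) standard deviation; the source does not specify.
    """
    return statistics.mean(pooled_counts) + statistics.stdev(pooled_counts)


def percent_high(counts, threshold):
    """Percentage of cells whose transcript count is at or above the threshold."""
    return 100.0 * sum(c >= threshold for c in counts) / len(counts)


def summarize_replicates(replicates, threshold):
    """Mean and SD, across replicates, of the percentage of high-expressing cells."""
    percents = [percent_high(r, threshold) for r in replicates]
    return statistics.mean(percents), statistics.stdev(percents)


def normal_tail_percent(k=1.0):
    """Percentage of a normal distribution lying above mean + k * SD."""
    return 100.0 * 0.5 * math.erfc(k / math.sqrt(2.0))


# Hypothetical toy data: three replicates of per-cell transcript counts,
# each with one outlier cell, to mimic a skewed single-cell distribution.
toy_replicates = [
    [2, 5, 8, 10, 120],
    [1, 4, 9, 11, 95],
    [3, 6, 7, 12, 150],
]
pooled = [count for replicate in toy_replicates for count in replicate]
toy_threshold = high_expression_threshold(pooled)
toy_mean_percent, toy_sd_percent = summarize_replicates(toy_replicates, toy_threshold)
```

      Comparing the observed percentage against `normal_tail_percent()` is one simple way to check whether a "comparable fraction" could be a by-product of the thresholding rule itself, under the assumption of normality that the response argues does not hold for these data.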

      The authors conclude that TFF3 is not directly regulated by the enhancer or the estrogen receptor. Does ERα bind the TFF3 promoter?

      The ERα ChIP-seq performed at 1h and 3h of signaling suggests that the TFF3 promoter is not bound by ERα, as shown in Fig. 1B and Fig. S1B. However, one peak upstream of the TFF1 promoter is visible, and it is lost at 3h.

      Minor comments:

      The figures would benefit from resizing of the panels. There is very little space between the panels.

      We have now resized the figures in the revised manuscript.

      The discussion section could include an extrapolation on the relationship between ERα concentration and transcriptional regulation. Given that ERα levels have been shown to play a critical role in breast cancer, exploring how varying concentrations of ERα affect gene expression, including the differential regulation of target and non-target genes, would provide valuable insights into the broader implications of this study.

      This is a very important point that was missing from the manuscript. We have included this in the discussion in the revised manuscript (lines 426-430).

      Reviewer #2:

      Summary:

      In this manuscript by Bohra et al., the authors use the well-established estrogen response in MCF7 cells to interrogate the role of genome architecture, enhancers, and estrogen receptor concentration in transcriptional regulation. They propose there is competition between the genes TFF1 and TFF3 which is mediated by transcriptional condensates. This reviewer does not find these claims persuasive as presented. Moreover, the results are not placed in the context of current knowledge.

      Strengths:

      A high level of ERalpha expression seems to diminish the transcriptional response. Thus, the results in Fig. 4 offer potential insight into ER-mediated transcription. Yet this observation is not pursued in great depth, for example with mutagenesis of ERalpha. Moreover, this phenomenon, which falls under the general description of non-monotonic dose response, is treated at great depth in the literature (e.g. PMID: 22419778). For example, the result the authors describe in Fig. 4 has been reported and in fact mathematically modeled in PMID 23134774. One possible avenue for improving this paper would be to dig into this result at the single-cell level using deletion mutants of ERalpha or by perturbing co-activators.

      We thank the reviewer for pointing us to the relevant literature on our observation, which will enhance the manuscript. We have discussed these findings in relation to ours in the discussion section (lines 400-413). We thank the reviewer for the insight on non-monotonic behavior.

      Weaknesses:

      There are concerns with the sm-RNA FISH experiments. It is highly unusual to see so much intronic signal away from the site of transcription (Fig. 2) (PMID: 27932455, 30554876), which suggests to me the authors are carrying out incorrect thresholding or have a substantial amount of labelling background. The Cote paper cited in the manuscript is likewise inconsistent with their findings and is cited in a misleading manner: they see splicing within a very small region away from the site of transcription. 

      We thank the reviewer for this comment, and apologize if they feel we misrepresented the argument from Coté et al. This has now been rectified in the manuscript. However, we do not agree that the intronic signals away from the site of transcription are an artefact. First, the images presented here are just representative 2D projections of 3D Z-stacks, whereas the full 3D stack is used for spot counting with a widely used algorithm that reports spot counts that are constant over a wide range of thresholds (Raj et al., 2008). The veracity of the automated counts was first verified by comparison to manual counts. Even in the 2D representations, the extragenic intronic signals show up at similar thresholds to the transcription sites.

      The signal is not non-specific background labeling, for the following reasons:

      • To further support the time-course smFISH data and its interpretation without depending on the dispersed intronic signal, we have analyzed the number of alleles firing (sites of transcription) at a given time in a cell under the three conditions. We counted the sites of transcription in a given cell and calculated the percentage of cells showing 1, 2, 3, 4 or >4 sites. We see that the percentage of cells showing a single site of transcription for TFF1 is very high in uninduced cells and decreases at 1h. At 1h, the percentage of cells showing 2, 3 and 4 sites of transcription increases, and it goes down again at 3h (Author response image 2A). This agrees with the interpretation made from mean intronic counts away from the site of transcription. Similarly, for TFF3, the number of cells showing 2, 3 and 4 sites of transcription increases slightly at 3h compared to uninduced cells and 1h (Author response image 2B). We can also see that several cells have no alleles firing at a given time, as quantified in the graphs on the right showing the total fraction of cells with zero versus non-zero alleles firing (Author response image 2A-B). A non-specific signal would be present in all cells.

      • There is literature on post-transcriptional splicing of RNA beyond our work, which suggests that intronic signal can be found at relatively large distances away from the site of transcription. Waks et al. showed that some fraction of unspliced RNA could be observed up to 6-10 microns away from the site of transcription, suggesting that there can be a delay between transcription and (alternative) splicing (Waks et al., 2011). Pan-nuclear dispersed intronic signals can arise because more than one allele can fire at a time in different nuclear locations. The spread of intronic transcripts in our images is also limited in cells in which only 1 allele is firing at E2-1 hour (Author response image 2C) or in uninduced cells (Author response image 2D). Furthermore, Coté et al. discuss that “Of note, we see that increased transcription level correlates with intron dispersal, suggesting that the percentage of splicing occurring away from the transcription site is regulated by transcription level for at least some introns. This may explain why we observe posttranscriptional splicing of all genes we measured, as all were highly expressed.” This is in line with our interpretation that intron signal dispersal can occur in the case of posttranscriptional splicing (Coté et al., 2023). Additionally, other studies have suggested that transcripts in cells do not necessarily undergo co-transcriptional splicing, which leads us to conclude that intronic signal can be found farther away from the site of transcription. Coulon et al. showed that splicing can occur after transcript release from the site and suggested that no strict checkpoint exists to ensure intron removal before release, which results in splicing and release being kinetically uncoupled from each other (Coulon et al., 2014). Similarly, using live-cell imaging, it was shown that splicing is not always coupled with transcription, and this could depend on the nature and structural features of the transcript (such as blockage of the polypyrimidine tract, which results in delayed recognition) (Vargas et al., 2011). Drexler et al. showed that, as opposed to Drosophila transcripts, which are shorter, in mammalian cells splicing of the terminal intron can occur post-transcriptionally (Drexler et al., 2020). Using RNA polymerase II ChIP-Seq time-course data from ERα activation in MCF-7 cells, Honkela et al. showed that a large number of genes can show significant delays between the completion of transcription and mRNA production (Honkela et al., 2015). This was attributed to faster transcription of shorter genes, suggesting that rapid completion of transcription on shorter genes can lead to splicing-associated delays (Honkela et al., 2015). More recently, comparisons of nascent and mature RNA levels suggested a time lapse between transcription and splicing for the genes that are early responders during signaling (Zambrano et al., 2020). The presence of significant numbers of TFF1 nascent RNA in the nucleus in our data corroborates the above observations.

      • Uniform intensities across many transcripts suggest that these are true signals arising from RNA molecules, which would not be the case for non-specific background signal (Author response image 2E).

      • Splicing occurs in the nucleus, and intron-containing pre-transcripts should be nuclear localized. Thus, intronic signals should remain localized to the nucleus, unlike the mature mRNA, which translocates to the cytoplasm after processing, so that exonic signals can be found both in the nucleus and in the cytoplasm. In keeping with this, we observe no signal in the cytoplasm for the intronic probes; it remains localized within the nucleus as expected, as can be seen in Author response image 2F, while exonic signals are observed in both compartments. This suggests to us that the signal is coming from true pre-transcripts. There is no reason for non-specific background labelling to remain restricted to the nucleus.

      • We observe that the mean intronic label counts for both genes, TFF1 and TFF3, increase upon E2 induction compared to the uninduced condition (Fig. 2B). Similarly, the mean intronic counts for both genes are drastically reduced in the TFF1-enhancer-deleted cells (Fig. 3C, D). This change in the number of intronic signals specifically upon induction and enhancer deletion suggests that the signal is not an artefact and arises from true nascent transcripts that are sensitive to stimulus or enhancer deletion.

      • We expect colocalization of intronic signals with exonic signals in the nucleus, while there can be exonic signals that do not colocalize with intronic signals, representing more mature mRNA. Indeed, we observe a clear colocalization between the intronic and exonic signals in the nucleus, while exonic signals can occur independently of intronic signals both in the nucleus and in the cytoplasm. This clearly demonstrates that the intronic signals in our experiments are specific and not simply background labelling (Author response image 2G).

      These studies and the arguments above lead us to conclude that the presence of intronic transcripts in the nucleus, away from the site of transcription, is not an artefact. We hope the reviewer will agree with us. These analyses have now been included in the manuscript as Supplementary Figure 6 and have been added at lines 106-111, 201-204, 215-217 and 231-235. We thank the reviewer for raising this important point.
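
      The allele-count tally described in the first point above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code behind Author response image 2: the per-cell site counts are made up, and the function names `site_distribution` and `mean_and_sem` are hypothetical.

```python
import math
import statistics
from collections import Counter

LABELS = ["1", "2", "3", "4", ">4"]


def site_distribution(sites_per_cell):
    """Bin cells by number of transcription sites.

    Returns (percent, fractions): `percent` gives the percentage of firing
    cells (at least one site) in each bin; `fractions` gives the share of all
    cells with zero versus non-zero sites."""
    firing = [n for n in sites_per_cell if n > 0]
    bins = Counter(str(n) if n <= 4 else ">4" for n in firing)
    percent = {
        label: (100.0 * bins[label] / len(firing) if firing else 0.0)
        for label in LABELS
    }
    total = len(sites_per_cell)
    n_zero = total - len(firing)
    fractions = {
        "zero": n_zero / total if total else 0.0,
        "non-zero": len(firing) / total if total else 0.0,
    }
    return percent, fractions


def mean_and_sem(values):
    """Mean and standard error of the mean across experimental repeats."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
    return mean, sem


# Hypothetical site counts for ten cells from one repeat.
cells = [0, 0, 1, 1, 1, 1, 2, 2, 3, 5]
percent, fractions = site_distribution(cells)
print(percent, fractions)

# Hypothetical single-site percentages from three repeats (N=3).
print(mean_and_sem([50.0, 60.0, 70.0]))
```

      Cells with no firing allele are excluded from the percentages and reported separately, which mirrors how the caption of Author response image 2 describes the two kinds of graph.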

      Author response image 2.

      Dynamic induction and RNA localization of TFF1 and TFF3 transcription across cell populations using smRNA FISH. A. Bar graph depicting the percentage of cells with 1, 2, 3, 4, or greater than 4 sites of transcription for TFF1 (left) is shown. The graph shows the mean of means from different repeats of the experiment, and error bars denote SEM (n>200, N=3). Only the cells with at least one allele firing were counted and cells with no alleles were not included in this. The graph on the right shows the number of cells with zero or non-zero number of alleles firing. B. Bar graph depicting the percentage of cells with 1, 2, 3, 4 or greater than 4 sites of transcription for TFF3 (left) is shown. The graph shows the mean of means from different repeats of the experiment, and error bars denote SEM (n>200, N=3). Only the cells with at least one allele firing were counted and cells with no alleles were not included in this. The graph in the middle shows the number of cells with 2, 3, 4 or greater than 4 sites of transcription for TFF3. The graph on the right shows the number of cells with zero or non-zero number of alleles firing. C. Images from single molecule RNA FISH experiment showing transcripts for InTFF1 in cells induced for 1 hour with E2. The image shows that when a single allele of TFF1 is firing, the transcripts show a more spatially restricted localisation. The scale bar is 5 microns. D. Images from single molecule RNA FISH experiment showing transcripts for InTFF1 in uninduced cells. The image shows that when a single allele of TFF1 is firing and transcription is low, the transcripts show a more spatially restricted localisation. The scale bar is 5 microns. E. Line profile through several transcripts in the nucleus shows uniform and similar intensities, indicating that these are true signals. F. 60X representative images from a single molecule RNA FISH experiment showing transcripts for InTFF1 and ExTFF1 (top) and InTFF3 and ExTFF3 (bottom). The image shows that there is no intronic signal in the cytoplasm, while exonic signals can be found both in the nucleus and the cytoplasm. The scale bar is 5 microns. G. 60X representative images from single molecule RNA FISH experiment showing transcripts for InTFF1 and ExTFF1. The image shows that all intronic signals are colocalized with exonic signals, but, as expected, not all exonic signals are colocalized with intronic signals, representing more mature mRNA. The scale bar is 5 microns.

      One substantial way to improve the manuscript is to take a careful look at previous single cell analysis of the estrogen response, which in some cases has been done on the exact same genes (PMID: 29476006, 35081348, 30554876, 31930333). In some of these cases, the authors reach different conclusions than those presented in the present manuscript. Likewise, there have been more than a few studies that have characterized these enhancers (the first one I know of is: PMID 18728018). Also, Oh et al. 2021 (cited in the manuscript) did show an interaction between TFF1e and TFF3, which seems to contradict the conclusion from Fig. 3. In summary, the results of this paper are not in dialogue with the field, which is a major shortcoming. 

      We thank the reviewer for pointing out these important studies. The studies from Prof. Larson's group are particularly insightful (Rodriguez et al., 2019). We have now included these in the discussion (lines 106-111 and 420-424), where we discuss the differences and similarities between our results and those of Larson’s and Mancini’s groups (Patange et al., 2022; Stossi et al., 2020).

      The 4C-Seq data from Oh et al. 2021 are fully consistent with our observation in Fig. 3: they also observed little to no interaction between TFF1e and TFF3p in WT cells, and TFF1e became engaged with TFF3p only upon TFF1p deletion. In agreement with this, we also observe little to no interaction between TFF1e and TFF3p in WT cells (Fig. 3A). This is also consistent with our model of competition for resources between these two genes. Oh et al. show an interaction between TFF1e and TFF3 when the TFF1 promoter is deleted, showing that when the primary promoter is not available the enhancer is retargeted to the next available gene (Oh et al., 2021). They do not show that TFF1e and TFF3 interact in WT cells or at any time point of E2 signalling.

      In the opinion of this reviewer, there are few - if any - experiments to interrogate the existence of LLPS for diffraction-limited spots such as those associated with transcription. This difficulty is a general problem with the field and not specific to the present manuscript. For example, transient binding will also appear as a dynamic 'spot' in the nucleus, independently of any higher-order interactions. As for Fig. 5, I don't think treating cells with 1,6 hexanediol is any longer considered a credible experiment. For example, there are profound effects on chromatin independent of changes in LLPS (PMID: 33536240).  

      We are cognizant of and appreciate the limitations pointed out by the reviewer. We and others have previously shown that ERα forms condensates on the TFF1 chromatin region using an ImmunoFISH assay (Saravanan et al., 2020). The data below show the relative mean ERα intensity on TFF1 FISH spots and random regions, clearly showing the appearance of the condensate at the TFF1 site. Further, deletion of TFF1e reduces the size of this condensate. Thus, we expect that these ERα condensates are characterized by higher-order interactions and become disrupted on treatment with 1,6-hexanediol. These condensates are sub-micron in size, as mentioned by the reviewer, but most TF condensates are of similar size. We agree with the reviewer that 1,6-hexanediol treatment is a brute-force experiment with several irreversible changes to the chromatin, although we have tried to use it at a low concentration for a short period of time, and it has been used in several papers (Chen et al., 2023; Gamliel et al., 2022). The opposite pattern of TFF1 vs. TFF3 expression upon 1,6-hexanediol treatment suggests that there is specificity. Further, to perturb condensates, mutants of ERα can be used (N-terminal IDR truncations); however, the transcriptional response of these mutants is also altered due to perturbed recruitment of coactivators that recognize the N-terminus of ER, which makes it difficult to distinguish ERα functions from condensate formation.

      References:

      Chen, L., Zhang, Z., Han, Q., Maity, B. K., Rodrigues, L., Zboril, E., Adhikari, R., Ko, S.-H., Li, X., Yoshida, S. R., Xue, P., Smith, E., Xu, K., Wang, Q., Huang, T. H.-M., Chong, S., & Liu, Z. (2023). Hormone-induced enhancer assembly requires an optimal level of hormone receptor multivalent interactions. Molecular Cell, 83(19), 3438-3456.e12. https://doi.org/10.1016/j.molcel.2023.08.027

      Coté, A., O’Farrell, A., Dardani, I., Dunagin, M., Coté, C., Wan, Y., Bayatpour, S., Drexler, H. L., Alexander, K. A., Chen, F., Wassie, A. T., Patel, R., Pham, K., Boyden, E. S., Berger, S., Phillips-Cremins, J., Churchman, L. S., & Raj, A. (2023). Post-transcriptional splicing can occur in a slow-moving zone around the gene. eLife, 12. https://doi.org/10.7554/eLife.91357.2

      Coulon, A., Ferguson, M. L., de Turris, V., Palangat, M., Chow, C. C., & Larson, D. R. (2014). Kinetic competition during the transcription cycle results in stochastic RNA processing. eLife, 3, e03939. https://doi.org/10.7554/eLife.03939

      Drexler, H. L., Choquet, K., & Churchman, L. S. (2020). Splicing Kinetics and Coordination Revealed by Direct Nascent RNA Sequencing through Nanopores. Molecular Cell, 77(5), 985-998.e8. https://doi.org/10.1016/j.molcel.2019.11.017

      Gamliel, A., Meluzzi, D., Oh, S., Jiang, N., Destici, E., Rosenfeld, M. G., & Nair, S. J. (2022). Long-distance association of topological boundaries through nuclear condensates. Proceedings of the National Academy of Sciences of the United States of America, 119(32), e2206216119. https://doi.org/10.1073/pnas.2206216119

      Honkela, A., Peltonen, J., Topa, H., Charapitsa, I., Matarese, F., Grote, K., Stunnenberg, H. G., Reid, G., Lawrence, N. D., & Rattray, M. (2015). Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays. Proceedings of the National Academy of Sciences of the United States of America, 112(42), 13115. https://doi.org/10.1073/pnas.1420404112

      Oh, S., Shao, J., Mitra, J., Xiong, F., D’Antonio, M., Wang, R., Garcia-Bassets, I., Ma, Q., Zhu, X., Lee, J.-H., Nair, S. J., Yang, F., Ohgi, K., Frazer, K. A., Zhang, Z. D., Li, W., & Rosenfeld, M. G. (2021). Enhancer release and retargeting activates disease-susceptibility genes. Nature, 595(7869), Article 7869. https://doi.org/10.1038/s41586-021-03577-1

      Patange, S., Ball, D. A., Wan, Y., Karpova, T. S., Girvan, M., Levens, D., & Larson, D. R. (2022). MYC amplifies gene expression through global changes in transcription factor dynamics. Cell Reports, 38(4). https://doi.org/10.1016/j.celrep.2021.110292

      Raj, A., van den Bogaard, P., Rifkin, S. A., van Oudenaarden, A., & Tyagi, S. (2008). Imaging individual mRNA molecules using multiple singly labeled probes. Nature Methods, 5(10), Article 10. https://doi.org/10.1038/nmeth.1253

      Rodriguez, J., Ren, G., Day, C. R., Zhao, K., Chow, C. C., & Larson, D. R. (2019). Intrinsic Dynamics of a Human Gene Reveal the Basis of Expression Heterogeneity. Cell, 176(1–2), 213-226.e18. https://doi.org/10.1016/j.cell.2018.11.026

      Saravanan, B., Soota, D., Islam, Z., Majumdar, S., Mann, R., Meel, S., Farooq, U., Walavalkar, K., Gayen, S., Singh, A. K., Hannenhalli, S., & Notani, D. (2020). Ligand dependent gene regulation by transient ERα clustered enhancers. PLOS Genetics, 16(1), e1008516. https://doi.org/10.1371/journal.pgen.1008516

      Stossi, F., Dandekar, R. D., Mancini, M. G., Gu, G., Fuqua, S. A. W., Nardone, A., De Angelis, C., Fu, X., Schiff, R., Bedford, M. T., Xu, W., Johansson, H. E., Stephan, C. C., & Mancini, M. A. (2020). Estrogen-induced transcription at individual alleles is independent of receptor level and active conformation but can be modulated by coactivators activity. Nucleic Acids Research, 48(4), 1800. https://doi.org/10.1093/nar/gkz1172

      Vargas, D. Y., Shah, K., Batish, M., Levandoski, M., Sinha, S., Marras, S. A. E., Schedl, P., & Tyagi, S. (2011). Single-Molecule Imaging of Transcriptionally Coupled and Uncoupled Splicing. Cell, 147(5), 1054–1065. https://doi.org/10.1016/j.cell.2011.10.024

      Waks, Z., Klein, A. M., & Silver, P. A. (2011). Cell-to-cell variability of alternative RNA splicing. Molecular Systems Biology, 7(1), 506. https://doi.org/10.1038/msb.2011.32

      Zambrano, S., Loffreda, A., Carelli, E., Stefanelli, G., Colombo, F., Bertrand, E., Tacchetti, C., Agresti, A., Bianchi, M. E., Molina, N., & Mazza, D. (2020). First Responders Shape a Prompt and Sharp NF-κB-Mediated Transcriptional Response to TNF-α. iScience, 23(9), 101529. https://doi.org/10.1016/j.isci.2020.101529

    1. eLife Assessment

      This important study developed a mathematical model to predict biological age by leveraging physiological traits across multiple organ systems. The results presented are convincing, utilizing comprehensive data-driven approaches. However, additional external validation could further strengthen its generalizability. The model provides a way to identify environmental and genetic factors impacting aging and lifespan, revealing new factors potentially affecting aging. It also shows promise for evaluating therapeutics aimed at prolonging a healthy lifespan.

    2. Reviewer #1 (Public review):

      In this study, the authors developed a mathematical model to predict human biological ages using physiological traits. This model provides a way to identify environmental and genetic factors that impact aging and lifespan.

      Strength:

      (1) The topic addressed by the authors - human age prediction using physiological traits - is an extremely interesting, important, and challenging question in the aging field. One of the biggest challenges is the lack of well-controlled data from a large number of humans. However, the authors took this challenge and tried their best to extract useful information from available data.

      (2) Some of the findings can provide valuable guidelines for future experimental design for human and animal studies. For example, it was found that this mathematical model can best predict age when all different organ and physiological systems are sampled. This finding makes sense in general, but can be, and has been, neglected when people use molecular markers to predict age. Most of those studies have used only one molecular trait or different traits from one tissue.

      Weakness:

      (1) As I mentioned above, the Biobank data used here are not designed for this current study, so there are many limitations for model development using these data, e.g., missing data points and irrelevant measurements for aging. This is a common caveat for human studies and has been discussed by the authors.

      (2) There is no validation dataset to verify the proposed model. The authors suggested that human biological age can be predicted with a high accuracy using 12 simple physiological measurements. It will be super useful and convincing if another biobank dataset containing those 12 traits can be applied to the current model.

      Comments on revisions:

      In this revision, the authors improved the manuscript by adding discussion of two main weaknesses about human data limitation and model validation. My several other specific concerns and suggestions are all properly resolved.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors developed a mathematical model to predict human biological ages using physiological traits. This model provides a way to identify environmental and genetic factors that impact aging and lifespan.

      Strengths:

      (1) The topic addressed by the authors - human age prediction using physiological traits - is an extremely interesting, important, and challenging question in the aging field. One of the biggest challenges is the lack of well-controlled data from a large number of humans. However, the authors took this challenge and tried their best to extract useful information from available data.

      The authors thank the anonymous reviewer for agreeing that building and analyzing a physiological clock is an interesting and important, albeit challenging, task.

      (2) Some of the findings can provide valuable guidelines for future experimental design for human and animal studies. For example, it was found that this mathematical model can best predict age when all different organ and physiological systems are sampled. This finding makes sense in general but can be, and has been, neglected when people use molecular markers to predict age. Most of those studies have used only one molecular trait or different traits from one tissue.

      The authors thank the anonymous reviewer for highlighting the importance of the approach we employ to sample traits for biological age prediction from multiple organs and systems, which ultimately provides more wholistic information.

      Weaknesses:

      (1) As I mentioned above, the Biobank data used here are not designed for this current study, so there are many limitations for model development using these data, e.g., missing data points and irrelevant measurements for aging. This is a common caveat for human studies and has been discussed by the authors.

      Thank you for pointing out the caveats. Indeed, most databases and datasets including the UKBB that we use here have missing or inaccurate entries. We do discuss it in the text, as well as suggest and employ strategies to mitigate these caveats. We now updated the text to highlight these issues even further. Specifically, in the second paragraph of the “Results” section, we added the following text: “Most large human databases and datasets, including UKBB, have certain limitations, such as incomplete or missing data points. Therefore, before proceeding to modelling aging, we needed to address the following three issues:”

      (2) There is no validation dataset to verify the proposed model. The authors suggested that human biological age can be predicted with high accuracy using 12 simple physiological measurements. It will be super useful and convincing if another biobank dataset containing those 12 traits can be applied to the current model.

      Thank you for this comment. Indeed, having a replication cohort would be quite valuable. As of today, there is no comparable dataset to verify performance of the clock model or to attempt to validate GWAS results. The closest possible is the NIH-led research program “All Of Us”, which aims to collect data on 1 million people, which unfortunately is not available to for-profit companies. It is theoretically possible to rebuild a clock only using a small number of phenotypes present in both datasets with the goal of training it on one dataset and test-applying it to another, but this won’t ultimately address the accuracy of the wholistic physiological clock presented here. We hope academic labs will utilize our clock-modeling approach and apply it to datasets currently unavailable to us and publish their findings.

      To strengthen the credentials of our biological clock, we would like to remind the reviewer that we performed 10 rounds of validation, where, in each round, 10% of the data were left out from the model training such that the clock was created using the remaining 90%. The model was subsequently tested on the 10% that was left out. Over the 10 rounds, a different 10% of the data was left out each time, and statistics for this 10-fold cross-validation are available in the supplementary materials. We have now updated the text to make this validation more apparent.

      Specifically, we added to the "Results” section, “A mathematical model to predict age” subsection, third paragraph, the following text: “Specifically, we performed 10 rounds of cross-validation, where 10% of data were held out and the remaining 90% used for training. Over 10 rounds, different 10% were held out for validation. In each case, the findings were validated in the test set. Full statistics and approach are described in supplementary computational methods.”

      The details of this cross-validation are described in the supplementary methods.
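
      For readers unfamiliar with the procedure, the 10-fold scheme described above can be sketched as follows. This is a minimal illustration only: it fits a one-predictor least-squares line rather than the PLS model used in the manuscript, the trait and age values are synthetic, and the function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat for every fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        truth = [ys[i] for i in held_out]
        preds = [slope * xs[i] + intercept for i in held_out]
        mean_truth = sum(truth) / len(truth)
        ss_res = sum((t - p) ** 2 for t, p in zip(truth, preds))
        ss_tot = sum((t - mean_truth) ** 2 for t in truth)
        scores.append({"r2": 1 - ss_res / ss_tot, "mse": ss_res / len(truth)})
    return scores


# Synthetic data: a single hypothetical trait that tracks age with noise.
rng = random.Random(1)
trait = [rng.uniform(0, 100) for _ in range(200)]
age = [40 + 0.3 * x + rng.gauss(0, 2) for x in trait]

for fold, score in enumerate(cross_validate(trait, age), start=1):
    print(f"fold {fold}: R2={score['r2']:.3f}  MSE={score['mse']:.3f}")
```

      Reporting the per-fold R-squared and mean squared error for the held-out data alongside the training values is one way to make the absence of overfitting explicit.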

      Additionally, we compared published GWAS results obtained for human aging clocks using modalities that were different yet relevant to human health. Specifically, we looked at GWAS for “Epigenetic Blood Age Acceleration” (Lu et al., 2018), ML-imaging-based human retinal aging clock (Ahadi et al., 2023), PhenoAgeAcceleration and BioAgeAcceleration (Kuo et al., 2021), and the ∆Age GWAS that we presented in our manuscript. We now describe the results of this comparison in our manuscript. Briefly, there is no overlap between GWAS results for any two of these published clocks built via different modalities – retina, DNA methylation, or physiological functions (between each other or with our model). However, there is a significant genetic overlap (p < 10⁻⁸) between clocks built using human phenotypic measures in a cohort of the National Health and Nutrition Examination Survey (NHANES) III in the United States (7 variables) and ∆Age from the physiological clock from UKBB that we describe here (121 variables), further validating our approach. It is interesting to consider the reasons why genetic associations for human aging built using different modalities do not appear to have common genetic correlates, something we also now discuss in our manuscript.

      Specifically, we added to the "Results” section, “Genetic loci associated with biological age” subsection, third paragraph, the following text: “Additionally, we compared our ∆Age GWAS association results with similar GWAS studies that were performed for other biological clocks. For example, (McCartney et al., 2021) used DNA methylation data on 40,000 individuals to compute biological age called GrimAge. After that they calculated an intrinsic epigenetic age acceleration (IEAA, a value similar to ∆Age, which measured a deviation of biological age from chronological age) and performed GWAS.” Additionally, we added to the “Discussion” section, “Broader implications of the model for physiological aging” subsection, fourth paragraph, the following text: “To further analyze the meaning of genetic associations with ∆Age that we described above, we compared several published GWAS results obtained for human aging clocks using different health modalities. Specifically, we looked at GWAS for “Epigenetic Blood Age Acceleration” (Lu et al., 2018), ML-imaging-based human retinal aging clock (Ahadi et al., 2023), PhenoAgeAcceleration and BioAgeAcceleration (Kuo et al., 2021), and the ∆Age GWAS we presented in our manuscript. Surprisingly, we discovered that there is no overlap between GWAS results for any two of these clocks built via different modalities – retina, DNA methylation, or physiological functions. However, there is a significant genetic overlap between clocks built using human phenotypic measures and our ∆Age model we describe. For example, the Biological Age Clock Acceleration calculated using HbA1c, Albumin, Cholesterol, FEV, Urea nitrogen, SBP, and Creatinine (Levine, 2013) in a US cohort [from National Health and Nutrition Examination Survey (NHANES)] yielded 16 significant hits in the GWAS analysis, five of which were also significant in our GWAS for UKBB based ∆Age. 
These five common loci were close to the following genes: APOB, PIK3CG, TRIB1, SMARCA4, and APOE. The significance of this overlap is p < 10⁻⁸, suggesting that the ∆Age model we propose might be translatable to other cohorts of people.

      An interesting question to consider is why GWAS results from other clock modalities, such as DNA methylation and retinal imaging do not yield any genetic similarities to each other or to physiological and biological clocks. It is possible that these modalities of age assessment depend on completely genetically independent biological processes. For example, in a simplified manner - blood composition might be heavily weighted for DNA methylation, vascular structure for retinal scans, and muscle/bone/kidney health for physiological clocks. Data from model organisms suggest the master regulators of aging exist, and APOE is the best genetic variant known to influence human aging. Interestingly, only the biological and physiological clock models that we propose here pick it up as a hit. Alternatively, it is also possible that the true master regulators of aging rate are under stringent purifying selection; for example, due to an important role in development, and therefore, do not have genetic variability in human populations examined. As such, they could not be identified as hits in any GWAS studies.”
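
      One common way to attach a p-value to an overlap between two lists of GWAS hits is a hypergeometric tail test, sketched below. The response does not state which test was used, so this is an assumption for illustration only. The 16 hits and 5 shared loci come from the text above; the number of independent loci tested and the number of hits in the second GWAS are hypothetical placeholders.

```python
from math import comb


def overlap_p_value(n_universe, n_hits_a, n_hits_b, n_shared):
    """Probability of at least `n_shared` common loci when `n_hits_b` loci are
    drawn at random from `n_universe`, of which `n_hits_a` are hits in study A
    (upper tail of the hypergeometric distribution)."""
    upper = min(n_hits_a, n_hits_b)
    total = comb(n_universe, n_hits_b)
    tail = sum(
        comb(n_hits_a, k) * comb(n_universe - n_hits_a, n_hits_b - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / total


# 16 hits and 5 shared loci are from the text; the other two numbers are
# hypothetical and would need to be replaced by the real study values.
p = overlap_p_value(n_universe=1000, n_hits_a=16, n_hits_b=20, n_shared=5)
print(f"P(overlap >= 5) = {p:.3g}")
```

      The result depends strongly on how independent loci are defined and counted, so the placeholder values here say nothing about the p-value reported in the manuscript.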

      Reviewer #2 (Public Review):

      In this manuscript, Libert et al. develop a model to predict an individual's age using physiological traits from multiple organ systems. The difference between the predicted biological age and the chronological age -- ∆Age, has an effect equivalent to that of a chronological year on Gompertz mortality risk. By conducting GWAS on ∆Age, the authors identify genetic factors that affect aging and distinguish those associated with age-related diseases. The study also uncovers environmental factors and employs dropout analysis to identify potential biomarkers and drivers for ∆Age. This research not only reveals new factors potentially affecting aging but also shows promise for evaluating therapeutics aimed at prolonging a healthy lifespan. This work represents a significant advancement in data-driven understanding of aging and provides new insights into human aging. Addressing the points raised would enhance its scientific validity and broaden its implications.

      Thank you!

      Major points:

      (1) Enhance the description and clarity of model evaluation.

      The manuscript requires additional details regarding the model's evaluation. The authors have stated "To develop a model that predicts age, we experimented with several algorithms, including simple linear regression, Gradient Boosting Machine (GBM) and Partial Least Squares regression (PLS). The outcomes of these approaches were almost identical". It is currently unclear whether the 'almost identical outcomes' mentioned refer to the similarity in top contribution phenotypes, the accuracy of age prediction, or both. To resolve this ambiguity, it would be beneficial to include specific results and comparisons from each of these models.

      Thank you for this comment. We now describe details of the model selection and provide data on outcome comparisons. Briefly, different approaches have different advantages and limitations; however, we chose one approach, and did not develop and analyze several independent models in parallel in order to not artificially inflate our False Discovery Rate (FDR). However, we now provide the rationale and comparative performance of these three approaches. Specifically, we added to the "Results” section, “A mathematical model to predict age” subsection, first paragraph the following text: “Different approaches have different advantages and limitations; however, we decided to choose one approach, and not develop and analyze several independent models in parallel in order to not artificially inflate the False Discovery Rate (FDR). We ultimately selected PLS regression because it enabled us to determine the number and composition of components required to predict age optimally from the data, which provides additional insights into the biology of human aging. But before making this selection, we compared the performance of the three approaches. The outcomes of PLS and linear regression were almost identical (R-squared between ∆Age values derived by these two methods was 0.99, meaning that if one model were to predict an individual was 62 years old, the other model would have the same prediction). This similarity is likely due to the small number of predictors (121 phenotypes) and comparatively large number of participants (over 400,000). The correlation between GBM model outcomes and PLS (and linear regression) was slightly smaller (R-squared = 0.87). The reason for the lower correlation is likely the need for imputation in PLS and linear regression models. The GBM model tolerates missing data, whereas linear regression and PLS methods require imputation or removal of individuals with too many datapoints missing, an approach we describe in more detail below.”
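
      The agreement statistic quoted above, the R-squared between ∆Age values from two models, can be computed as the squared Pearson correlation. The sketch below is illustrative only: it assumes ∆Age is defined as predicted age minus chronological age, and the ages and model predictions are invented for the example.

```python
def delta_age(predicted, chronological):
    """Per-individual difference between predicted and chronological age."""
    return [p - c for p, c in zip(predicted, chronological)]


def pearson_r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


# Hypothetical chronological ages and predictions from two models.
chronological = [50, 55, 60, 65, 70]
model_one = [52.0, 54.0, 63.0, 64.0, 73.0]
model_two = [52.1, 53.9, 63.2, 63.8, 72.9]

agreement = pearson_r_squared(
    delta_age(model_one, chronological),
    delta_age(model_two, chronological),
)
print(f"R-squared between the two sets of delta-age values: {agreement:.3f}")
```

      Comparing ∆Age rather than raw predicted age removes the shared dependence on chronological age, so a high R-squared reflects agreement in what the models add beyond the calendar.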

      Additionally, after we obtained associations of ∆Age values with genetic loci, which formed the candidate base for gene targets to influence human aging (Figure 5b), we verified the top associations obtained via the PLS model in the linear and GBM models. All the top candidates that we verified had statistically significant associations in all the models of ∆Age (CST3, APOE, HLA locus, CPS1, PIK3CG, IGF1). The precise strengths of the associations differed, but that is to be expected given that the linear-model datasets had some data imputed while the GBM model was built with missing values. We believe that, due to the small number of predictors (121) compared to the vastly larger number of individuals (over 400,000), the differences the three models introduced to the final outcomes were quite small.

      To convey this message, we added to the "Discussion” section, “Broader implications of the model for physiological aging” subsection, 7th paragraph, the following text: “It is interesting to note that the three approaches we used to generate age prediction model (PLS, GBM, and linear regression) yielded very similar or identical results in performance. We chose to settle on one approach (PLS) to not artificially inflate the False Discovery Rate (FDR); however, we verified that the top genetic loci associations obtained via the PLS model were also obtained in the GBM and linear models. Specifically, the top candidates (CST3, APOE, HLA locus, CPS1, PIK3CG, IGF1) identified in the PLS approach had statistically significant associations in all the models of ∆Age. It is likely that due to the small number of predictors (121) compared to a vastly larger number of individuals (over 400,000), the differences that these models introduce to final outcomes are quite small, which increases our confidence in the results.”

      Furthermore, the authors mention "to test for overfitting, a PLS model had been generated on randomly selected 90% of individuals and tested on the remaining 10% with similar results". To comprehensively assess the model's performance, it is crucial to provide detailed results for both the test and validation datasets. This should at least include metrics such as correlation coefficients and mean squared error for both training and test datasets.

      Thank you for bringing up this point. The detailed description and statistics of the cross-validation procedure are provided in the supplementary computational methods. Briefly, across 10 rounds of validation the Root Mean Square Error of Prediction (RMSEP) did not exceed 4.81 for females when all 9 PLS components were considered, and the RMSEP for males was 5.1 when all 11 components were considered. The variation of RMSEP between different datasets was less than 0.1. We have now updated the text to make this validation more apparent. Specifically, we added to the "Results” section, “A mathematical model to predict age” subsection, third paragraph the following text: “Specifically, we performed 10 rounds of cross-validation, where 10% of data were held out and the remaining 90% used for training. Over 10 rounds, different 10% were held out for validation. In each case, the findings were validated in the test set. Full statistics and approach are described in supplementary computational methods.”
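
      As an illustration only, the hold-out scheme described above (10 rounds, a different 10% held out each time, RMSEP reported per round) can be sketched as follows. This is not the authors' pipeline: the data are synthetic, a single-predictor least-squares fit stands in for PLS, and the names `fit_line`, `rmsep` and `ten_fold_rmsep` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single predictor, stand-in for PLS)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmsep(pred, truth):
    """Root mean square error of prediction."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def ten_fold_rmsep(xs, ys, k=10, seed=0):
    """Hold out a different 1/k of the data in each round; return per-round RMSEP."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        pred = [a + b * xs[i] for i in held_out]
        scores.append(rmsep(pred, [ys[i] for i in held_out]))
    return scores


# Synthetic data (assumption): one phenotype that declines with age, plus noise.
rng = random.Random(42)
ages = [rng.uniform(40, 70) for _ in range(2000)]
phenotype = [100 - 0.8 * a + rng.gauss(0, 4) for a in ages]

# Predict age from the phenotype and report the error in each held-out fold.
scores = ten_fold_rmsep(phenotype, ages)
print([round(s, 2) for s in scores])
print("spread across rounds:", round(max(scores) - min(scores), 2))
```

      A small spread of RMSEP across rounds, as in the figures quoted above, is what one would expect when the model is not overfitted to any particular 90% subset.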

      (2) External validation and generalization of results

      To enhance the robustness and generalizability of the study's findings, it is crucial to perform external validation using an independent population. Specifically, conducting validation with the participants of the 'All of Us' research program offers a unique opportunity. This diverse and extensive cohort, distinct from the initial study group, will serve as an independent validation set, providing insights into the applicability of the study's conclusions across varied demographics.

      Thank you for this comment. As we mentioned above, we agree that having a replication cohort would be very valuable for this study, as well as for many other studies that stem from the UKBB dataset. However, there is as yet no comparable dataset with which to verify the performance of the clock or to attempt to validate the GWAS results. The closest possible is the NIH-led research program “All Of Us”, which aims to collect data on 1 million people, but which unfortunately is not available to for-profit companies. It is theoretically possible to rebuild a clock using only the small number of phenotypes present in both datasets, with the goal of training it on one dataset and test-applying it to the other, but that approach would not ultimately be informative about the accuracy of the complete physiological clock presented here. We hope academic labs will utilize our clock approach, apply it to datasets currently unavailable to us, and publish their findings. For the detailed response on this issue, please see the response to the second comment of the first reviewer above.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Specific questions/suggestions:

      - It looks like the ages of participants are enriched around 60 years (Fig. 1, Fig 3b). Can authors clarify whether age distribution affects the correlation tests (e.g. correlation in Fig 2)?

      Indeed, the distribution of people by age is enriched for 60–65-year-olds and is depleted at younger and older ages. Such a distribution influences the uncertainty of the correlations that we compute, with error bars being larger for 40- and 70-year-olds and smaller for 50- and 60-year-olds. An example of this can be seen in Figure 1F. Figures 2a,b,g,h mostly deal with the correlation of phenotypes with each other and thus are not influenced by age. For other computations, such as age prediction, it is theoretically possible that if age determinants among 65-year-olds differ from those for 40- or 80-year-olds, the calculated contributions would be skewed to increase accuracy in the middle of the distribution at the expense of the ends. ∆Age, however, was explicitly normalized for each age cohort (Fig. 3a) to avoid “birth cohort” bias, therefore minimizing the effect of the uneven distribution on further analysis, such as GWAS. We now acknowledge and describe this feature of the UKBB dataset in the first paragraph of the “Results” section.
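
      To make the per-cohort normalization concrete, here is a minimal sketch. The response does not state the exact scheme, so the z-scoring within each integer-age cohort shown here is an assumption, and `normalize_per_cohort` is a hypothetical helper operating on toy values.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_cohort(ages, delta_ages):
    """Z-score delta-age within each integer-age cohort (assumed scheme)."""
    groups = defaultdict(list)
    for age, d in zip(ages, delta_ages):
        groups[int(age)].append(d)
    # Per-cohort mean and population standard deviation.
    stats = {a: (mean(v), pstdev(v)) for a, v in groups.items()}
    out = []
    for age, d in zip(ages, delta_ages):
        mu, sd = stats[int(age)]
        out.append((d - mu) / sd if sd > 0 else 0.0)
    return out


# Toy example: the 60-year-old cohort is larger and more variable than the 40-year-old one.
ages = [40, 40, 40, 60, 60, 60, 60]
delta = [-2.0, 0.0, 2.0, -6.0, -2.0, 2.0, 6.0]
normalized = normalize_per_cohort(ages, delta)
print([round(v, 3) for v in normalized])
```

      After this step every cohort has mean 0 and unit variance, so neither cohort size nor cohort-specific spread carries through to downstream analyses such as GWAS.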

      - Phenotypic variation usually increases during aging. However, the authors showed that delta-age and age are not correlated (Figure 3a), suggesting that biological variation does not increase during aging in their analysis. Can authors provide more evidence supporting their findings? Is this phenomenon affected by their normalization method?

      Thank you for this comment. We find that there is no strict rule for how phenotypic variation changes with age. Certain phenotypes, such as blood pressure (Fig. 1a) or SHBG (Fig. 1d), indeed increase in variation with advanced age; however, many others, such as grip strength (Fig. 1b) and BMI, do not change in variation, and certain phenotypes even decrease in variation with age. As we stated above, in order to minimize the possible effect of “birth cohort” bias on subsequent analysis, as well as of the uneven distribution of people across ages, ∆Age was normalized per age cohort. Additionally, purifying selection likely also limits how far most physiological factors can deviate. For example, people with too high or too low blood pressures would simply perish, which would limit a continuous increase in variation.

      - Authors correlate GWAS data with delta-age (Figure 4). It would be important to show whether the delta-age from young and old participants correlates with GWAS patterns in a similar manner. If not, the authors have to consider how age differences affect delta-age and the GWAS correlation. For example, the authors mentioned that APOE genotype influences age-delta even in the 40-year-old group (Figure 4f). If the APOE genotype already shows high delta-age in the 40-year-old group, how does aging affect the delta-age distribution?

      Thank you for this comment. It is an interesting question to understand how age influences GWAS hits identified through ∆Age. At the same time, one must remember that our dataset is cross-sectional in nature and “different age” in reality is a subset of different people, which lived in different times with different exposures to environments and different standards of medical care (which are evolving over time). We specifically attempted to factor age and this “cohort effect” out of our analysis and presented Figure 4f simply as an illustration that APOE variants seem to influence human aging at any age, which challenges the theory proposed by previous studies that APOE is implicated in aging simply because APOE4 carriers likely die from Alzheimer disease and are thus excluded from the oldest cohorts. To investigate the question raised by the reviewer it is possible to do GWAS on age, however one must keep in mind the limitations associated with interpreting those results; as “age” in reality (in this cross-sectional cohort) also represents changes in population composition, changes in the environment, food quality, early life care, medical care, social habits, and other parameters associated with changing society.

      - For the discussion part, it would be great if the authors could add one section to provide guidelines for future human and lab animal studies based on observations from the current study. For example, what physiological traits are most useful, and what can be further added when collecting human data?

      Thank you for the great suggestion. We now propose and discuss certain experiments that can be performed in humans and animals to better differentiate between drivers and markers of aging.

      - In line 479, I found the statement "It is possible that synapse function accounts for the association of computer gaming with ΔAge" came from nowhere, and suggest removing it.

      Done—thank you.

      - Minor. Line 155. Is it a wrong citation of table S2c, 2d as there are only 2a and 2b?

      Thank you, corrected.

      Reviewer #2 (Recommendations For The Authors):

      (1) Between lines 300-305, there is a missing reference to Figure 3e.

      Thank you, corrected.

      (2) For Figures 4a and 4c, please add the lambda statistic to the QQ plots.

      Thank you, we have added lambda inflation factors to the QQ plots.
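
      For readers unfamiliar with the statistic, the genomic inflation factor conventionally reported on QQ plots is the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The sketch below uses only the Python standard library; `genomic_inflation` is a hypothetical helper and the p-values are synthetic, not taken from the study.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Lambda = median observed chi-square (1 df) / expected median under the null.

    Each two-sided p-value (0 < p <= 1) is converted to a 1-df chi-square
    statistic as the squared normal quantile at p/2.
    """
    z = NormalDist()
    chi2 = [z.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a perfectly calibrated null distribution.
n = 1001
null_p = [(i + 0.5) / n for i in range(n)]
lam_null = genomic_inflation(null_p)

# Squaring pushes p-values toward zero, mimicking inflated test statistics.
lam_inflated = genomic_inflation([p ** 2 for p in null_p])

print(round(lam_null, 3), round(lam_inflated, 3))
```

      A lambda close to 1 indicates well-calibrated test statistics, whereas values well above 1 suggest inflation, for example from population stratification or, in large cohorts, from genuine polygenicity.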

      (3) In line 384, the p-value cut-off is mentioned as 10<sup>-9</sup>. However, this does not seem to be consistently represented in Figures 4b and 4d, where the gray lines do not align with this threshold. Please adjust these figures to accurately reflect the mentioned p-value cut-off.

      Thank you, corrected.

      (4) Clarification for Figure 5a. Add titles and correlation coefficients to Figure 5a to clearly define what the clusters represent. Please also add a discussion to explain why the cluster 10 (general health) dropout model can affect ∆Age compared to the full model, with some individuals showing a 5-year difference. Furthermore, despite the substantial effect of removing cluster 10 on ΔAge, all the top loci remain unchanged in terms of effect sizes and p-values compared to the full model.

      We have added the titles and correlation coefficients to Figure 5a. Thank you for these suggestions; they make the presentation of the data much clearer. It is an interesting observation that whereas dropping out cluster 10 resulted in quite significant changes in the ∆Age distribution, the genetic signature as determined by GWAS did not change much. The most obvious explanation is that many parameters in this category are influenced by environment more than by genetics; therefore, the genetic signature did not change much after the cluster removal. We now mention this observation in the text. Specifically, in the subsection “Cluster-dropout analysis enriches for GWAS hits that influence aging globally”, we added the following text: “Another interesting observation is that degree by which certain cluster contributes to the model does not necessarily correlate with how much this cluster contributes to genetic signature of human aging. For example, while dropping out cluster 10 (General Health) resulted in quite significant changes of ∆Age distribution (R<sup>2</sup>=0.88), the genetic signature as determined by GWAS did not change substantially. The most likely explanation is that many parameters in this category are influenced by environment more strongly than by genetics; for example, not as much as caused by cluster 1 (muscle-related) removal.”

      (5) Discussion on drivers and markers. Given the theoretical nature of the study, it would be beneficial to propose potential experimental validations for your findings. Even if these validations have not been performed, suggesting them would greatly enhance the value of the discussion.

      Thank you, it is a great idea. We now propose and discuss certain experiments that can be performed in humans and animals to better differentiate between drivers and markers of aging. Specifically, in the subsection “Cluster-dropout analysis enriches for GWAS hits that influence aging globally”, we added the following text: “To definitively distinguish whether a gene is a driver or a marker of aging, an experiment would need to be performed. It is possible that certain gene activities are influenced by existing FDA-approved medications, and retrospective analyses of human cohorts who take certain medications can be performed. More likely, however, an animal model would need to be employed, where animals with candidate genes modified via genetic means are investigated for lifespan and onset and progression of age-associated conditions. For example, one can engineer a mouse with a conditional allele of Cystatin-C and evaluate how changes in dosage of this protein influence various phenotypes of aging.”

    1. eLife Assessment

      This potentially useful study introduces an orthogonal approach for detecting RNA modification, without chemical modification of RNA, which often results in RNA degradation and therefore loss of information. Compared to previous versions, the most recent one is improved and sufficiently aligned with the standards of the field to merit consideration by the research community, making the evidence solid according to said standards. Nevertheless, uncertainty regarding false positive and false negative rates remains, as it does for some of the alternative approaches. With more rigorous validation, the approach might be of particular interest for sites in RNA molecules where modifications are rare.

    2. Reviewer #2 (Public review):

      The fledgling field of epitranscriptomics has encountered various technical roadblocks with implications as to the validity of early epitranscriptomics mapping data. As a prime example, the low specificity of (supposedly) modification-specific antibodies for the enrichment of modified RNAs, has been ignored for quite some time and is only now recognized for its dismal reproducibility (between different labs), which necessitates the development of alternative methods for modification detection. Furthermore, early attempts to map individual epitranscriptomes using sequencing-based techniques are largely characterized by the deliberate avoidance of orthogonal approaches aimed at confirming the existence of RNA modifications that have been originally identified.

      Improved methodology, the inclusion of various controls, and better mapping algorithms as well as the application of robust statistics for the identification of false-positive RNA modification calls have allowed revisiting original (seminal) publications whose early mapping data allowed making hyperbolic claims about the number, localization and importance of RNA modifications, especially in mRNA. Besides the existence of m6A in mRNA, the detectable incidence of RNA modifications in mRNAs has drastically dropped.

      As for m5C, the subject of the manuscript submitted by Zhou et al., its identification in mRNA goes back to Squires et al., 2012 reporting on >10.000 sites in mRNA of a human cancer cell line, followed by intermittent findings reporting on pretty much every number between 0 to > 100.000 m5C sites in different human cell-derived mRNA transcriptomes. The reason for such discrepancy is most likely of a technical nature. Importantly, all studies reporting on actual transcript numbers that were m5C-modified relied on RNA bisulfite sequencing, an NGS-based method, that can discriminate between methylated and non-methylated Cs after chemical deamination of C but not m5C. RNA bisulfite sequencing has a notoriously high background due to deamination artifacts, which occur largely due to incomplete denaturation of double-stranded regions (denaturing-resistant) of RNA molecules. Furthermore, m5C sites in mRNAs have now been mapped to regions that have not only sequence identity but also structural features of tRNAs. Various studies revealed that the highly conserved m5C RNA methyltransferases NSUN2 and NSUN6 do not only accept tRNAs but also other RNAs (including mRNAs) as methylation substrates, which in combination account for most of the RNA bisulfite-mapped m5C sites in human mRNA transcriptomes. Is m5C in mRNA only a result of the Star activity of tRNA or rRNA modification enzymes, or is their low stoichiometry biologically relevant?

      In light of the short-comings of existing tools to robustly determine m5C in transcriptomes, other methods, like DRAM-seq, aiming to map m5C independently of ex situ RNA treatment with chemicals, are needed to arrive at a more solid "ground state", from which it will be possible to state and test various hypotheses as to the biological function of m5C, especially in lowly abundant RNAs such as mRNA.

      Importantly, the identification of >10.000 sites containing m5C through DRAM-Seq increases the number of potential m5C marks in human cancer cells from a couple of 100 (after rigorous post-hoc analysis of RNA bisulfite sequencing data) by orders of magnitude. This raises the question of whether the application of these editing tools results in editing artefacts that overstate the number of actual m5C sites in the human cancer transcriptome.

      [Editors' note: earlier reviews have been provided here: https://doi.org/10.7554/eLife.98166.3.sa1; https://doi.org/10.7554/eLife.98166.2.sa1; https://doi.org/10.7554/eLife.98166.1.sa1]

    3. Author response:

      The following is the authors’ response to the original reviews.

      Responses to Reviewer’s Comments:  

      To Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C. 

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>C-modified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified versus unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as an m<sup>5</sup>C reader; however, none of the experiments have been repeated by another laboratory. The other manuscript reports on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PAR-CLIP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.

      Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi:10.1038/cr.2017.55, and supports the conclusion that wild-type ALYREF is not a specific m<sup>5</sup>C binder. The necessity of always including an overexpression of ALYREF-mut in parallel DRAM experiments makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins etc.).

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55 (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C-modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C-modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C-modified RNAs. However, our detection system has certain limitations. Despite mutating the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain a more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discuss these limitations in lines 109-114.

      In addition, while the m<sup>5</sup>C assay can be performed using the DRAM system alone, comparing it with the DRAM<sup>mut</sup> control enhances the accuracy of m<sup>5</sup>C region detection. To minimize variations in transfection efficiency across experimental groups, it is recommended to use the same batch of transfections. This approach not only ensures more consistent results but also improves the standardization of the DRAM assay, as discussed in the section added on lines 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN-mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in reduced DRAM editing.

      This confounds the interpretation of our experimental data.

      As demonstrated in Author response image 1A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of the fusion proteins. In addition, we attempted in vitro fusion protein purification using multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Author response image 1B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Author response image 1C), indicating that further optimization is required. This issue is further discussed in lines 314-315.

      Taken together, the above evidence indicates that the sodium arsenite experiment was confounded, and we decided to remove the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in another way than Excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening high-confidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

      Responses to Reviewer’s Comments:  

      To Reviewer #3:

      The authors have again tried to address the former concern by this reviewer, who questioned the specificity of both m<sup>5</sup>C reader proteins towards modified RNA rather than unmodified RNA. The authors chose to do RNA pull-down experiments, which serve as a proxy for proving the specificity of ALYREF and YBX1 for m<sup>5</sup>C-modified RNAs. Even though this reviewer asked for determining the enrichment factor of the reader-base editor fusion proteins (as wildtype or mutant for the identified m<sup>5</sup>C specificity motif) when presented with m<sup>5</sup>C-modified RNAs, the authors chose to use both reader proteins alone (without the fusion to an editor) as wildtype and as the respective m<sup>5</sup>C-binding mutant in RNA in vitro pull-down experiments along with unmodified and m<sup>5</sup>C-modified RNA oligomers as binding substrates. The quantification of these pull-down experiments (n=2) has now been added and reveals that (according to SFigure 1 E and G) YBX1 enriches an RNA containing a single m<sup>5</sup>C by a factor of 1.3 over its unmodified counterpart, while ALYREF enriches by a factor of 4x. This is an acceptable approach for educated readers to question the specificity of the reader proteins, even though the quantification should be performed differently (see below).

      Given that there is no specific sequence motif embedding those cytosines identified in the vicinity of the DRAM-edits (Figure 3J and K), even though it has been accepted by now that most of the m<sup>5</sup>C sites in mRNA are mediated by NSUN2 and NSUN6 proteins, which target tRNA-like substrate structures with a particular sequence enrichment, one can conclude that DRAM-Seq is uncovering a huge number of false positives. This must be so not only because of the RNA bisulfite seq data that have been extensively studied by others, but also by the following calculations: Given that the m<sup>5</sup>C/C ratio in human mRNA is 0.02-0.09% (measured by mass spec) and assuming that 1/4 of the nucleotides in an average mRNA are cytosines, an mRNA of 1.000 nucleotides would contain 250 Cs. 0.02-0.09% m<sup>5</sup>C/C would then translate into 0.05-0.225 methylated cytosines per 250 Cs in a 1000 nt mRNA. YBX1 would bind every C in such an mRNA since there is no m<sup>5</sup>C to be expected, which it could bind with 1.3x higher affinity. Even if the mRNAs were 10.000 nt long, YBX1 would bind to half a methylated cytosine or 2.25 methylated cytosines with 1.3x higher affinity than to all the remaining cytosines (2499.5 to 2497.75 of 2.500 cytosines in 10.000 nt, respectively). These numbers indicate a 4999x to 1110x excess of cytosine over m<sup>5</sup>C in any substrate RNA, which the "reader" can bind as shown in the RNA pull-downs on unmodified RNAs. This reviewer spares the reader of this review the calculations for ALYREF specificity, which is slightly higher than YBX1. Hence, it is up to the capable reader of these calculations to follow the claim that this minor affinity difference allows the unambiguous detection of the few m<sup>5</sup>C sites in mRNA be it in the endogenous scenario of a cell or as fusion-protein with a base editor attached?
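
      The arithmetic in the paragraph above can be reproduced with a few lines of Python. The inputs (one quarter of nucleotides being cytosine, and an m<sup>5</sup>C/C ratio of 0.02-0.09%) are the assumptions stated by the reviewer; the function name `m5c_excess` is hypothetical.

```python
def m5c_excess(mrna_length_nt, m5c_fraction, cytosine_share=0.25):
    """Return (expected m5C count, unmodified C count, C-to-m5C excess)."""
    total_c = mrna_length_nt * cytosine_share
    methylated = total_c * m5c_fraction
    unmodified = total_c - methylated
    return methylated, unmodified, unmodified / methylated


for length in (1_000, 10_000):
    for fraction in (0.0002, 0.0009):  # 0.02% and 0.09% m5C/C
        m, u, x = m5c_excess(length, fraction)
        print(f"{length} nt, {fraction:.2%} m5C/C: "
              f"{m:.3f} m5C, {u:.2f} unmodified C, {x:.0f}x excess")
```

      Note that the excess equals (1 - f)/f for an m<sup>5</sup>C/C ratio f, so it depends only on that ratio and not on transcript length, which is why the same 4999x and 1110x figures apply to both the 1.000 nt and the 10.000 nt case.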

      We sincerely appreciate the reviewer’s rigorous analysis. We would like to clarify that in our RNA pulldown assays, we indeed utilized the full DRAM system (reader protein fused to the base editor) to reflect the specificity of m<sup>5</sup>C recognition. As previously suggested by the reviewer, to independently validate the m<sup>5</sup>C-binding specificity of ALYREF and YBX1, we performed separate pulldown experiments with wild-type and mutant reader proteins (without the base editor fusion) using both unmodified and m<sup>5</sup>C-modified RNA substrates. This approach aligns with established methodologies in the field (doi:10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y). We have revised the Methods section (line 230) to explicitly describe this experimental design.

      Although the m<sup>5</sup>C/C ratios in LC/MS-assayed mRNA are relatively low (ranging from 0.02% to 0.09%), as noted by the reviewer, both our data and previous studies have demonstrated that ALYREF and YBX1 preferentially bind to m<sup>5</sup>C-modified RNAs over unmodified RNAs, exhibiting 4-fold and 1.3-fold enrichment, respectively (Supplementary Figure 1E–1G). Importantly, this specificity is further enhanced in the DRAM system through two key mechanisms: first, the fusion of reader proteins to the deaminase restricts editing to regions near m<sup>5</sup>C sites, thereby minimizing off-target effects; second, background editing observed in reader-mutant or deaminase controls (e.g., DRAM<sup>mut</sup>-CBE in Figure 2D) is systematically corrected for during data analysis.

      We agree that the vast excess of unmodified cytosines poses a theoretical challenge. However, our approach includes stringent controls to alleviate this issue. Specifically, sites identified in NSUN2/NSUN6 knockout cells or reader-mutant controls are excluded (Figure 3F), which significantly reduces the number of false-positive detections. Additionally, we have observed deamination changes near high-confidence m<sup>5</sup>C methylation sites detected by RNA bisulfite sequencing, both in first-generation and high-throughput sequencing data. This observation further substantiates the validity of DRAM-Seq in accurately identifying m<sup>5</sup>C sites.

      We fully acknowledge that residual false positives may persist due to the inherent limitations of reader protein specificity, as discussed in lines 299–301 of our manuscript. To address this, we plan to optimize reader domains with enhanced m<sup>5</sup>C binding (e.g., through structure-guided engineering), as already noted in the Discussion section of the manuscript.

      The reviewer supports the attempt to visualize the data. However, the usefulness of this Figure addition as a readable presentation of the data included in the supplement is open to debate.

      Thank you for your kind suggestion. We understand the reviewer's concern regarding data visualization. However, due to the large volume of DRAM-seq data, it is challenging to present each mutation site and its characteristics clearly in a single figure. Therefore, we chose to categorize the data by chromosome, which not only allows for a more organized presentation of the DRAM-seq data but also facilitates comparison with other database entries. Additionally, we have updated Supplementary Tables 2 and 3 to provide comprehensive information on the mutation sites. We hope that both the reviewer and editors will understand this approach. We will, of course, continue to carefully consider the reviewer's suggestions and explore better ways to present these results in the future.

      (3) A set of private Recommendations for the Authors that outline how you think the science and its presentation could be strengthened

      NEW COMMENTS to TEXT:

      Abstract:

      "5-Methylcytosine (m<sup>5</sup>C) is one of the major post-transcriptional modifications in mRNA and is highly involved in the pathogenesis of various diseases."

      In light of the increasing use of AI-based writing, and the proof that neither DeepSeek nor ChatGPT writes truthful statements if they collect metadata from scientific abstracts, this sentence is utterly misleading.

      m<sup>5</sup>C is not one of the major post-transcriptional modifications in mRNA as it is only present with a m<sup>5</sup>C/C ratio of 0.02- 0.09% as measured by mass-spec. Also, if m<sup>5</sup>C is involved in the pathogenesis of various diseases, it is not through mRNA but tRNA. No single published work has shown that a single m<sup>5</sup>C on an mRNA has anything to do with disease. Every conclusion that is perpetuated by copying the false statements given in the many reviews on the subject is based on knock-out phenotypes of the involved writer proteins. This reviewer wishes that the authors would abstain from the common practice that is currently flooding any scientific field through relentless repetitions in the increasing volume of literature which perpetuate alternative facts.

      We sincerely appreciate the reviewer’s insightful comments. While we acknowledge that m<sup>5</sup>C is not the most abundant post-transcriptional modification in mRNA, we believe that research into m<sup>5</sup>C modification holds considerable value. Numerous studies have highlighted its role in regulating gene expression and its potential contribution to disease progression. For example, recent publications have demonstrated that m<sup>5</sup>C modifications in mRNA can influence cancer progression, lipid metabolism, and other pathological processes (e.g., PMID: 37845385; 39013911; 39924557; 38042059; 37870216).

      We fully agree with the reviewer on the importance of maintaining scientific rigor in academic writing. While m<sup>5</sup>C is not the most abundant RNA modification, we cannot simply draw a conclusion that the level of modification should be the sole criterion for assessing its biological significance. However, to avoid potential confusion, we have removed the word “major”.

      COMMENTS ON FIGURE PRESENTATION:

      Figure 2D:

      The main text states: "DRAM-CBE induced C to U editing in the vicinity of the m<sup>5</sup>C site in AP5Z1 mRNA, with 13.6% C-to-U editing, while this effect was significantly reduced with APOBEC1 or DRAM<sup>mut</sup>-CBE (Fig.2D)." The Figure does not fit this statement. The seq trace shows a U signal of about 1/3 of that of C (about 30%), while the quantification shows 20+ percent.

      Thank you for your kind suggestion. Upon visual evaluation, the sequencing trace in the figure appears to suggest a mutation rate closer to 30% rather than 22%. However, relying solely on the visual interpretation of sequencing peaks is not a rigorous approach. The trace on the left represents the visualization of Sanger sequencing results using SnapGene, while the quantification on the right is derived from EditR 1.0.10 software analysis of three independent biological replicates. The C-to-U mutation rates calculated were 22.91667%, 23.23232%, and 21.05263%, respectively. To further validate this, we have included the original EditR analysis of the Sanger sequencing results for the DRAM-CBE group used in the left panel of Figure 2D (see Author response image 2). This analysis confirms an m<sup>5</sup>C fraction (%) of 22/(22+74) = 22.91667, and the sequencing trace aligns well with the mutation rate we reported in Figure 2D. In conclusion, the data and conclusions presented in Figure 2D are consistent and supported by the quantitative analysis.
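
      The percentages quoted in this response can be checked with a minimal Python sketch. It assumes that 22 and 74 are the edited (U) and unedited (C) signal values behind the quoted fraction 22/(22+74); the helper name `edit_percent` is hypothetical and is not part of EditR.

```python
def edit_percent(edited_signal, unedited_signal):
    """Percentage of edited signal at a site: edited / (edited + unedited) * 100."""
    return 100.0 * edited_signal / (edited_signal + unedited_signal)


# Replicate shown in the left panel, using the values quoted in the response.
shown_replicate = edit_percent(22, 74)

# The three replicate percentages as reported in the response.
reported = [22.91667, 23.23232, 21.05263]
mean_rate = sum(reported) / len(reported)

print(f"shown replicate: {shown_replicate:.5f}%")
print(f"mean of three replicates: {mean_rate:.5f}%")
```

      Both values fall between 22% and 23%, in line with the "20+ percent" quantification the reviewer read from the bar plot.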

      Author response image 2.

      Figure 4B: shows now different numbers in Venn-diagrams than in the same depiction, formerly Figure 4A

      We sincerely thank the reviewer for pointing out this issue, and we apologize for not clearly indicating the changes in the previous version of the manuscript. In response to the initial round of reviewer comments, we implemented a more stringent data filtering process (as described in Figure 3F and the Methods section): "For high-confidence filtering, we further adjusted the parameters of Find_edit_site.pl to include an edit ratio of 10%–60%, a requirement that the edit ratio in control samples be at least 2-fold higher than in NSUN2 or NSUN6 knockout samples, and at least 4 editing events at a given site." As a result, we made minor adjustments to the Venn diagram data in Figure 4A, reducing the total number of DRAM-edited mRNAs from 11,977 to 10,835. These changes were consistently applied throughout the manuscript, and the modifications have been highlighted for clarity. Importantly, these adjustments do not affect any of the conclusions presented in the manuscript.
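
      The three quoted criteria can be expressed as a small Python sketch. This is not the authors' Find_edit_site.pl script: the `Site` record, its field names, and the function `is_high_confidence` are hypothetical. The sketch also assumes that the 10%–60% bounds are inclusive and that the control ratio must be at least 2-fold higher than the ratio in each knockout line; the quoted sentence would equally allow a test against only one of them.

```python
from dataclasses import dataclass


@dataclass
class Site:
    edit_ratio: float      # fraction of edited reads in the control sample
    edit_events: int       # number of editing events observed at the site
    nsun2_ko_ratio: float  # edit ratio at the same site in NSUN2-knockout cells
    nsun6_ko_ratio: float  # edit ratio at the same site in NSUN6-knockout cells


def is_high_confidence(site, min_ratio=0.10, max_ratio=0.60,
                       min_fold=2.0, min_events=4):
    """Apply the three quoted criteria (inclusive bounds are an assumption)."""
    in_range = min_ratio <= site.edit_ratio <= max_ratio
    enough_events = site.edit_events >= min_events
    # Assumption: the control must exceed BOTH knockout ratios by min_fold.
    above_knockouts = (site.edit_ratio >= min_fold * site.nsun2_ko_ratio
                       and site.edit_ratio >= min_fold * site.nsun6_ko_ratio)
    return in_range and enough_events and above_knockouts


# Made-up example sites, for illustration only.
candidates = [
    Site(0.25, 6, 0.05, 0.10),   # passes all three criteria
    Site(0.25, 3, 0.05, 0.10),   # too few editing events
    Site(0.70, 10, 0.00, 0.00),  # edit ratio above 60%
    Site(0.25, 6, 0.20, 0.00),   # not 2-fold above the NSUN2-knockout ratio
]
kept = [site for site in candidates if is_high_confidence(site)]
print(f"{len(kept)} of {len(candidates)} sites retained")
```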

      Figure 4B and D: while the overlap of the DRAM-Seq data with RNA bisulfite data might be 80% or 92%, it is obvious that the remaining DRAM-Seq data suggest the detection of additional sites of around 97% or 81.83%. It would be advisable to mention this large number of additional sites as potential false positives, unless these data were normalized to the sites that can be allocated to NSUN2 and NSUN6 activity (NSUN mutant data sets could be subtracted).

      Thank you for pointing this out. The Venn diagrams presented in Figure 4B and D already reflect the exclusion of potential false-positive sites identified in methyltransferase-deficient datasets, as described in our experimental filtering process, and they represent the remaining sites after this stringent filtering. However, we acknowledge that YBX1 and ALYREF, while preferentially binding to m<sup>5</sup>C-modified RNA, also exhibit some affinity for unmodified RNA. Although we employed rigorous controls, including DRAM<sup>mut</sup> and deaminase groups, to minimize false positives, the possibility of residual false positives cannot be entirely ruled out. Addressing this limitation would require even more stringent filtering methods, as discussed in lines 299–301 of the manuscript. We are committed to further optimizing the DRAM system to enhance the accuracy of transcriptome-wide m<sup>5</sup>C analysis in future studies.

      SFigure 1: It is clear that the wild-type versions of both reader proteins bind robustly to RNA that does not contain m<sup>5</sup>C. As for the calculations of x-fold affinity loss of RNA binding using both ALYREF-mut or YBX1-mut, this reviewer asks the authors to determine how much less the mutated versions of the proteins bind to m<sup>5</sup>C-modified RNAs. Hence, a comparison of YBX1 versus YBX1-mut (ALYREF versus ALYREF-mut) on the same substrate RNA with the same m<sup>5</sup>C-modified position would allow determining the contribution of the so-called modification binding pocket in the respective proteins to their RNA binding. The way the authors chose to show the data presently is misleading because what is compared is the binding of either the wild type or the mutant protein to different RNAs.

      We appreciate the reviewer’s valuable feedback and apologize for any confusion caused by the presentation of our data. We would like to clarify the rationale behind our approach. The decision to present the wild-type and mutant reader proteins in separate panels, rather than together, was made in response to comments from Reviewer 2. Below, we provide a detailed explanation of our experimental design and its justification.

      First, we confirmed that YBX1 and ALYREF exhibit stronger binding affinity to m<sup>5</sup>C-modified RNA compared to unmodified RNA, establishing their role as m<sup>5</sup>C reader proteins. Next, to validate the functional significance of the DRAM<sup>mut</sup> group, we demonstrated that mutating key amino acids in the m<sup>5</sup>C-binding pocket significantly reduces the binding affinity of YBX1<sup>mut</sup> and ALYREF<sup>mut</sup> to m<sup>5</sup>C-modified RNA. This confirms that the DRAM<sup>mut</sup> group effectively minimizes false-positive results by disrupting specific m<sup>5</sup>C interactions.

      Crucially, in our pull-down experiments, both the wild-type and mutant proteins (YBX1/YBX1<sup>mut</sup> and ALYREF/ALYREF<sup>mut</sup>) were incubated with the same RNA sequences. To avoid any ambiguity, we have included the specific RNA sequence information in the Methods section (lines 463–468). This ensures an assessment of the reduced binding affinity of the mutant versions relative to the wild-type proteins, even though they are presented in separate panels.

      We hope this explanation clarifies our approach and demonstrates the robustness of our findings. We sincerely appreciate the reviewer’s understanding and hope this addresses their concerns.

      SFigure 2C: first two panels are duplicates of the same image.

      Thank you for pointing this out. We sincerely apologize for incorrectly duplicating the images. We have now updated Supplementary Figure 2C with the correct panels and have provided the original flow cytometry data for the first two images. It is important to note that, as demonstrated by the original data analysis, the EGFP-positive quantification values (59.78% and 59.74%) remain accurate. Therefore, this correction does not affect the conclusions of our study. Thank you again for bringing this to our attention.

      Author response image 3.

      SFigure 4B: how would the PCR product for NSUN6 be indicative of a mutation? The used primers seem to amplify the wildtype sequence.

      Thank you for your kind suggestion. In our NSUN6<sup>-/-</sup> cell line, the NSUN6 gene is missing only a single base pair (1 bp) compared to the wild type, which results in a frameshift mutation and a reduction in NSUN6 protein expression. We fully agree with the reviewer that the current PCR gel electrophoresis does not clearly distinguish this 1 bp deletion. To better illustrate our experimental design, we have included a schematic representation of the knockout sequence in SFigure 4B. Additionally, we have provided the original sequencing data, and the corresponding details have been added to lines 151-153 of the manuscript for further clarification.

      Author response image 4.

      SFigure 4C: the Figure legend is insufficient to understand the subfigure.

      Thank you for your valuable suggestion. To improve clarity, we have revised the figure legend for SFigure 4C, as well as the corresponding text in lines 178-179. We have additionally updated the title of SFigure 4 for better clarity. The updated SFigure 4C now demonstrates that the DRAM-edited mRNAs exhibit a high degree of overlap across the three biological replicates.

      SFigure 4D: the Figure legend is insufficient to understand the subfigure.

      Thank you for your kind suggestion. We have revised the figure legend to provide a clearer explanation of the subfigure. Specifically, this figure illustrates the motif analysis derived from sequences spanning 10 nucleotides upstream and downstream of DRAM-edited sites mediated by loci associated with NSUN2 or NSUN6. To enhance clarity, we have also rephrased the relevant results section (lines 169-175) and the corresponding discussion (lines 304-307).

      SFigure 7: There is something off with all 6 panels. This reviewer can find data points in each panel that do not show up on the other two panels even though this is a pairwise comparison of three data sets (file was sent to the Editor) Available at https://elife-rp.msubmit.net/elife-rp_files/2025/01/22/00130809/02/130809_2_attach_27_15153.pdf

      Response: We thank the reviewer for pointing this out. We would like to clarify the methodology behind this analysis. In this study, we conducted pairwise comparisons of the number of DRAM-edited sites per gene across three biological replicates of DRAM-ABE or DRAM-CBE, visualized as scatterplots. Each data point in the plots corresponds to a gene, and while the same gene is represented in all three panels, its position may vary vertically or horizontally across the panels. This variation arises because the number of mutation sites typically differs between replicates, making it unlikely for a data point to occupy the exact same position in all panels. A similar analytical approach has been used in previous studies on m6A (PMID: 31548708). To address the reviewer’s concern, we have annotated the corresponding positions of the questioned data points with arrows in Author response image 5.
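
      The point about shifting positions can be illustrated with a minimal Python sketch. The gene names and site counts below are invented for illustration and are not taken from the DRAM-seq data; the sketch also assumes that only genes detected in both replicates of a pair are plotted.

```python
from itertools import combinations

# Hypothetical numbers of edited sites per gene in three replicates (made-up values).
sites_per_gene = {
    "rep1": {"GENE_A": 3, "GENE_B": 1, "GENE_C": 5},
    "rep2": {"GENE_A": 2, "GENE_B": 1, "GENE_C": 4},
    "rep3": {"GENE_A": 3, "GENE_B": 2, "GENE_C": 5},
}


def pairwise_points(counts):
    """Return the (x, y) coordinate of each gene for every pair of replicates."""
    panels = {}
    for x_rep, y_rep in combinations(sorted(counts), 2):
        # Assumption: a gene is plotted only if it was detected in both replicates.
        shared = sorted(set(counts[x_rep]) & set(counts[y_rep]))
        panels[(x_rep, y_rep)] = {
            gene: (counts[x_rep][gene], counts[y_rep][gene]) for gene in shared
        }
    return panels


panels = pairwise_points(sites_per_gene)
for pair, points in panels.items():
    print(pair, "GENE_A at", points["GENE_A"])
```

      With these illustrative counts, the same gene sits at a different coordinate in each of the three panels, because each panel pairs a different set of two replicates.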

      Author response image 5.

    1. eLife Assessment

      The research presents valuable findings on the impact of FRMD8 loss on tumor progression and resistance to tamoxifen therapy. Through a series of convincing and systematic experiments, the author thoroughly investigates the role of FRMD8 in breast cancer and its underlying regulatory mechanisms. The study confirms that FRMD8 holds potential as a therapeutic target for reversing tamoxifen resistance, offering helpful insights for future treatment strategies.

    2. Reviewer #1 (Public review):

      Summary:

      Tamoxifen resistance is a common problem in partially ER-positive patients undergoing endocrine therapy, and this manuscript has important research significance as it is based on clinical practical issues. The manuscript discovered that the absence of FRMD8 in breast epithelial cells can promote the progression of breast cancer, thus proposing the hypothesis that FRMD8 affects tamoxifen resistance and validated this hypothesis through a series of experiments. The manuscript has certain theoretical reference value.

      Strengths:

      At present, research on the role of FRMD8 in breast cancer is very limited. This manuscript leverages the MMTV-Cre+;Frmd8fl/fl;PyMT mouse model to study the role of FRMD8 in tamoxifen resistance, and single-cell sequencing technology discovered the interaction between FRMD8 and ESR1. At the mechanistic level, this manuscript has demonstrated two ways in which FRMD8 affects ERα, providing some new insights into the development of ER-positive breast cancer in patients who are resistant to tamoxifen.

      Limitations:

      Whether FRMD8 can become a biomarker should be verified in large clinical samples or clinical data.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript presents a valuable finding on the impact of FRMD8 loss on tumor progression and the resistance to tamoxifen therapy. The author conducted systematic experiments to explore the role of FRMD8 in breast cancer and its potential regulatory mechanisms, confirming that FRMD8 could serve as a potential target to reverse tamoxifen resistance.

      The research is logically coherent and persuasive. The results support their conclusions and have achieved the research objectives.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Tamoxifen resistance is a common problem in partially ER-positive patients undergoing endocrine therapy, and this manuscript has important research significance as it is based on clinical practical issues. The manuscript discovered that the absence of FRMD8 in breast epithelial cells can promote the progression of breast cancer, thus proposing the hypothesis that FRMD8 affects tamoxifen resistance and validating this hypothesis through a series of experiments. The manuscript has a certain theoretical reference value.

      Strengths:

      At present, research on the role of FRMD8 in breast cancer is very limited. This manuscript leverages the MMTV-Cre+;Frmd8fl/fl;PyMT mouse model to study the role of FRMD8 in tamoxifen resistance, and single-cell sequencing technology discovered the interaction between FRMD8 and ESR1. At the mechanistic level, this manuscript has demonstrated two ways in which FRMD8 affects ERα, providing some new insights into the development of ER-positive breast cancer in patients who are resistant to tamoxifen.

      Weaknesses:

      This manuscript repeatedly emphasizes the role of FRMD8/FOXO3A in tamoxifen resistance in ER-positive breast cancer, but the specific mechanisms have not yet been fully elucidated. Whether FRMD8 can become a biomarker should be verified in large clinical samples or clinical data.

      We appreciate your recognition and valuable suggestions. The proliferation of ERα-positive breast cancer cells is contingent upon the expression of ERα. Tamoxifen, a selective estrogen receptor modulator, competitively binds to ERα, thereby inhibiting the activation of the proliferation signaling pathway. Previous studies have demonstrated that the downregulation of ERα expression results in a reduction in the sensitivity of breast cancer cells to tamoxifen (PMID: 15894097; PMID: 922747). Our study revealed the molecular mechanism by which FRMD8 regulates ERα expression through FOXO3A and UBE3A, and thus FRMD8 deficiency is a cause of tamoxifen treatment resistance. 

      In this study, our results showed that low expression of FRMD8 predicts poor prognosis in breast cancer patients. We agree with this reviewer and will validate the role of FRMD8 in more patient samples and expand its application in different cancer types.

      Reviewer #2 (Public review):

      Summary:

      The manuscript presents a valuable finding on the impact of FRMD8 loss on tumor progression and the resistance to tamoxifen therapy. The author conducted systematic experiments to explore the role of FRMD8 in breast cancer and its potential regulatory mechanisms, confirming that FRMD8 could serve as a potential target to reverse tamoxifen resistance.

      Strengths:

      The majority of the research is logically clear, smooth, and persuasive.

      Weaknesses:

      Some research in the article lacks depth and some sentences are poorly organized.

      Thank you for your helpful suggestion. We have carefully revised the manuscript again. 

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      This manuscript suggests that the resistance of tamoxifen in breast cancer is linked to the loss of function of FRMD8. This is a relatively good and valuable contribution. However, there are several points that confused me.

      (1) The subfigures with important conclusions should include quantitative analysis, for example, Figure 4D, 4E, and 6A. In Figure 6F, which subtypes of normal and tumor tissues were investigated?

      Thank you for your helpful suggestions. We have quantified the bands in Figure 4D, 4E, and 6A and labelled them in the figures. 

      We have also provided details of the tumor samples in Table S3 and the “Materials and Methods” section. The majority of tumor tissues are invasive ductal carcinomas.

      (2) In the luminal epithelium-specific Frmd8 knockout mice (MMTV-Cre+; Frmd8fl/fl), the authors demonstrated that the loss of FRMD8 promotes the growth of breast tumors. In Figure 3A, the expression of ERα and PR in tumors is nearly negative. However, why was the validation of the mechanism performed in breast tumor cell lines and not in epithelial cells?

      Thanks for the question. Early-stage mammary tumors in MMTV-PyMT mice express ERα, while ERα is negative in advanced tumors of MMTV-PyMT mice. Figure 3A shows the results of tumors from four-month-old mice. Meanwhile, our supplementary results showed that loss of Frmd8 decreased ERα expression also in normal and atypical hyperplasia mammary tissues from 7-week-old MMTV-PyMT mice, when the mice had no palpable tumors and ERα is positive (Fig. S3E). We believe that the absence of FRMD8 contributes to the acceleration of the malignant progression during the dynamic evolution of breast cancer. Limited by the difficulty of transfection in breast normal epithelial cell line (MCF10A), we explored the subsequent mechanisms mainly in breast cancer cells and HEK293, a human embryonic kidney cell line. Besides, Figure S3E also showed the regulation of ERα expression by Frmd8 in mouse mammary epithelial cells.

      (3) To explore the mechanism by which FRMD8 inhibits ERα degradation, what is the reason for choosing HEK293A?

      Thank you for the good question. The HEK293 cell line is commonly used in mechanistic studies. We also employed the breast cancer cell line T47D to verify the observations in HEK293 cells. Furthermore, the mass spectrometry result of HEK293A cells presented in Figure 5E was an additional experiment performed when we were exploring the regulation of the cell cycle by FRMD8, which was published in Cell Reports (PMID: 37527040). Based on the mass spectrometry result, we assumed that FRMD8 may influence ERα degradation mediated by UBE3A.

      Reviewer #2 (Recommendations for the authors):

      Introduction

      (1) In order for the reader to better understand the content of the article, it is better to briefly describe the role of ERα in the progression of breast cancer.

      Thank you for your suggestion. We have provided a brief description of the role of ERα in the introduction of revised manuscript:

      “ERα is a ligand-activated transcription factor that is activated by oestrogen, and promotes cell proliferation during breast cancer development (Harbeck et al., 2019).”

      (2) As ESR1 is mentioned in the second paragraph, a brief description of the relationship between ESR1 and ERα can make the article more logical.

      Thank you for the suggestion. We have added the description in the introduction:

      “Multiple transcription factors, such as AP-2γ, FOXO3, FOXM1, and GATA3, have been reported to bind to the promoter region of ESR1, the gene encoding ERα, and participate in transcriptional regulation of ESR1(Jia et al., 2019; Koš et al., 2001).”

      (3) In the text, there are two variations of the term FRMD8: 'FRMD8' and 'Frmd8'. It is best to standardize on one form throughout the document.

      We apologize for any confusion. The terms "FRMD8" and "Frmd8" are used to indicate proteins derived from human and mouse, respectively.

      Results

      (4) In Figure 2L, there is no noticeable difference in the expression levels of Pgr and Esr1 between the Cre+ tumor and Cre- tumor groups. Figure S2E is more suitable for inclusion in the main text compared to Figure 2L.

      Thank you for this suggestion. ERα and PR are positive in early-stage mammary tumors of MMTV-PyMT mice, while ERα and PR are gradually lost as the tumor progresses. In figure 2, mammary tumors from 4-month-old MMTV-PyMT mice were subjected to scRNA-seq analysis. Since the expression of ERα was very low in tumor cells at this time, there appears to be no difference between the two groups. We have exchanged Figure 2L and Figure S2E in the manuscript.

      (5) The CNV score can be used to assess the malignancy of cells, it would be better to compare the malignancy levels between the two groups.

      This is a very good suggestion. However, copy number variations usually occur randomly and have a high degree of heterogeneity. Due to the limited sample size in our study, we did not compare the difference between the two groups.

      (6) Enrichment analysis is crucial for single-cell sequencing studies. It is recommended to perform differential gene analysis and enrichment analysis between the Cre+ and Cre- groups to further explore the impact of FRMD8 deficiency on the functions of malignant cells.

      Thank you for your suggestion. We have performed differential gene analysis and biological process enrichment analysis on the scRNA-seq results using the Gene Ontology (GO) database. Our results showed that upregulated genes in luminal progenitor (Lp) epithelial cells were enriched in epithelial cell proliferation and transmembrane receptor protein serine/threonine kinase signaling pathways, suggesting that Frmd8 deficiency significantly promotes epithelial cell proliferation in MMTV-PyMT mice.

      Author response image 1.

      (7) The coherent logic in lines 300 to 308 should be that FRMD8 is expressed at higher levels in normal Hsd epithelial cells in mice, hence further verification was conducted to examine the expression levels of FRMD8 in various human breast cancer cell lines.

      We have revised the figures and text as suggested.  

      Discussion

      (8) In lines 352 to 360, the background narrative in the first half seems to have little connection with the research findings in the second half; it is suggested to reorganize the language of this section.

      Thank you for the advice. We have rewritten this paragraph in the manuscript:

      “In MMTV-PyMT mice, early-stage mammary tumors express ERα and PR, but these receptors are gradually lost as the tumor progresses (Lapidus et al., 1998). Our scRNA-seq results revealed that mammary tumor epithelial cells in MMTV-PyMT mice fall into four clusters, with only Hsd epithelial cells showing ERα and PR expression. Additionally, Hsd epithelial cells exhibited the lowest CNV score, indicating a closer resemblance to normal epithelial cells. The loss of Frmd8 reduced the proportion of Hsd epithelial cells and led to a downregulation of ERα and PR expression, implying that Frmd8 deficiency promotes the loss of luminal features in the mammary gland and accelerates mammary tumor progression.”

      (9) As stated in the result section, the depletion of FRMD8 may lead to the decrease of the Hsd epithelial cells proportion, it might be beneficial to discuss the significance of this finding.

      We have added a discussion of the Hsd epithelial cell proportion in the third paragraph of this section (please refer to our response to comment (8) above).

      Figures

      (10) The structural layout of Figure 4 should be reorganized to make it more aesthetically pleasing.

      Thank you for this suggestion. We have rearranged Figure 4 as suggested.

    1. eLife Assessment

      This study presents valuable findings on the control of survival and maintenance of a specific set of brain resident immune cells. The authors generate a new animal model to enable sophisticated analysis of cell function in vivo. The sophisticated knock-in/knock-out alleles are compelling, although the work would ultimately be strengthened with further mechanistic analyses.